AlexIoannides / elasticsearchr

Lightweight Elasticsearch client for R.
https://alexioannides.com/2016/11/28/elasticsearchr-a-lightweight-elasticsearch-client-for-r/

Error in search query: rbind numbers of columns of arguments do not match #28

Closed andrewkho closed 7 years ago

andrewkho commented 7 years ago

When using the %search% operator, the method fails with the following error message:

............Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match

elasticsearchr was installed from CRAN.

I have confirmed the query works with the elastic package. I suspect the documents have differing numbers of fields and perhaps this is causing the issue.

Or, on the other hand, it could be because of the Elasticsearch version: 5.5.1.

AlexIoannides commented 7 years ago

Hi Andrew,

Are you able to provide/create an example dataset that reproduces the error you're seeing?

I don't see why ES v5.5.1 should be an issue. Likewise, a differing number of fields shouldn't matter, as missing/empty fields in an index should come through as NA.

Like I said, if you can reproduce the error for me that would be an enormous help.

Alex

andrewkho commented 7 years ago

Hi Alex, I'll try to create a minimal reproducible example, hopefully soon.

AlexIoannides commented 7 years ago

Thanks Andrew.

AlexIoannides commented 7 years ago

@andrewkho Any luck with that example or can I close this issue?

andrewkho commented 7 years ago

Sorry, I haven't been able to make a small example. It is still an issue; however, I am working around it by using the plain "elastic" package, for which I wrote a simple DSL that does the job, so unfortunately I am not using elasticsearchr.

AlexIoannides commented 7 years ago

Thanks for getting back to me. In the absence of an example to debug, I'm going to close this issue. I'll re-open it if I run into anything that sounds similar.

jwarnes commented 6 years ago

I am also having this issue while importing a dataset of ~56k rows from elasticsearch

hatdropper1977 commented 6 years ago

@andrewkho and @jwarnes, I ran into similar problems due to the nature of trying to wrangle nested lists into a data frame. This blog post really helps: http://zevross.com/blog/2015/02/12/using-r-to-download-and-parse-json-an-example-using-data-from-an-open-data-portal/

AlexIoannides commented 6 years ago

@jwarnes @hatdropper1977 Can either of you provide me with an example document or two, that I can ingest into Elasticsearch, to use for debugging?

Hypothetically, nested data frames shouldn't be an issue, as they ought to be 'flattened' using the flatten function from jsonlite.
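
For reference, here is a minimal sketch of the flattening I have in mind, using jsonlite. The documents and field names are illustrative only, not taken from anyone's index:

library(jsonlite)

# Two toy documents with a nested 'user' object (illustrative only).
hits <- fromJSON('[
  {"msg": "hello", "user": {"name": "alice", "age": 30}},
  {"msg": "world", "user": {"name": "bob",   "age": 25}}
]')

str(hits)       # 'user' comes back as a nested data frame column
flatten(hits)   # nested columns become 'user.name' and 'user.age'
#     msg user.name user.age
# 1 hello     alice       30
# 2 world       bob       25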

hatdropper1977 commented 6 years ago

@AlexIoannides I can't give you my data because they include sensitive information.

Maybe these will work? https://download.elastic.co/demos/kibana/gettingstarted/logs.jsonl.gz

I did not have any luck w/ flatten, but ymmv.

AlexIoannides commented 6 years ago

Are you not able to create an artificial record that replicates the error you're observing?

I'm sorry, but I don't have time for trial and error. If you can give me something to pin a target on, however, I will make the time to take a look.

hatdropper1977 commented 6 years ago

Ok - it may be a while though. @jwarnes do you have example data?

hatdropper1977 commented 6 years ago

Not at the moment.


hatdropper1977 commented 6 years ago

To clarify - I have no issue with the %search% command. It returns a data frame which includes columns of nested lists. The issue with my data is that it includes arbitrary levels and sub-levels of lists (each of which may or may not have a consistent length), so a simple 'flatten' does not work on it. I use the technique in the blog I linked above to pull the data I want.

Sorry for the confusion.

Once again, ElasticsearchR does its job very well.

If this helps, here is an example aggs query that works with ElasticsearchR (in that it successfully returns a data frame - 100k plus rows without issue).

match_all_query <- query('{
  "match_all": {}
}')

date_hist_w_servers <- aggs('{
  "docs_over_time": {
    "date_histogram": {
      "field": "@timestamp",
      "interval": "1m"
    },
    "aggs": {
      "servers": {
        "terms": { "field": "beat.hostname.keyword", "size": 20 },
        "aggs": {
          "the_max": { "max": { "field": "system.memory.free" } }
        }
      }
    }
  }
}')

df <- elastic(ELASTIC_API, ELASTIC_INDEX_NAME) %search% (match_all_query + date_hist_w_servers)

I just need to apply my own logic to flattening it, because it returns inconsistent nested lists (in terms of length and further sub-nests).
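
To illustrate, here is a self-contained sketch of the kind of custom flattening I mean. The bucket structure and field names below are made up to mimic a date_histogram + terms result, not my real data:

# Toy aggregation buckets with uneven numbers of nested entries (illustrative only).
buckets <- list(
  list(key = "2018-05-23T16:00:00", servers = list(
    list(host = "web-1", max_free = 1024),
    list(host = "web-2", max_free = 2048)
  )),
  list(key = "2018-05-23T16:01:00", servers = list(  # fewer sub-entries here
    list(host = "web-1", max_free = 900)
  ))
)

# Turn one bucket into a small data frame, one row per nested server entry.
flatten_bucket <- function(b) {
  data.frame(
    timestamp = b$key,
    host      = vapply(b$servers, `[[`, character(1), "host"),
    max_free  = vapply(b$servers, `[[`, numeric(1), "max_free"),
    stringsAsFactors = FALSE
  )
}

# Bind the per-bucket frames; each has the same columns, so rbind is safe here.
flat <- do.call(rbind, lapply(buckets, flatten_bucket))
flat
#             timestamp  host max_free
# 1 2018-05-23T16:00:00 web-1     1024
# 2 2018-05-23T16:00:00 web-2     2048
# 3 2018-05-23T16:01:00 web-1      900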

AlexIoannides commented 6 years ago

Thanks John.

How does this relate (if at all) to the original error, ............Error in rbind(deparse.level, ...) ?

I'm struggling to 'join the dots' here.

MonaxGT commented 5 years ago

Hi! Today I got exactly this error when trying to download data from Elasticsearch. My index doesn't contain identical docs.

> elastic("http://elasticsearch:9200", "logs", "doc") %search% for_everything
...................
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

Should I use one index per type of doc, with similar fields?

AlexIoannides commented 5 years ago

Hi @MonaxGT,

Your index may not have identical docs, but that shouldn't (in theory) be a problem as all the interesting fields will have the same type.

To help me debug this, could you please give a few example docs from logs/doc (that are not 'identical')?

1beb commented 5 years ago

Hi Alex,

Also having this problem. The issue is that rbind requires all of the bound data frames to have the same columns, which fails any time sparse data is held in the index. There are two issues here that I think are problematic:

  1. Indexes with sparse data will fail. A toy example reproducing this error would be something like:

l = list()
l$record1 = data.frame(a=NA, b=NA, c=NA)
l$record2 = data.frame(a=NA, b=NA) # no data for col `c`

do.call(rbind, l) # line 404 of utils.R is the only instance of do.call(rbind, list)
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

  2. If you use elasticsearchr to create an index, when you %index% an item that has an NA, that field does not get pushed for the record. So future %search% actions on this index could lead to the do.call(rbind, list) error.

The code that manages this is here: https://github.com/AlexIoannides/elasticsearchr/blob/77ccadcc2a14fc834e0233b0c3bbc5496d7c90b7/R/utils.R#L404

Possible solutions.

  1. Use dplyr::bind_rows:

bind_rows(l)
   a  b  c
1 NA NA NA
2 NA NA NA

  2. Use data.table::rbindlist(l, fill = TRUE):

data.table::rbindlist(l, fill=TRUE)
    a  b  c
1: NA NA NA
2: NA NA NA

I'm going to submit a PR using the dplyr solution, as per solution 1 above.
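
For illustration, here is a sketch of the kind of one-line substitution such a PR would make at that do.call(rbind, ...) call site; docs_list is a stand-in for whatever variable utils.R actually binds, not the real name:

# Hypothetical stand-in for the list of per-document data frames built in utils.R.
docs_list <- list(
  data.frame(a = 1, b = 2, c = 3),
  data.frame(a = 4, b = 5)  # sparse document: no field 'c'
)

# before: df <- do.call(rbind, docs_list)   # errors when columns differ
df <- dplyr::bind_rows(docs_list)           # fills the missing 'c' with NA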

AlexIoannides commented 5 years ago

This fix has been merged in #52 and submitted to CRAN as v0.3.1.