elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

If _id field is an object, no error is thrown but doc is "unsearchable" #3517

Status: Closed (polyfractal closed 8 years ago)

polyfractal commented 11 years ago

Expected Behavior

Normally, if you try to index a document without an ID in the URI (e.g. a POST) but with an _id field in the document (and no explicit _id path mapping), Elasticsearch throws an error because the autogenerated ID does not match the provided _id field:

curl -XDELETE localhost:9200/testindex
curl -XPUT localhost:9200/testindex
curl -XPOST localhost:9200/testindex/testtype?pretty -d '{"_id":"polyfractal","key":"value"}}}'
{
  "error" : "MapperParsingException[failed to parse [_id]]; nested: MapperParsingException[Provided id [O-kIgieVTRG9DpxHML7LkA] does not match the content one [polyfractal]]; ",
  "status" : 400
}

Broken Behavior

However, if the _id field happens to be an object, Elasticsearch happily indexes the document:

curl -XDELETE localhost:9200/testindex
curl -XPUT localhost:9200/testindex
curl -XPOST "localhost:9200/testindex/testtype" -d '{"key":"value"}'
curl -XPOST "localhost:9200/testindex/testtype" -d '{"_id":{"name":"polyfractal"},"key":"value"}}}'
{"ok":true,"_index":"testindex","_type":"testtype","_id":"b2xEPk5tTfC-RLsCb1ZapA","_version":1}
{"ok":true,"_index":"testindex","_type":"testtype","_id":"BsTbRqaeTrKLIe0JoeHsWw","_version":1}

You can GET it:

curl -XGET localhost:9200/testindex/testtype/BsTbRqaeTrKLIe0JoeHsWw?pretty
{
  "_index" : "testindex",
  "_type" : "testtype",
  "_id" : "BsTbRqaeTrKLIe0JoeHsWw",
  "_version" : 1,
  "exists" : true, "_source" : {"_id":{"name":"polyfractal"},"key":"value"}}}
}

It shows up with a match_all query:

curl -XGET localhost:9200/testindex/testtype/_search?pretty -d '{"query":{"match_all":{}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "BsTbRqaeTrKLIe0JoeHsWw",
      "_score" : 1.0, "_source" : {"_id":{"name":"polyfractal"},"key":"value"}}}
    }, {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "b2xEPk5tTfC-RLsCb1ZapA",
      "_score" : 1.0, "_source" : {"key":"value"}
    } ]
  }
}

But it doesn't show up when you search for exact values (or with a match query, or any other search):

curl -XGET localhost:9200/testindex/testtype/_search?pretty -d '{"query":{"term":{"key":"value"}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "b2xEPk5tTfC-RLsCb1ZapA",
      "_score" : 0.30685282, "_source" : {"key":"value"}
    } ]
  }
}

If you ask ES why it doesn't show up, it says there are no matching terms:

curl -XGET localhost:9200/testindex/testtype/BsTbRqaeTrKLIe0JoeHsWw/_explain?pretty -d '{"query":{"term":{"key":"value"}}}'
{
  "ok" : true,
  "_index" : "testindex",
  "_type" : "testtype",
  "_id" : "BsTbRqaeTrKLIe0JoeHsWw",
  "matched" : false,
  "explanation" : {
    "value" : 0.0,
    "description" : "no matching term"
  }
}

And finally, as a fun twist, you can set an explicit mapping to look inside the _id object. The ID handling then works: the appropriate ID is extracted, and the document is GETable, shows up in match_all, etc. Search is still broken.

curl -XDELETE localhost:9200/testindex
curl -XPUT localhost:9200/testindex -d '{
   "mappings":{
      "testtype":{
         "_id" : {
           "path" : "_id.name"
         },
         "properties":{
            "_id":{
               "type":"object",
               "properties":{
                  "name":{
                     "type":"string"
                  }
               }
            }
         }
      }
   }
}'

curl -XPOST "localhost:9200/testindex/testtype" -d '{"key":"value"}'
curl -XPOST "localhost:9200/testindex/testtype" -d '{"_id":{"name":"polyfractal"},"key":"value"}}}'
curl -XGET localhost:9200/testindex/testtype/polyfractal?pretty
{
  "_index" : "testindex",
  "_type" : "testtype",
  "_id" : "polyfractal",
  "_version" : 1,
  "exists" : true, "_source" : {"_id":{"name":"polyfractal"},"key":"value"}}}
}
curl -XGET localhost:9200/testindex/testtype/_search?pretty -d '{"query":{"match_all":{}}}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "wsT9vaevTCW5EuKyr7nmUw",
      "_score" : 1.0, "_source" : {"key":"value"}
    }, {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "polyfractal",
      "_score" : 1.0, "_source" : {"_id":{"name":"polyfractal"},"key":"value"}}}
    } ]
  }
}
curl -XGET localhost:9200/testindex/testtype/_search?pretty -d '{"query":{"term":{"key":"value"}}}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "wsT9vaevTCW5EuKyr7nmUw",
      "_score" : 0.30685282, "_source" : {"key":"value"}
    } ]
  }
}
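
As an illustration of what the `"path" : "_id.name"` setting is asked to do, here is a minimal Python sketch of dotted-path extraction from a document body. It is only a toy model of the idea, not Elasticsearch's implementation; the function name `extract_by_path` is hypothetical.

```python
from typing import Any, Optional


def extract_by_path(doc: dict, path: str) -> Optional[Any]:
    """Walk a dotted path such as "_id.name" through nested dicts.

    Hypothetical helper for illustration only; returns None when any
    segment of the path is missing or is not an object.
    """
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


if __name__ == "__main__":
    doc = {"_id": {"name": "polyfractal"}, "key": "value"}
    print(extract_by_path(doc, "_id.name"))
```

For the second document indexed above this yields the string `polyfractal`, which is the ID the GET request uses; for the first document (no `_id` field) it yields `None`, the case in which an ID is autogenerated.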

Reference

This was surfaced by Scott on the mailing list.

GlenRSmith commented 9 years ago

It's a little bit more fun than that, even: you actually get partial indexing!

curl -XDELETE localhost:9200/testindex
curl -XPUT localhost:9200/testindex
curl -XPOST localhost:9200/testindex/testtype -d '{"leftkey":"value","_id":{"name":"polyfractal"},"rightkey":"value"}}}'
curl -XPOST localhost:9200/_flush

Now search on the field before the _id:

curl -XGET localhost:9200/testindex/testtype/_search?pretty -d '{"query":{"term":{"leftkey":"value"}}}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "testindex",
      "_type" : "testtype",
      "_id" : "PalIN5CpSPKkGbhs4qNqaw",
      "_score" : 0.30685282, "_source" : {"leftkey":"value","_id":{"name":"polyfractal"},"rightkey":"value"}}}
    } ]
  }
}

There you go. But search on the field after the _id:

curl -XGET localhost:9200/testindex/testtype/_search?pretty -d '{"query":{"term":{"rightkey":"value"}}}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

And you get nothing.
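
Since the fields that go missing appear to be the ones that follow the object-valued _id in the body, a client could flag at-risk documents before sending them. The Python sketch below does this for a well-formed copy of the document body used above. The helper name `fields_after_object_id` is hypothetical, and the "everything after _id is dropped" rule is an assumption drawn from this one observation, not from the Elasticsearch source.

```python
import json
from typing import List


def fields_after_object_id(raw: str) -> List[str]:
    """Return the top-level keys that follow an object-valued "_id".

    Assumption: per the behaviour reported in this thread, these are the
    fields that may be silently left unindexed.
    """
    doc = json.loads(raw)  # dicts preserve key order (Python 3.7+)
    if not isinstance(doc, dict) or not isinstance(doc.get("_id"), dict):
        return []
    keys = list(doc)
    return keys[keys.index("_id") + 1:]


if __name__ == "__main__":
    raw = '{"leftkey":"value","_id":{"name":"polyfractal"},"rightkey":"value"}'
    print(fields_after_object_id(raw))
```

An empty list means either there is no `_id` in the body or it is not an object, which are the cases this thread does not report as lossy.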

andreaskern commented 9 years ago

I am affected by this behavior too. Mongo outputs the field like this:

{ "_id":{"$oid":"54d9e3bf30320c3335017e69"}, "@timestamp":"..."}

Actually, I do not care about the "_id" field, but I do care about the "@timestamp" field, which is silently not indexed. Here is an example that shows the behavior: https://gist.github.com/andreaskern/01d1d292f7f146186ee5
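
One possible client-side workaround for Mongo-style exports is to move the `_id` value to a differently named field before indexing, so the remaining fields are not affected. This is a hedged Python sketch, not something proposed in this thread; the function name `move_mongo_id` and the target field name `mongo_id` are assumptions chosen for illustration.

```python
def move_mongo_id(doc: dict, target: str = "mongo_id") -> dict:
    """Return a copy of doc with "_id" moved to a non-reserved field name.

    Hypothetical helper: a Mongo extended-JSON id such as {"$oid": "..."}
    is flattened to its string value; any other "_id" value is moved as-is.
    """
    out = dict(doc)  # shallow copy, the caller's document is not modified
    raw_id = out.pop("_id", None)
    if isinstance(raw_id, dict) and "$oid" in raw_id:
        out[target] = raw_id["$oid"]
    elif raw_id is not None:
        out[target] = raw_id
    return out


if __name__ == "__main__":
    exported = {"_id": {"$oid": "54d9e3bf30320c3335017e69"}, "@timestamp": "..."}
    print(move_mongo_id(exported))
```

The cleaned document no longer carries a top-level `_id`, so whether "@timestamp" gets indexed no longer depends on how the `_id` object is handled.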

clintongormley commented 9 years ago

In 2.0, the timestamp field would now be indexed correctly, as would _id.$oid. Wondering if we should allow users to index an _id field inside the body at all? /cc @rjernst

rjernst commented 9 years ago

The ability to specify _id within a document has already been removed for 2.0+ indexes.

clintongormley commented 9 years ago

@rjernst you removed the ability to specify the main doc _id in the body, but if the body contains an _id field then it creates a field called _id in the mapping, which can't be queried.

What I'm asking is: should we just ignore the fact that this field is not accessible (as we do in master today) or should we actually throw an exception? I'm leaning towards ignoring, as users don't always have control over the docs they receive.

rjernst commented 9 years ago

I would be in favor of throwing an exception. This would only be for 2.0+ indexes, and it is really just field name validation (disallowing fields colliding with meta fields). The mechanism would be the same: a user would not be able to explicitly add a field _id in the properties for a document type.
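
To make the "field name validation" idea concrete, here is a rough Python sketch of rejecting top-level field names that collide with meta fields. It is a client-side illustration only, not the Elasticsearch implementation; `META_FIELDS` is an illustrative subset assumed for this example, and `ReservedFieldError` and `validate_field_names` are hypothetical names.

```python
# Illustrative subset of meta field names (an assumption, not a complete list).
META_FIELDS = frozenset({"_id", "_type", "_index", "_source", "_uid", "_routing"})


class ReservedFieldError(ValueError):
    """Raised when a document field name collides with a meta field."""


def validate_field_names(doc: dict) -> None:
    """Reject top-level field names that collide with meta fields.

    Only top-level keys are checked in this sketch.
    """
    clashes = sorted(name for name in doc if name in META_FIELDS)
    if clashes:
        raise ReservedFieldError(
            "field name(s) collide with meta fields: " + ", ".join(clashes)
        )


if __name__ == "__main__":
    validate_field_names({"key": "value"})  # passes silently
    try:
        validate_field_names({"_id": {"name": "polyfractal"}, "key": "value"})
    except ReservedFieldError as err:
        print(err)
```

Raising instead of silently skipping the field is the trade-off under discussion here: it surfaces the problem immediately, but it also rejects documents whose shape the sender may not control.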

clintongormley commented 9 years ago

@rjernst it's a tricky one. E.g. Mongo adds { "_id": { "$oid": "...." }}, so actually the _id.$oid field IS queryable... should this still throw an exception?

rjernst commented 9 years ago

IMO, yes.

rjernst commented 9 years ago

With #8871, I don't think that would work, because _id is both a field mapper (the real meta field), and an object mapper.

clintongormley commented 9 years ago

@rjernst yep, makes sense

clintongormley commented 9 years ago

@rjernst this still works, even with #8871 merged in

clintongormley commented 8 years ago

Closed by #14003