elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Partial_fields should not reorder document properties #11160

Closed abibell closed 9 years ago

abibell commented 9 years ago

When using partial fields, the document's properties are reordered. We don't want that.

Steps to replicate

  1. Suppose the stored document is
{ "first": "abcd", "second": "xyz"}
POST _search
{
   "partial_fields" : {
        "_source" : {
            "exclude" : "field1"
        }
    }
}

Expected

The document's property order is preserved as stored, after excluding or including properties as per the query.

{ "first": "abcd", "second": "xyz"}

Actual

Document properties are jumbled

{ "second": "xyz", "first": "abcd" }

Notes

We use documents like this:

{
               "header": {
                  "identifier": "oai:nla.gov.au:nla.map-edlc1-10",
                  "datestamp": "2012-11-26T04:17:25Z",
                  "setSpec": "Map",
                  "status": null
               },
               "title": "Some title",
               "creator": [
                  "Sally Smith"
               ],
               "subject": [
                  "Geological",
                  "Geological cross sections -- France",
                  "Charts"
               ],
               "description": "Section in Noeux",
               "publisher": [
                  "1920"
               ],
               "contributor": [
                  "John Smith",
                  "Sally Smith"
               ],
               "date": "1977-01-01",
               "type": [
                  "Image"
               ],
               "format": "2 ms. sections on 1 sheet : col. ; sheet 41.8 x 34.3 cm.",
               "identifier": "http://nla.gov.au/nla.map-edlc11-10",
               "source": "Item held by National Library of Australia",
               "language": ["English"],
               "relation": "Part of: [Collection of Edgeworth David's maps] [cartographic material].",
               "coverage": "1977-01-01"
            }
         }

clintongormley commented 9 years ago

Hi @abibell

Two things to note:

abibell commented 9 years ago

@clintongormley Elasticsearch is a great product. In fact, it is the best search product. The founders have done a great job.

When people say there is a problem, there most likely is one, though it can lie in their own usage. The human element makes it harder to see. Many things are man-made, including the definition of JSON. I need you to look at the problem objectively. Let's take another look at it.

We use Elasticsearch and store fulltext content in the document. We don't want to see it in results as it is too bulky. Currently we have source code that takes the data and strips out the fulltext content before sending it on. This logic should have been in Elasticsearch, and it is, but its behavior is not quite predictable.

I looked at the source code. It is complex, but it should be simple to retain the order. I will help fix the source code for you.
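
For reference, client-side stripping of the kind described above can keep the stored key order. The following is a minimal sketch in Python, not the poster's actual code: the helper `strip_fields` and the field name `fulltext` are hypothetical. It relies on `json.loads` building dicts in the order keys appear in the text, and on Python 3.7+ dicts preserving insertion order.

```python
import json


def strip_fields(source_json, excluded):
    """Parse a JSON object, drop the excluded top-level keys, keep the rest in order."""
    doc = json.loads(source_json)  # keys follow the order found in the text
    kept = {key: value for key, value in doc.items() if key not in excluded}
    return json.dumps(kept)


# Hypothetical stored document with a bulky "fulltext" field.
stored = '{"first": "abcd", "fulltext": "very long text", "second": "xyz"}'
stripped = strip_fields(stored, {"fulltext"})
print(stripped)
```

This only handles top-level keys; nested exclusions would need a recursive variant.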

clintongormley commented 9 years ago

> JSON definition doesn't say it is unstructured according to Wikipedia or the RFC specification. Regardless of what it says, why not make it better? Standard says JSON can be serialized and validated using XSD. If JSON was unstructured it would have failed.

The spec at http://json.org/ says:

> An object is an unordered set of name/value pairs.

The Wikipedia article at http://en.wikipedia.org/wiki/JSON#Data_types.2C_syntax_and_example says:

> Object: an unordered collection of name/value pairs where the names (also called keys) are strings.

Most languages use hash randomization to avoid hash collision attacks (see http://lemire.me/blog/archives/2012/01/17/use-random-hashing-if-you-care-about-security/ ) so it is quite likely that deserializing then reserializing a JSON object will result in a different key order. This is by design.

Also see http://stackoverflow.com/a/4515863/819598 about why you shouldn't depend on order.
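
In practice this means consumers should compare objects by content rather than by key order. A small Python sketch using the two documents from the report (the variable names are illustrative only):

```python
import json

stored = json.loads('{"first": "abcd", "second": "xyz"}')
returned = json.loads('{"second": "xyz", "first": "abcd"}')

# Parsed objects compare equal regardless of key order.
same_content = stored == returned

# If a stable textual form is needed, sort the keys explicitly.
canonical = json.dumps(returned, sort_keys=True)
print(same_content, canonical)
```

Sorting keys gives a deterministic order, not the original stored order.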

abibell commented 9 years ago

I don't want this to be a "who is more technically competent" match. We can simply provide facts invalidating each other, continue to say wrong source, unrelated, wrong interpretation, and maybe wonder why the other person is not able to see the point. I don't take everything that is written as truth. To me technology is born to solve problems and not to be a bottleneck.

All I am saying is that JSON structure was preserved in all languages I have used. It will make our lives better if we address the order. There is no issue if we preserve the order. The random hashing you mentioned is only implemented by 3 of 3000+ languages, and only recently. Plus, randomised hashing is not going to make any difference to order preservation. If we repeatedly serialize and deserialize, the hash will change, which is wrong because it is the same object. It's not a problem of randomised hashing. It is the serialization process.

jpountz commented 9 years ago

I'm all for user friendliness, but enforcing order has a cost. Not only in terms of complexity, because we would need to ensure that all JSON manipulations that we perform maintain order, but also in terms of efficiency. For instance, lots of people would like Elasticsearch to be more space-efficient, and having the ability to reorder fields could be very useful for compression (e.g. by grouping together fields that have the same type to record their type only once, putting similar fields close to each other to help LZ77 compression, etc.).

I agree with the Stack Overflow answer that if you need order, the right data structure is a JSON list, not a hash.
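
One way to follow that advice is to store ordered properties as a list of name/value pairs, since JSON arrays are ordered. A sketch in Python, assuming hypothetical helpers `to_ordered_pairs` and `from_ordered_pairs` and a made-up `name`/`value` layout (nothing here is prescribed by Elasticsearch):

```python
import json


def to_ordered_pairs(pairs):
    """Encode (name, value) tuples as a JSON-friendly list, which keeps its order."""
    return [{"name": name, "value": value} for name, value in pairs]


def from_ordered_pairs(items):
    """Decode the list layout back into (name, value) tuples in the same order."""
    return [(item["name"], item["value"]) for item in items]


encoded = json.dumps(to_ordered_pairs([("first", "abcd"), ("second", "xyz")]))
decoded = from_ordered_pairs(json.loads(encoded))
print(decoded)
```

The order now lives in the array, so it survives any reordering of keys inside each object.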

abibell commented 9 years ago

Ok. What I didn't know is that we have _source_include & _source_exclude, which will solve the problem of removing our fulltext from Elasticsearch results without custom code, saving network transfers. This does have the problem of JSON reordering. Oh no.
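
A request using those parameters might be built as follows. This is a sketch only: the host `http://localhost:9200`, the index name `maps`, the field name `fulltext` and the helper `search_url` are all assumptions for illustration, and the code just constructs the URL without contacting a cluster.

```python
from urllib.parse import urlencode


def search_url(base, index, exclude):
    """Build a search URL that asks for the listed fields to be left out of _source."""
    query = urlencode({"_source_exclude": ",".join(exclude)})
    return f"{base}/{index}/_search?{query}"


# Hypothetical host, index and field name.
url = search_url("http://localhost:9200", "maps", ["fulltext"])
print(url)
```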