Case insensitive field names

niemyjski commented 3 years ago

I've searched the issues and community forum but couldn't really find any requests or issues talking about this.

I'd love for field names to be case insensitive. This would enable more scenarios such as source includes (take a list of fields to include from an API request and have it just work). It would probably also save people a lot of time tracking down other issues.

POST test-v1/_doc/test
{
  "test": "abc",
  "Test": "abcd",
  "tEst": "abcde"
}

GET /test-v1/_mapping
{
  "test-v1" : {
    "mappings" : {
      "properties" : {
        "Test" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "tEst" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "test" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

For example, we have a query parser (https://github.com/FoundatioFx/Foundatio.Parsers) and we can resolve any mapped field to the correct case, but cannot with unmapped fields.
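To illustrate the kind of resolution described above, here is a minimal sketch of looking up a user-supplied field name against the `properties` object returned by `GET /<index>/_mapping`. The function names `build_case_index` and `resolve_field` are hypothetical and this is not how Foundatio.Parsers implements it. It also shows the limitation: an unmapped field has nothing to resolve against.

```python
from typing import Dict, List, Optional


def build_case_index(properties: Dict[str, dict]) -> Dict[str, List[str]]:
    """Group mapped field names by their lowercase form (hypothetical helper)."""
    index: Dict[str, List[str]] = {}
    for name in properties:
        index.setdefault(name.lower(), []).append(name)
    return index


def resolve_field(requested: str, index: Dict[str, List[str]]) -> Optional[str]:
    """Return the mapped spelling of `requested`, or None if it is unmapped.

    Raises ValueError when several mapped fields differ only by case
    and none of them matches `requested` exactly.
    """
    candidates = index.get(requested.lower(), [])
    if not candidates:
        return None  # unmapped: the correct case cannot be known
    if requested in candidates:
        return requested  # an exact match always wins
    if len(candidates) == 1:
        return candidates[0]
    raise ValueError(f"{requested!r} is ambiguous: {sorted(candidates)}")


# Top-level field names taken from the mapping shown above.
index = build_case_index({"Test": {}, "tEst": {}, "test": {}})
print(resolve_field("test", index))     # exact match exists
print(resolve_field("missing", index))  # unmapped field
```

With the three-field mapping above, any spelling other than the three exact ones (for example `TEST`) is ambiguous, which is the situation case-insensitive field names would avoid.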

Also, I'd really love to know why field names are still case sensitive (I understand that JSON is case sensitive) and how there hasn't been a breaking change to change the field name behavior. One could log a warning when multiple field names differing only by case are present, or just index the first field (and discard any extras with the same name?). I could see the document having different cases of a field and that's ok, it would share a single mapping. This would also help prevent field explosions and make querying and using the Elastic API easier.

Databases have a variety of sensitivities. SQL, by default, is case insensitive to identifiers and keywords, but case sensitive to data. JSON is case sensitive to both field names and data. https://blog.couchbase.com/json-case-sensitive-insensitive-search-index-data/#:~:text=SQL%2C%20by%20default%2C%20is%20case,both%20field%20names%20and%20data.

markharwood commented 3 years ago

Thanks for the comments.

Also, I'd really love to know why field names are still case sensitive (I understand that JSON is case sensitive)

Unfortunately, I think that's the answer. We're built on JSON and its behaviour is something we can't change. As far as I can tell, MongoDB is the same in this regard.

and how there hasn't been a breaking change to change the field name behavior

Any breaking change has to reach a certain level of importance for it to be considered. The importance can be measured by things like:
1) The number of people calling for the change
2) The lack of any good workarounds in the status quo
3) Our ability to migrate cleanly (e.g. having old indices and new indices co-exist under new software)

By the above measures:
1) I think this is the first time we've had this issue logged
2) Clients can normalise data in their client code or using ingest pipelines (although field names in queries and aggregations have no equivalent of ingest pipelines to rewrite them)
3) I imagine it will be very difficult to provide software that allows a cluster to run a mix of old and new indices. Also, the alternative of asking customers to reindex all historical data is typically a no-no
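As a rough illustration of the client-side normalisation mentioned in point 2, the sketch below lowercases every key in a document before it is sent for indexing and logs a warning when two keys collide, keeping the first one. The function name `normalize_keys`, the logger name, and the "first spelling wins" policy are assumptions made for this example, not Elasticsearch behaviour.

```python
import logging
from typing import Any

logger = logging.getLogger("field_normalizer")  # hypothetical logger name


def normalize_keys(value: Any) -> Any:
    """Recursively lowercase dict keys; on a collision the first key wins."""
    if isinstance(value, dict):
        result = {}
        for key, child in value.items():
            lowered = key.lower()
            if lowered in result:
                # Assumed policy: keep the first value, drop later duplicates.
                logger.warning("dropping field %r: it collides with %r", key, lowered)
                continue
            result[lowered] = normalize_keys(child)
        return result
    if isinstance(value, list):
        return [normalize_keys(item) for item in value]
    return value  # scalars pass through untouched


# The example document from the top of this issue.
doc = {"test": "abc", "Test": "abcd", "tEst": "abcde"}
print(normalize_keys(doc))
```

Only field names are touched here, never values. The same lowercasing would also have to be applied to field names in queries and aggregations, which this sketch does not cover.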

I'll keep this issue open to see if it attracts any more interest but I wouldn't bank on this change happening anytime soon.

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

ejsmith commented 3 years ago

I've wondered this as well. It seems like insanity to me that you can create 2 fields with the same name and different casing.

markharwood commented 3 years ago

I discussed this with the team today and we agreed that, while this is a nice-to-have for some users, the impact of such a change is huge and it is not something we will realistically attempt in the foreseeable future. Closing, but will reopen if this ever changes.

Thanks for reaching out and sorry we're not able to help with this.

StingyJack commented 3 years ago

a nice-to-have for some users

@markharwood - Would any of you assign a different meaning to the word "DOG" if it were spelled "dog"? No, because it's still referring to Canis familiaris.

This problem manifests itself as duplicated data points, often with different values. Think that's rare? It happens every time I use NEST. I can create a mapping like this...

{
    "settings": {
        "analysis": {
            "normalizer": {
                "customSearchNormalizer": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "ListName": {
                "type": "keyword",
                "normalizer": "customSearchNormalizer"
            },
            "FieldNames": {
                "type": "keyword",
                "normalizer": "customSearchNormalizer"
            }          
        }
    }
}

... and verify that in Kibana, but when the first document is indexed, NEST decides to use some other casing for the field names and the mapping changes to this...

{
  "mappings": {
    "_doc": {
      "properties": {
        "FieldNames": {
          "type": "keyword",
          "normalizer": "customSearchNormalizer"
        },
        "ListName": {
          "type": "keyword",
          "normalizer": "customSearchNormalizer"
        },
        "fieldNames": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "listName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

... and then I can't find any documents, because I created the index using FieldNames and fieldNames has been added to that mapping, making it the place where the data is actually stored.

 "_source" : {
          "listName" : "MyList",
          "fieldNames" : [
            "OrgField1",
...

Yes, I know about .DefaultFieldNameInferrer(p => p), but the point is I shouldn't have to google around or remember to do that.

JSON is spec'ed wrong for the same reason. Nobody really wants to have two object properties with different cases and different values in a data payload. That's part of the recipe for a nightmare-level support and troubleshooting experience, and ES isn't required to jump into that pit just because JSON does.

MSSQL (or Sybase at the time) figured out 30+ years ago that users don't want to deal with differences in casing when they go look for their data, and that they don't want their data altered by forcing some normalization scheme in order to enable that searching. The scenario in your userbase where someone actually depends on having field names of different cases is going to be exceedingly rare if at all. Give an option to allow case-different duplicates if you think there are any users who need it, but please don't make the rest of us continue to suffer this problem.

markharwood commented 3 years ago

Would any of you assign a different meaning to the word "DOG" if it were spelled "dog"?

We're not compiling a dictionary here :) As you well know, some things in computing are case sensitive, e.g. the Unix file system, and the same questions over "usefulness" could be raised there. I happen to agree that case sensitivity is not generally useful in field names, but it is so firmly entrenched in so many deployments that we cannot simply flick a switch to change this.

where someone actually depends on having field names of different cases is going to be exceedingly rare if at all.

Trust me, someone out there somewhere is using field names to store hashes where a change in case would be catastrophic to them. As a result, we have to go through a complex procedure of introducing opt-in case-insensitivity flags, deprecation warnings and backward compatibility code for old clusters before flipping default behaviour etc. This migration effort for us and our users is what puts this firmly in the "high-hanging fruit" category and why we are not rushing to fix right now.

StingyJack commented 3 years ago

Are you sure you aren't compiling a dictionary somewhere? At least as a test case?

Thank you for hearing my complaints.