Improve fielddata mappings

clintongormley commented 9 years ago

Currently, the settings for fielddata and doc_values are quite confusing. It would be nice to make them easier to understand. For non-analyzed fields, these are the questions which need answering:

Should fielddata be written to disk at index time?
Should fielddata be available at search time, from disk, or in memory?
If in memory, should fielddata be loaded eagerly or lazily?
Regardless of disk or memory, should global ordinals be built eagerly or lazily?

This could be expressed as:

"fielddata": {
  "index_format":    "disk | disabled",
  "search_format":   "disk | memory | eager_memory | disabled",
  "global_ordinals": "lazy | eager"
}

In the same way as we can use analyzer to set both index_analyzer and search_analyzer, this could be condensed to:

"fielddata": {
  "format":          "disk | memory | eager_memory | disabled",
  "global_ordinals": "lazy | eager"
}

Analyzed string fields can't be written to disk, but they can support the fst format, so they would accept:

"fielddata": {
  "format":          "memory | eager_memory | fst | eager_fst | disabled",
  "global_ordinals": "lazy | eager"
}

We could possibly even support a very simple format for setting the fielddata format:

"fielddata": "disk | memory | eager_memory | disabled"            # not_analyzed
"fielddata": "memory | eager_memory | fst | eager_fst | disabled" # analyzed

... which would set global_ordinals to `lazy`.

Regardless of which format is used to set fielddata, the mappings would be converted to the full index_format, search_format, global_ordinals layout.
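
A minimal sketch of that conversion for not_analyzed fields, in Python. This proposal was never part of Elasticsearch, so the function name `normalize_fielddata` is hypothetical, and two behaviours are assumptions of the sketch rather than statements from the thread: that the condensed `format: disk` is the only value that implies writing at index time, and that the full form must supply both `index_format` and `search_format`.

```python
# Hypothetical sketch of the proposal above; not Elasticsearch code.
INDEX_FORMATS = {"disk", "disabled"}
SEARCH_FORMATS = {"disk", "memory", "eager_memory", "disabled"}  # not_analyzed fields
GLOBAL_ORDINALS = {"lazy", "eager"}


def normalize_fielddata(setting):
    """Expand the string or condensed form into the full three-key layout."""
    if isinstance(setting, str):
        # Simplest form: "fielddata": "disk"
        setting = {"format": setting}

    if "format" in setting:
        # Condensed form. Assumption: only "disk" implies writing at index time.
        search_format = setting["format"]
        index_format = "disk" if search_format == "disk" else "disabled"
    else:
        # Full form. Assumption: both keys must be present.
        index_format = setting["index_format"]
        search_format = setting["search_format"]

    # The simple string form implies "lazy"; the sketch assumes the same
    # default whenever global_ordinals is left out.
    global_ordinals = setting.get("global_ordinals", "lazy")

    if index_format not in INDEX_FORMATS:
        raise ValueError(f"invalid index_format: {index_format!r}")
    if search_format not in SEARCH_FORMATS:
        raise ValueError(f"invalid search_format: {search_format!r}")
    if global_ordinals not in GLOBAL_ORDINALS:
        raise ValueError(f"invalid global_ordinals: {global_ordinals!r}")

    return {
        "index_format": index_format,
        "search_format": search_format,
        "global_ordinals": global_ordinals,
    }
```

Under these assumptions, `normalize_fielddata("disk")` and `normalize_fielddata({"format": "disk"})` produce the same full layout, which is the point of the condensed forms.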

Question: Should index_format and search_format instead be store_format and load_format?

rmuir commented 9 years ago

some of these possibilities make no sense. Like writing docvalues at index-time, but then loading stuff up into field data.

rmuir commented 9 years ago

Also: "disk" is not a synonym for docvalues. It loads some things in memory. Just not everything, and not in a bloated way.

clintongormley commented 9 years ago

> some of these possibilities make no sense. Like writing docvalues at index-time, but then loading stuff up into field data.

This is a possibility which works today. It will allow users to write doc values, then test out whether loading a particular field into fielddata helps their use case or not. It just brings slightly more flexibility than having to reindex your data just to try things out.

> Also: "disk" is not a synonym for docvalues. It loads some things in memory. Just not everything, and not in a bloated way.

Sure, but the major distinction between the two implementations is: one is in memory and the other is largely on disk. It will be more understandable to users than referring to doc values.

rmuir commented 9 years ago

That's not really the distinction: again, it's the incorrect name.

Let's call it "scalable" and "non-scalable"? Fielddata is slightly faster because it uses bloated compression etc and makes the wrong tradeoffs.

rmuir commented 9 years ago

> This is a possibility which works today. It will allow users to write doc values, then test out whether loading a particular field into fielddata helps their use case or not. It just brings slightly more flexibility than having to reindex your data just to try things out.

no, it's just a trap.

terrible.

jpountz commented 9 years ago

> "fielddata": "disk | memory | eager_memory | disabled" # not_analyzed

Actually eager loading currently makes sense for doc values too since Lucene only loads them on the first time that they are used (like norms).

s1monw commented 9 years ago

I actually agree with Rob: the possibility of loading stuff into FD when you already have DocValues doesn't make too much sense. I'd rather trade the flexibility here for safety, since it can really spike your system big time if you suddenly go to FD.

clintongormley commented 9 years ago

One of the issues with naming is that you're limited to conveying just part of the explanation in the name itself. disk vs memory explains one part, scalable vs unscalable is another. @s1monw suggested trying to explain when you pay the price: at index time or at query time (with the other details explained in full in the documentation.) For instance:

"fielddata": {
  "build": "eager | lazy",  # eager = doc values, lazy = memory
  "load": "lazy   | eager | eager_global_ordinals"
}

This removes the fst format for not_analyzed strings, which is rarely used anyway.
Once build is set to eager, doc values will always be built (we can't turn them off)
I know @rmuir and @s1monw still disagree, but I'm still voting for allowing users to test out in-memory fielddata performance on an existing field by allowing them to set load: lazy on the fly (with build: eager), without forcing them to reindex all of their data

s1monw commented 9 years ago

I like the naming though... I personally think it's a trap to go back to fielddata from here but I don't have super strong feelings. We allow for a lot of traps to provide flexibility so it's not just black and white. I'd personally vote for safety here vs. trappiness. Maybe we can enable trappy mode and then you can do this? :)

eLBhogi commented 9 years ago

how can I define a lazy global ordinals mapping for a non-analyzed field?

clintongormley commented 9 years ago

> I personally think it's a trap to go back to fielddata from here

I've come around to agree with this sentiment. Once we build doc values for a field, we shouldn't allow switching back to using field data at that stage.

So I repeat the last recommendation:

"fielddata": {
  "build": "eager | lazy",  # eager = doc values, lazy = memory
  "load": "lazy   | eager | eager_global_ordinals"
}

with one distinction: if doc values are enabled (i.e. build: eager), then load: lazy | eager applies just to doc values, i.e. whether Lucene should open them immediately or on first use.

Any other suggestions on this change?

rmuir commented 9 years ago

I don't really like eager vs lazy to explain the differences here.

There are plenty of differences, e.g. using filesystem cache versus on-heap memory. e.g. actually doing bitpacking versus bloating values up to 8/16/32/64. So eager vs lazy doesn't really make sense to me.

Do we really need a fielddata setting at all? You can already set docvalues another way.

s1monw commented 9 years ago

@rmuir I am all for one way of setting this. My take on this is the following:

use doc_values : true|false to trigger DV, which is consistent with all the other FieldType settings we expose, i.e. indexed : true|false
remove the fst format and make it a hard choice to either use DV or FieldData at field creation time. Since today we have paged_bytes, fst, doc_values and I think if we get rid of the last two we can simplify this a lot?
only allow configuration of fielddata.global_ordinals : lazy|eager and only if the field has no DV
if folks wanna experiment they can use a separate field or we can add an explicit setting saying fielddata.doc_values: false|true that can be used to override the field setting if this helps to make progress here.

clintongormley commented 9 years ago

The motivation for this issue has changed somewhat since we switched to doc values by default. Initially, we were trying to make it easier (and less confusing) to use doc values. Now, with better defaults, users should only fiddle with these settings if they know what they're doing.

I suggest removing the doc_values setting altogether, and using just the fielddata settings as follows:

fielddata: {
    format:          "doc_values | heap",   
    load:            "lazy | eager | eager_global_ordinals | disabled"
}

The format defaults to doc_values for all field types except analyzed strings (and I believe geo-ip doesn't support doc values yet?). This setting should not be dynamically updatable.

The load setting defaults to lazy, so that we're not loading doc values for (potentially) very many fields automatically. Analyzed strings could potentially default the load setting to disabled so that drive-by users don't cause a massive fielddata load by eg sorting on the "name" field. It should be possible to change the load setting dynamically.
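
A rough Python sketch of the defaults and update rules described in this comment. The names `default_fielddata` and `update_fielddata` are hypothetical, and the `disabled` default for analyzed strings is only floated as a possibility above, so treat it as an assumption. The geo exception mentioned in passing is left out.

```python
# Hypothetical sketch of the format/load proposal; not Elasticsearch code.
LOAD_VALUES = {"lazy", "eager", "eager_global_ordinals", "disabled"}


def default_fielddata(analyzed_string=False):
    """Return the proposed default fielddata settings for a field."""
    if analyzed_string:
        # Analyzed strings cannot use doc values; "disabled" is the tentative
        # default suggested to protect against accidental fielddata loads.
        return {"format": "heap", "load": "disabled"}
    return {"format": "doc_values", "load": "lazy"}


def update_fielddata(current, changes):
    """Apply a settings update to an existing field."""
    # format is not dynamically updatable under the proposal
    if changes.get("format", current["format"]) != current["format"]:
        raise ValueError("format is fixed at field creation time")

    # load may be changed on the fly
    load = changes.get("load", current["load"])
    if load not in LOAD_VALUES:
        raise ValueError(f"invalid load: {load!r}")
    return {"format": current["format"], "load": load}
```

In this sketch an analyzed string field starts with `load: disabled`, and a user who really wants to sort on it would have to opt in with an explicit update such as `{"load": "lazy"}`.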

clintongormley commented 9 years ago

After chatting to @jpountz and @s1monw about this, we have a new proposal. Assuming doc values deliver on their promise, the long term intention will be to remove in-memory fielddata completely. (This assumes that we have a solution for analyzed string fields as well).

New proposal:

doc_values:       true | false              # defaults to true for all but analyzed string fields
fielddata:        disabled | lazy | eager   # only if doc_values is false, will go away eg in 3.0
global_ordinals:  lazy | eager              # defaults to lazy

Later we may be able to add the auto option to global_ordinals so that we can make a runtime decision based on usage about whether global ordinals should be built lazily or eagerly.
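
The three settings and their constraints could be checked along these lines. This is a sketch in Python with a hypothetical function name, `resolve_settings`. The proposal does not say what `fielddata` should default to when `doc_values` is false, so the sketch leaves it unset rather than guessing.

```python
# Hypothetical sketch of the new proposal; not Elasticsearch code.
FIELDDATA_VALUES = {"disabled", "lazy", "eager"}
GLOBAL_ORDINALS_VALUES = {"lazy", "eager"}


def resolve_settings(settings, analyzed_string=False):
    """Fill in the proposed defaults and reject invalid combinations."""
    # doc_values defaults to true for all but analyzed string fields
    doc_values = settings.get("doc_values", not analyzed_string)

    # global_ordinals defaults to lazy
    global_ordinals = settings.get("global_ordinals", "lazy")
    if global_ordinals not in GLOBAL_ORDINALS_VALUES:
        raise ValueError(f"invalid global_ordinals: {global_ordinals!r}")

    resolved = {"doc_values": doc_values, "global_ordinals": global_ordinals}

    if "fielddata" in settings:
        # fielddata is only meaningful when doc values are switched off
        if doc_values:
            raise ValueError("fielddata only applies when doc_values is false")
        if settings["fielddata"] not in FIELDDATA_VALUES:
            raise ValueError(f"invalid fielddata: {settings['fielddata']!r}")
        resolved["fielddata"] = settings["fielddata"]

    return resolved
```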

s1monw commented 9 years ago

+1 @rmuir WDYT

clintongormley commented 9 years ago

To clarify, global_ordinals: lazy means "build global ordinals only when they are required for a request", while eager means "build global ordinals on every refresh"
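
A toy Python model of the difference, to show when the cost is paid. Everything here is illustrative: `GlobalOrdinalsCache` and `build_global_ordinals` are hypothetical names, global ordinals are reduced to a dict from term to position across all segments, and the sketch assumes that a refresh invalidates whatever was built before.

```python
from itertools import chain


def build_global_ordinals(segments):
    """Map every distinct term across all segments to one shared ordinal."""
    terms = sorted(set(chain.from_iterable(segments)))
    return {term: ordinal for ordinal, term in enumerate(terms)}


class GlobalOrdinalsCache:
    """Builds global ordinals either on refresh (eager) or on first use (lazy)."""

    def __init__(self, mode, segments):
        if mode not in ("lazy", "eager"):
            raise ValueError(f"invalid mode: {mode!r}")
        self.mode = mode
        self.segments = segments
        self.build_count = 0  # how many times the expensive build has run
        self._ordinals = None

    def on_refresh(self):
        # Assumption: a refresh changes the segments, so old ordinals are stale.
        self._ordinals = None
        if self.mode == "eager":
            self._rebuild()  # pay the cost now, at refresh time

    def get(self):
        if self._ordinals is None:
            self._rebuild()  # lazy: pay the cost on the first request
        return self._ordinals

    def _rebuild(self):
        self._ordinals = build_global_ordinals(self.segments)
        self.build_count += 1
```

With `mode="eager"` the first request after a refresh finds the ordinals already built. With `mode="lazy"` a refresh costs nothing, but the first request that needs global ordinals has to wait for the build.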

clintongormley commented 9 years ago

Closing in favour of #12394

elastic / elasticsearch

Improve fielddata mappings #8693