Closed clintongormley closed 9 years ago
some of these possibilities make no sense. Like writing docvalues at index-time, but then loading stuff up into field data.
Also: "disk" is not a synonym for docvalues. It loads some things in memory. Just not everything, and not in a bloated way.
some of these possibilities make no sense. Like writing docvalues at index-time, but then loading stuff up into field data.
This is a possibility which works today. It will allow users to write doc values, then test out whether loading a particular field into fielddata helps their use case or not. It just brings slightly more flexibility than having to reindex your data just to try things out.
Also: "disk" is not a synonym for docvalues. It loads some things in memory. Just not everything, and not in a bloated way.
Sure, but the major distinction between the two implementations is: one is in memory and the other is largely on disk. It will be more understandable to users than referring to doc values.
That's not really the distinction: again, it's the incorrect name.
Let's call it "scalable" and "non-scalable"? Fielddata is slightly faster because it uses bloated compression etc. and makes the wrong tradeoffs.
This is a possibility which works today. It will allow users to write doc values, then test out whether loading a particular field into fielddata helps their use case or not. It just brings slightly more flexibility than having to reindex your data just to try things out.
No, it's just a trap.
terrible.
"fielddata": "disk | memory | eager_memory | disabled" # not_analyzed
Actually eager loading currently makes sense for doc values too since Lucene only loads them on the first time that they are used (like norms).
I actually agree with Rob: loading stuff into FD when you already have doc values doesn't make much sense. I'd rather trade the flexibility here for safety, since it can really spike your system big time if you suddenly go to FD.
One of the issues with naming is that you're limited to conveying just part of the explanation in the name itself. disk vs memory explains one part, scalable vs unscalable is another. @s1monw suggested trying to explain when you pay the price: at index time or at query time (with the other details explained in full in the documentation). For instance:
"fielddata": {
"build": "eager | lazy", # eager = doc values, lazy = memory
"load": "lazy | eager | eager_global_ordinals"
}
fst format for not_analyzed strings, which is rarely used anyway.
build is set to eager, doc values will always be built (we can't turn them off)
load: lazy on the fly (with build: eager), without forcing them to reindex all of their data
I like the naming though... I personally think it's a trap to go back to fielddata from here but I don't have super strong feelings. We allow for a lot of traps to provide flexibility, so it's not just black and white. I'd personally vote for safety here vs. trappiness. Maybe we can enable trappy mode and then you can do this? :)
How can I define a lazy global ordinals mapping for a non-analyzed field?
I personally think it's a trap to go back to fielddata from here
I've come around to agree with this sentiment. Once we build doc values for a field, we shouldn't allow switching back to using field data at that stage.
So I repeat the last recommendation:
"fielddata": {
"build": "eager | lazy", # eager = doc values, lazy = memory
"load": "lazy | eager | eager_global_ordinals"
}
with one distinction: if doc values are enabled (i.e. build: eager), then load: lazy | eager applies just to doc values, i.e. whether Lucene should open them immediately or on first use.
Any other suggestions on this change?
I don't really like eager vs lazy to explain the differences here.
There are plenty of differences, e.g. using filesystem cache versus on-heap memory. e.g. actually doing bitpacking versus bloating values up to 8/16/32/64. So eager vs lazy doesn't really make sense to me.
Do we really need a fielddata setting at all? You can already set docvalues another way.
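To put rough numbers on the bit-packing point above, here is a minimal Python sketch. It only compares "pack to the minimum number of bits" against "round each value up to 8/16/32/64 bits"; it is not Lucene's or Elasticsearch's actual encoding, and the function names (`packed_bits`, `rounded_bits`, `total_bytes`) are made up for illustration.

```python
def packed_bits(max_value: int) -> int:
    """Bits per value when packing to the minimum width (at least 1 bit)."""
    if max_value < 0:
        raise ValueError("only non-negative values are modelled here")
    return max(1, max_value.bit_length())


def rounded_bits(max_value: int) -> int:
    """Bits per value when rounding the width up to 8, 16, 32 or 64."""
    needed = packed_bits(max_value)
    for width in (8, 16, 32, 64):
        if needed <= width:
            return width
    raise ValueError("value does not fit in 64 bits")


def total_bytes(bits_per_value: int, count: int) -> int:
    """Total storage in bytes for `count` values, rounded up to a whole byte."""
    return (bits_per_value * count + 7) // 8


if __name__ == "__main__":
    count = 1_000_000
    max_value = 1000  # hypothetical column whose largest value is 1000
    print("packed :", total_bytes(packed_bits(max_value), count), "bytes")
    print("rounded:", total_bytes(rounded_bits(max_value), count), "bytes")
```

Under these assumptions a column whose largest value is 1000 needs 10 bits per value when packed but 16 bits when rounded up, so the rounded layout is 60% larger before any other overhead is counted.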
@rmuir I am all for one way of setting this. My take on this is the following:
doc_values : true|false to trigger DV, that is consistent with all the other FieldType settings we expose, i.e. indexed : true|false
fst format and make it a hard choice to either use DV or FieldData at field creation time. Since today we have paged_bytes, fst, doc_values and I think if we get rid of the last two we can simplify this a lot?
fielddata.global_ordinals : lazy|eager and only if the field has no DV
fielddata.doc_values: false|true that can be used to override the field setting if this helps to make progress here.
The motivation for this issue has changed somewhat since we switched to doc values by default. Initially, we were trying to make it easier (and less confusing) to use doc values. Now, with better defaults, users should only fiddle with these settings if they know what they're doing.
I suggest removing the doc_values setting altogether, and using just the fielddata settings as follows:
fielddata: {
format: "doc_values | heap",
load: "lazy | eager | eager_global_ordinals | disabled"
}
The format defaults to doc_values for all field types except analyzed strings (and I believe geo-ip doesn't support doc values yet?). This setting should not be dynamically updatable.
The load setting defaults to lazy, so that we're not loading doc values for (potentially) very many fields automatically. Analyzed strings could potentially default the load setting to disabled so that drive-by users don't cause a massive fielddata load by e.g. sorting on the "name" field. It should be possible to change the load setting dynamically.
After chatting to @jpountz and @s1monw about this, we have a new proposal. Assuming doc values deliver on their promise, the long term intention will be to remove in-memory fielddata completely. (This assumes that we have a solution for analyzed string fields as well).
New proposal:
doc_values: true | false # defaults to true for all but analyzed string fields
fielddata: disabled | lazy | eager # only if doc_values is false, will go away eg in 3.0
global_ordinals: lazy | eager # defaults to lazy
Later we may be able to add the auto option to global_ordinals so that we can make a runtime decision based on usage about whether global ordinals should be built lazily or eagerly.
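As a rough illustration of how the three settings in the new proposal interact, here is a small Python sketch that applies the stated defaults and the "only if doc_values is false" rule. The function `resolve_field_settings` and its `analyzed_string` flag are hypothetical, not Elasticsearch code. Rejecting (rather than silently ignoring) fielddata when doc values are on is an assumption, and so is leaving fielddata unset when nothing is specified, since the proposal names no default for it.

```python
FIELDDATA_MODES = {"disabled", "lazy", "eager"}
GLOBAL_ORDINALS_MODES = {"lazy", "eager"}


def resolve_field_settings(mapping: dict, analyzed_string: bool = False) -> dict:
    """Apply the proposal's defaults and cross-setting rule to one field mapping."""
    # doc_values defaults to true for all but analyzed string fields.
    doc_values = mapping.get("doc_values", not analyzed_string)
    if not isinstance(doc_values, bool):
        raise ValueError("doc_values must be true or false")

    # fielddata is only meaningful when doc_values is false.
    # Assumption: an explicit fielddata value alongside doc values is an error.
    fielddata = mapping.get("fielddata")
    if fielddata is not None:
        if doc_values:
            raise ValueError("fielddata only applies when doc_values is false")
        if fielddata not in FIELDDATA_MODES:
            raise ValueError(f"unknown fielddata mode: {fielddata!r}")

    # global_ordinals defaults to lazy.
    global_ordinals = mapping.get("global_ordinals", "lazy")
    if global_ordinals not in GLOBAL_ORDINALS_MODES:
        raise ValueError(f"unknown global_ordinals mode: {global_ordinals!r}")

    return {
        "doc_values": doc_values,
        "fielddata": fielddata,  # None means "not specified"
        "global_ordinals": global_ordinals,
    }


if __name__ == "__main__":
    print(resolve_field_settings({}))
    print(resolve_field_settings({"fielddata": "lazy"}, analyzed_string=True))
```

Treating the conflicting combination as an error is just one option; ignoring fielddata whenever doc values are on would be equally consistent with the proposal as written.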
+1 @rmuir WDYT
To clarify, global_ordinals: lazy means "build global ordinals only when they are required for a request", while eager means "build global ordinals on every refresh".
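The difference between the two modes can be sketched with a toy Python model that just counts how often global ordinals get built. The class `GlobalOrdinalsField` is hypothetical, and the model assumes that every refresh invalidates previously built global ordinals, which simplifies what a real index does.

```python
class GlobalOrdinalsField:
    """Toy model: counts global-ordinal builds under lazy vs eager loading."""

    def __init__(self, mode: str) -> None:
        if mode not in ("lazy", "eager"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self.builds = 0
        self._fresh = False  # True while the built ordinals are still valid

    def _build(self) -> None:
        self.builds += 1
        self._fresh = True

    def refresh(self) -> None:
        # Assumption: a refresh always invalidates the previous ordinals.
        self._fresh = False
        if self.mode == "eager":
            self._build()  # eager: pay the cost at refresh time

    def request(self) -> None:
        if not self._fresh:
            self._build()  # lazy: pay the cost on the first request that needs them


if __name__ == "__main__":
    for mode in ("lazy", "eager"):
        field = GlobalOrdinalsField(mode)
        for _ in range(3):
            field.refresh()
        field.request()
        field.request()
        print(mode, "builds:", field.builds)
```

In this model, three refreshes followed by two requests cost one build in lazy mode (paid by the first request) and three builds in eager mode (none paid by a request), which is the tradeoff an auto option would have to weigh.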
Closing in favour of #12394
Currently, the settings for fielddata and doc_values are quite confusing. It would be nice to make them easier to understand. For non-analyzed fields, these are the questions which need answering:
This could be expressed as:
In the same way as we can use analyzer to set both index_analyzer and search_analyzer, this could be condensed to:
Analyzed string fields can't be written to disk, but they can support the fst format, so they would accept:
We could possibly even support a very simple format for setting the fielddata format:
... which would set global_ordinals to lazy
Regardless of which format is used to set fielddata, the mappings would be converted to the full index_format, search_format, global_ordinals layout.
Question: Should index_format and search_format instead be store_format and load_format?