elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Collapse field types into "families" of field types in `_field_caps`. #53175

Closed: jpountz closed this issue 4 years ago

jpountz commented 4 years ago

Both Kibana and SQL have logic on top of the _field_caps API to detect when a conflict is harmful versus when it's not. Part of how it works is by grouping field types into families of fields: conflicts are fine as long as the types across indices are all of the same family.

Should we move this logic to _field_caps instead?

Here is an example of how we could collapse fields: for instance, float, scaled_float and double; integer and long; date and date_nanos; and keyword, constant_keyword and wildcard could each be grouped into a single family.

Most other fields would be their own family as they don't share properties with other fields, e.g. boolean or flattened.

So for instance, the below response today is considered a conflict:

{
    "indices": ["index1", "index2", "index3", "index4", "index5"],
    "fields": {
        "rating": { 
            "double": {
                "searchable": true,
                "aggregatable": true,
                "indices": ["index1", "index2"]
            },
            "float": {
                "searchable": true,
                "aggregatable": true,
                "indices": ["index3", "index4"],
            }
        }
    }
}

It would no longer be considered a conflict if double and float were collapsed into a number "family".
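For illustration, here is a minimal sketch (as a TypeScript object literal, reusing the hypothetical index and field names from the response above) of how the same response could look if double and float were collapsed into a single number family; the per-type entries merge and the conflict disappears:

// Sketch only: the "number" family name and the exact response shape are
// assumptions for discussion, not the final _field_caps output.
const collapsedResponse = {
  indices: ["index1", "index2", "index3", "index4", "index5"],
  fields: {
    rating: {
      number: {
        searchable: true,
        aggregatable: true,
        // No per-type "indices" lists are needed: all indices now agree.
      },
    },
  },
};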

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Mapping)

costin commented 4 years ago

Having some type of "category" would be useful; however, I'm not sure whether in the end it would be useful across all consumers. For example, for SQL there's a big difference between reporting a long and a double, as they have different size, precision, etc. Currently, in case of a merge we do consider the field conflicting, since it would be a merge between similar yet different types. This could be changed if we chose a different merging strategy; however, it's valuable to have the consumer decide on that instead of pushing it into field_caps.

For example, if a field is mapped as a byte, short, integer, long and double across indices, what would be the winning type when merging, and why?

jpountz commented 4 years ago

This could be changed if we chose a different merging strategy; however, it's valuable to have the consumer decide on that instead of pushing it into field_caps.

I have mixed feelings on this. We're doing everything so that field types are as transparent as possible, e.g. long fields accept range queries that have floating-point bounds and do the right thing. Yet introducing new numeric fields requires changes in our SQL layer, which feels wrong to me.

For example, if a field is mapped as a byte, short, integer, long and double across indices, what would be the winning type when merging, and why?

How do you do it in SQL today?

astefan commented 4 years ago

In SQL, we depend on field_caps almost entirely, and while keyword and constant_keyword can be regarded as "equivalent" when merging mappings, there are scenarios where we need to know what a field's type truly is. Think about DESCRIBE index_alias versus DESCRIBE single_index. The first will consider the equivalence between keyword and constant_keyword and display the field as keyword, while the second will show the constant_keyword field type.

Note: keyword and constant_keyword are the only pair of field types that are considered equal in the mapping-merging scenario.

With numeric fields (which, unlike the scenario above, we never consider "equal"), knowing which type is which is important when we take the _source and re-parse the numeric field to display it back to the user. A small note here: a scaled_float is not extracted from _source.

mattkime commented 4 years ago

When you say that kibana performs this kind of merge, I think you're talking about kibana index patterns. I'm curious if there are other parts of kibana in which the specific field types are important.

Here's the related kibana code - https://github.com/elastic/kibana/blob/master/src/plugins/data/common/kbn_field_types/kbn_field_types_factory.ts

Would this change merge collapsible fields in such a way that we would no longer know the types comprising them?
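For readers unfamiliar with that file, the grouping it performs is roughly of the following shape. This is a simplified, hypothetical TypeScript sketch (the names KbnFieldTypeSpec, kbnFieldTypes and kbnTypeFor are illustrative, not the actual Kibana API); the point is that several concrete ES types collapse into one Kibana type while esTypes keeps the originals around:

interface KbnFieldTypeSpec {
  name: string;       // grouped Kibana field type
  esTypes: string[];  // concrete ES field types collapsed into it
}

const kbnFieldTypes: KbnFieldTypeSpec[] = [
  { name: "string", esTypes: ["keyword", "text"] },
  { name: "number", esTypes: ["long", "integer", "short", "byte", "float", "double", "scaled_float"] },
  { name: "date", esTypes: ["date", "date_nanos"] },
];

// Look up the grouped Kibana type for a concrete ES type.
function kbnTypeFor(esType: string): string | undefined {
  return kbnFieldTypes.find((t) => t.esTypes.includes(esType))?.name;
}

// e.g. kbnTypeFor("scaled_float") -> "number"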

jpountz commented 4 years ago

@mattkime If knowing the actual types is a requirement for you, it would be doable. Are there places where you need to know about the actual types today?

ppisljar commented 4 years ago

I don't think we are interested in the actual Elasticsearch type in Kibana. Am I missing something @timroes?

timroes commented 4 years ago

We're storing the actual ES types, so removing them would also be a breaking change. In terms of what we're using them for: I don't have a complete list, but I think they were mainly introduced for KQL, since we needed to decide between text and keyword fields for the way we build the query (I can't find that specific code right now). Another example is field formatters, where we are using them.

Even if we could solve those without the ES types, I think we'll always have the need. One example that comes to my mind is support for the significant text aggregation (https://github.com/elastic/kibana/issues/31614). We can only achieve that if we have the real ES types available, since our grouped "String" type is not enough, and we also explicitly need to bypass the fact that text fields are not aggregatable by default in order to allow them for this aggregation.
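As a concrete illustration of that last point, here is a hypothetical TypeScript sketch (not actual Kibana code; the IndexPatternField shape and the supportsSignificantText helper are assumptions) of the kind of check that only works when the real ES types are stored alongside the grouped type:

interface IndexPatternField {
  name: string;
  type: string;        // grouped Kibana type, e.g. "string"
  esTypes?: string[];  // concrete ES types, e.g. ["text"] or ["keyword"]
}

// significant_text only makes sense on text fields, even though text fields
// are not aggregatable by default; the grouped "string" type alone cannot
// tell text and keyword apart.
function supportsSignificantText(field: IndexPatternField): boolean {
  return field.esTypes?.includes("text") ?? false;
}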

I don't think removing esTypes is something we want to do at the moment; otherwise we'll run into more problems in the future and (again) into features we cannot build without it. Also, if we did decide to remove it, I think it should be treated as a breaking change.

jpountz commented 4 years ago

@timroes I don't think we could ever collapse text and keyword together. We are only considering doing it for fields that would have a behavior that is so close that Kibana should never worry about knowing the exact type. For instance float, scaled_float and double, integer and long, date and date_nanos, or keyword, constant_keyword and wildcard.

This would be a breaking change indeed when we change it for existing types, and this won't happen until Elasticsearch 8.0. But we could potentially start doing it for fields that are not released yet such as wildcard in 7.x.

astefan commented 4 years ago

Just to reiterate that SQL will still need to know the exact field types. It is true that at search/query time, this is not relevant for SQL in general, but reporting the data types of the fields in a table is a vital part of SQL. DESCRIBE TABLE, SHOW COLUMNS and the drivers do use the actual field types. The JDBC driver has specific methods for different field types, for retrieval and conversion purposes.

jpountz commented 4 years ago

@astefan For my understanding, what would be the problem with always returning constant_keyword fields as keyword and floats as doubles in SQL? If we are already doing it in the case of aliases sometimes, what would be wrong with doing it all the time?

timroes commented 4 years ago

@timroes I don't think we could ever collapse text and keyword together. We are only considering doing it for fields that would have a behavior that is so close that Kibana should never worry about knowing the exact type. For instance float, scaled_float and double, integer and long, date and date_nanos, or keyword, constant_keyword and wildcard.

And I have the feeling that with that we already run into the problem I mentioned when we discussed this last time. If Elasticsearch makes those groupings, it bakes internal Kibana knowledge, which you might not have, into the decision. So let's use exactly those pairs you mentioned as an example. JavaScript doesn't have 64-bit integers, so we currently cannot represent long properly. The language will most likely get support for that, and at that point we'll have a way to treat long properly, but we might need some differentiation between what's a long and what's an integer field to do so. By grouping them together here, you would take away the possibility for us to support that properly. I still think there's high value for us in knowing the exact ES type behind a field.
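To make the concern concrete, here is a hypothetical TypeScript sketch (the parseNumericField helper is an assumption, not existing Kibana code) of why the distinction could matter once 64-bit support is used: values above Number.MAX_SAFE_INTEGER (2^53 - 1) lose precision as a regular number but not as a BigInt, and only the ES type says whether that treatment is needed:

function parseNumericField(rawValue: string, esType: string): number | bigint {
  // Only "long" needs the BigInt treatment; "integer" always fits in a number.
  return esType === "long" ? BigInt(rawValue) : Number(rawValue);
}

console.log(Number("9007199254740993"));                    // 9007199254740992 (precision lost)
console.log(parseNumericField("9007199254740993", "long")); // 9007199254740993n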

Another example is date vs. date_nanos. date_nanos runs 100% into that "JavaScript can't represent the current 64-bit nanosecond timestamp properly" situation. So we need a lot of special handling around date_nanos that we don't need for dates, and it's crucial we know the difference between a date field and a date_nanos field.

Meaning, if we want to get rid of exact ES types in _field_caps, I think we would need a shared sync for every field to understand the impact on Kibana before we decide on families. And even then I don't think it will work, since there might simply be future features, which we are not aware of yet, for which we will need to know that difference, and the decision might block us. It happened in the past: that was the whole reason we needed to introduce storing the specific ES types, since we saw our family grouping blocking some things, and I don't see how it would be different if we put the family grouping into ES. So my strong recommendation would be to keep the actual field type information around for Kibana to store and use.

jpountz commented 4 years ago

@timroes I'm still unclear why this matters. For instance today, does it make any difference to Kibana if someone indexed integers in a long field vs. an integer field? I have a similar question regarding the special handling that you have for date_nanos: what would be wrong with applying the same handling to regular date fields? Sorry for being pushy, but I'm genuinely interested in understanding better, especially as I wonder whether there might be workarounds in Kibana for things that we should consider bugs in Elasticsearch and that we're not aware of.

timroes commented 4 years ago

Currently we don't have different logic for long and integer, so no, today it doesn't matter. But that's kind of my point: we can't rule it out, since as soon as we have language support for 64-bit integers, we might want to parse long values differently than integer values in Kibana to actually support them. That is not set in stone; it's just a more or less realistic example of a future feature where we might run into problems if we don't have the type.

With regards to date vs. date_nanos, I don't think there's anything you can do differently. The problem is simply that we are not able to parse the current time in nanoseconds and represent it in JavaScript (since it's outside the range you can safely store inside a 64-bit floating-point number without precision loss). Our date library, moment.js, can't handle them either (mainly for the same reason of not being able to represent the timestamp as a number). That means we e.g. have a specific date_nanos field formatter, which parses the ISO string of the date by cutting away the nanos part of the string and storing it for later, then parses the rest regularly via the date library, and appends the nanoseconds again at the end. Therefore this field formatter unfortunately doesn't allow configuring the template of the formatted string as freely as the regular date formatter. There are a couple of related issues on that.

I don't think any of them is solvable in Elasticsearch.
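For illustration, a rough TypeScript sketch of the cut-and-re-append approach described above (this is not the actual Kibana date_nanos formatter; it assumes UTC "Z" timestamps and ISO output instead of a configurable template):

function formatDateNanos(isoNanos: string): string {
  // e.g. "2020-06-15T12:34:56.123456789Z"
  const match = isoNanos.match(/^(.*\.\d{3})(\d{1,6})Z$/);
  if (!match) {
    // Millisecond precision or less: a plain Date can handle it directly.
    return new Date(isoNanos).toISOString();
  }
  const [, milliPart, subMilliDigits] = match;
  // Format the millisecond-precision part the normal way...
  const formatted = new Date(`${milliPart}Z`).toISOString();
  // ...then re-append the digits that Date (and moment.js) cannot represent.
  return formatted.replace(/Z$/, `${subMilliDigits}Z`);
}

console.log(formatDateNanos("2020-06-15T12:34:56.123456789Z"));
// -> "2020-06-15T12:34:56.123456789Z"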

By the way, I fully support your base statement here: if any of those cases where we need to behave differently in Kibana based on the specific field type is solvable in ES, we should solve it in ES and not in Kibana. I definitely want to keep the number of those places as low as possible; so far it just hasn't turned out to be zero, and as long as I can still "easily" make up examples like integer vs long, I worry that we'll always have cases where we need the specific types.

jtibshirani commented 4 years ago

My understanding is that we'd like to collapse field types together that have nearly identical read-time behavior -- and this includes not only searching but also loading/displaying field values. But there's a question of what 'nearly identical behavior' actually means for the different field types.

To me there are two different cases. First, it seems like we're hesitant to collapse together fields like long and integer, or date and date_nanos, because they have different precision and the client may want to handle them differently. For example, in the case of the JDBC driver, we want to match the data types in the spec as closely as possible. That way a field mapped as a float will be returned as a Java float.

In the second case, a small number of field types are designed to have the same external behavior as a 'standard' type, but give different performance trade-offs in terms of disk usage or speed. These field types could be presented as if they were the 'standard' field type; constant_keyword and wildcard, for example, could be presented as keyword.

Related to this case, we plan to introduce 'runtime fields' that are defined in the mappings, but calculate their values on-the-fly. For example there could be a runtime field that takes two concrete integer fields and adds them together. Regardless of how we choose to design the runtime fields (maybe it's a new mapping type), in field caps it should be presented as an integer.

Would we at least be interested in collapsing field types in the second case?

timroes commented 4 years ago

I think for the second case it could make sense to collapse them. If it's really just performance optimizations behind the scenes, but otherwise the behavior is exactly the same, we could imho collapse them. I just have concerns about collapsing fields as long as they do not behave exactly the same from an API point of view, and that could mean: different data types returned, not being usable in exactly the same APIs (even one very specific aggregation not working with both means, for me, that we should not collapse them, or at least need the raw field information in that case), or differences in accessing them via doc values, scripts, etc.

If they really behave exactly the same in all those cases, I don't see a problem (from the Kibana side) with collapsing them together, even without us having the original field information. As soon as they are only 99.9% the same with regard to their API behavior, I'd not collapse them together, since I'll very likely be able to come up with an example where we would need the differentiation.

So in your example, it sounds like we could merge constant_keyword into keyword, but from Mark's comment it doesn't look like we could/should merge wildcard into keyword, since wildcard cannot be used e.g. in regex queries, range queries or fuzzy queries, so it's definitely not behaving the same in terms of API.

jpountz commented 4 years ago

from Mark's comment it doesn't look like we could/should merge wildcard into keyword, since wildcard cannot be used e.g. in regex queries, range queries or fuzzy queries, so it's definitely not behaving the same in terms of API.

FYI this is getting worked on these days in order to make both fields support exactly the same operations.

jtibshirani commented 4 years ago

As soon as they are only 99.9% the same with regard to their API behavior, I'd not collapse them together, since I'll very likely be able to come up with an example where we would need the differentiation.

Maybe we could think about this more as a trade-off instead of a rule 'as soon as they are different at all, we shouldn't collapse the types'. For example, a field alias on a keyword supports almost all the same functionality as keyword, but there are still unsupported APIs like using an alias in a terms lookup query. We still decided to present field aliases as if they were the concrete type, and this let us support aliases in Kibana quickly and with minimal changes. There's less future flexibility on Kibana's end, but perhaps the trade-off in terms of simplicity + effort was worth it.

jtibshirani commented 4 years ago

@timroes @astefan with the wildcard field coming out soon, it would be nice to settle on the behavior. Would you be okay with the following changes?

constant_keyword presented as keyword (8.0)
wildcard presented as keyword (for 7.x when this new type is released)

The scaled_float change might need to wait until the fields API is introduced, because I think SQL now needs to load scaled floats from doc values, whereas other numeric fields are pulled from _source.

jtibshirani commented 4 years ago

Also tagging @markharwood for awareness on the above.

timroes commented 4 years ago

constant_keyword presented as keyword (8.0)

That sounded fine to me, given that it behaves exactly the same, right?

wildcard presented as keyword (for 7.x when this new type is released)

Just for clarification from that issue in Kibana: are there differences in how it can be used in queries and APIs (not performance differences, but APIs that simply don't work if you use a wildcard field)? Because if so, I'd stand by my statement above: we should not collapse fields if they do not behave exactly the same within the API. While I see your wish to make it easy to add those new field types to Kibana by yourselves, if they don't support the exact same APIs you potentially create bugs in Kibana by doing so, since suddenly the KQL we compile or the filters we create might no longer work. I think it's really not a good idea to add even slightly functionally different fields automatically to Kibana, without having someone from the Kibana team who knows all the potential side effects take a look, and without having that field type go through our regular PR review process.

We still decided to present field aliases as if they were the concrete type, and this let us support aliases in Kibana quickly and with minimal changes.

That might not be as good an example as you think, since aliases are really not well supported in Kibana, and we very often get negative feedback on how aliases work (or don't) in Kibana (we just haven't had the capacity yet to address this) :D

astefan commented 4 years ago

@jtibshirani for "runtime" reporting of field data types in _field_caps, I don't see the collapsing as an issue, but there is also the other use case where we show users the field types with commands like SYS COLUMNS. Think of _field_caps as a trimmed-down version of the GET _mapping API: we extrapolate from its content the mapping of all the fields in a bunch of indices.

Collapsing those field types together means we'll tell users or BI tools (via SYS COLUMNS) that a column X is of type keyword while, in fact, it's of type wildcard. CCing @costin to get his input on this as well.

jimczi commented 4 years ago

Just for clarification from that issue in Kibana: are there differences in how it can be used in queries and APIs (not performance differences, but APIs that simply don't work if you use a wildcard field)?

No, there is not. Although I think we should approach this differently: the intent for the wildcard field is to provide the same functionality as a keyword field, so any difference should be treated as a bug. #58044 is a good example: we found a discrepancy and fixed it so that keyword and wildcard fields behave the same in scripts. That would hold for any family that we create for _field_caps: fields within a family should behave the same.

Collapsing those field types together means we'll tell users or BI tools (via SYS COLUMNS) that a column X is of type keyword while, in fact, it's of type wildcard

I wonder why you consider this an issue? Why would BI users care about ES mapping internals if they can rely on family types? Knowing that the field is of the keyword family should be enough to design an SQL query, shouldn't it?

astefan commented 4 years ago

I wonder why you consider this an issue? Why would BI users care about ES mapping internals if they can rely on family types? Knowing that the field is of the keyword family should be enough to design an SQL query, shouldn't it?

The float and scaled_float data types have different precisions (display column sizes). For the other three types in question the precision is the same. And it's about the BI tools themselves, which, in theory, look at the display size of numbers to know how many digits to show.

For the text-based data types in question, I think it's more of a conceptual decision (which is why I CCed @costin for his input) whether to have inconsistent (or incorrect) information displayed as the result of a SYS COLUMNS query.

costin commented 4 years ago

I can't think of a big disadvantage of this change in SQL. The main change is that describing an index (DESCRIBE TABLE) will not show the exact mapping of an index (keyword vs constant_keyword). I think that's fine, since when dealing with aliases/merging we would have to merge the fields anyway, which some might find confusing. I'm not clear whether numeric types would always be widened, namely float to double or int to long. If that is the case, the issue is that without explicit casting SQL would never return int or float, which would be unexpected. While keyword, constant_keyword and wildcard are in the end strings (the char family), int in particular is a common, popular type that we would end up not supporting, and one fairly different from long.

jtibshirani commented 4 years ago

The intent for the wildcard field is to provide the same functionality as a keyword field, so any difference should be treated as a bug... That would hold for any family that we create for _field_caps: fields within a family should behave the same.

I think @jimczi's point is a really important one -- if we collapsed two field types into a 'family', then we would consider any significant difference in functionality to be a high-priority bug in ES.

Even beyond the field caps API, it's really nice to have this concept of a field type 'family' that's guaranteed to have the same behavior. I think it will help users understand all the new field types we're adding -- they can think of constant_keyword and wildcard as being variants on 'keyword' and be assured they support the same functionality.

It sounds like there are concerns around collapsing scaled_float and float, but agreement on the other types. I suggest we hold off on numeric types for now and focus on the keyword-related changes (presenting constant_keyword and wildcard as keyword).

If there is still hesitation, please feel free to reply (or ping me directly) -- I think the next step would be to have a short discussion as a group.

astefan commented 4 years ago

@jtibshirani one small, apparently obvious aspect, just to make sure this is also covered: collapsing fields into a "common" type should only be applied when merging fields, and not when a single index is involved in the _field_caps request, right? As an example, in the case of two indices, one with a test field of type wildcard and the other with the same test field mapped as constant_keyword, _field_caps will report the test field as keyword. But if _field_caps is called for the first index only, the result will report test as wildcard.

jimczi commented 4 years ago

@astefan, no, the plan is to always return keyword for these fields; constant_keyword and wildcard will not be part of the types that field_caps can return.
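Spelling that out with a small TypeScript sketch (hypothetical index and field names, and a response shape that mirrors the earlier example rather than the exact API): even when the only index queried maps the field as wildcard, _field_caps would report the family type keyword:

const singleIndexFieldCaps = {
  indices: ["logs-1"],
  fields: {
    message: {
      // Keyed by the family type "keyword", even though the mapping says wildcard.
      keyword: {
        searchable: true,
        aggregatable: true,
      },
    },
  },
};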

astefan commented 4 years ago

Thank you for clarifying. After all, it wasn't as obvious as I thought :-).

markharwood commented 4 years ago

I've updated the master and 7.x branches with support for the keyword family. This also added a new method in MappedFieldType which can be overridden by other field types to declare family membership.

jtibshirani commented 4 years ago

Now that we've decided we're okay with having 'field type families' in certain circumstances (and have a lightweight way to add new families thanks to @markharwood), I suggest we close this issue. To add future families, we can open dedicated issues/PRs and discuss them there.

I also added a note to #57548 about adding the concept of a 'field type family' to our docs. I think it'd be helpful if we surfaced the relationship between keyword, constant_keyword, and wildcard -- users can be assured they support all the same functionality.