Closed gplechuck closed 2 years ago
From linked discussion -
Hi @gplechuck, Thanks to have opened an issue about this problem and thanks for your detailed explanation about “why it is important to fix it”. To answer your question, there is no real need to sort by id specifically. The real need is to be able to scroll the data among several queries, when requested size is over 10000 hits, and avoid pagination problems (duplicates or misses). So to do that, an efficient way should be to sort by “_doc”, that seems to be the recommendation up to Elastic documentation. What do you think about this change?
Hi @fbaligand ,
The suggestion to sort by _doc sounds very promising. I've done a small bit of testing - modified the plugin to use _doc instead of _id -
# from file .data_load/enhanced_table_request_handler.js , line 50, plugin, enhanced-table-1.13.0_7.6.1
if ((visParams.hitsSize !== undefined && visParams.hitsSize > MAX_HITS_SIZE) || visParams.csvFullExport) {
searchSourceFields.sort.push({'_doc': 'asc' });
}
The resulting sort portion of the queries now looks like this (when we exceed 10000 hits) -
"sort": [
{
"_score": {
"order": "desc"
}
},
{
"_doc": {
"order": "asc",
"unmapped_type": "boolean"
}
}
],
I can confirm that queries that previously resulted in increases in large amounts of _id fielddata , no longer do so, so that's a step in the right direction I think .
I see that the plugin uses the 'search after' feature, and that Elasticsearch recommend the use of _doc for sorting on here - https://www.elastic.co/guide/en/elasticsearch/reference/current/sort-search-results.html
I have a small concern though, might be nothing , but worth testing..... I didn't personally see duplicates or missing records in tests I did so far (the count of docs in the table matched the expected count of events, no duplications etc... ) , but I can also see that the _doc value is different depending on whether the result is returned from a primary or a replica shard. To test that, use the Kibana dev tools console and search for a single doc , in an index with one or more replicas , and sort by _doc . Retry the query several times , you can see in the results that the _doc value differs . I have no idea offhand if this could be a potential issue when using the search after feature , or if its a solved problem, perhaps you would have a better idea .
Cheers
Further to previous comment ... have seen some duplicate entries in a recent test unfortunately. Tested loading a visualization from 1 hour ago to 'now' in a cluster where documents are often updated multiple times shortly after creation to enrich them with further information. Subsequent document updates change the _doc number. I suspect if a doc is updated while a long running visualization is loading (paginating through hits ) that it could potentially cause a problem with the sorting - again, not sure , but worth keeping in mind and testing further .
A new field called '_shard_doc' has been added to recent versions of Elasticsearch, which could potentially be used in the sort - but it may have been added after version 7.10 which might be an issue if you want to remain compatible with OSS Elasticsearch and earlier versions. Perhaps making the sort tiebreaker default to '_id', but be a configurable option, could be useful to allow people to set their own unique sort fields.
Thanks a lot for your tests. Well, the plugin has a git branch for every Kibana breaking version. There is especially a 7.9 branch and 7.11 branch (and more). We can imagine to use "_doc" for branches <= 7.9 and "_shard_doc" for branches >= 7.11.
Do you have tested "_shard_doc"? No memory problem? No duplicates or misses?
Hey, no problem. I didn't test _shard_doc (I'm running an earlier version of Elasticsearch) , it seems it was added in version 7.12. I mentioned version 7.10 as that's the last common version before the Opensearch fork.
I'm not sure "_doc" will be 100% reliable for sorting in earlier versions (having seen duplicates in the limited testing I did) , but I could be wrong.
Perhaps making this sort field configurable (but default to _id) would be a compromise, a different unique field could be selected by users to ensure sort order without loading field data in the heap.
Well, I don’t think that sort by _id protects better against duplicates and misses, especially if id is not incremental.
The only safe way to avoid duplicates or misses would be to use pit
feature. But it’s complicated to use this as it requires an additional specific http request, and I don’t know how to do that in my front end plugin.
Hey @fbaligand
The_id field would be unique and remain unchanged whereas the value of the _doc field can change in some conditions.
For my own use case, there is a unique field that can be selected in the visualization as the sort field, and the combination of sorting on this unique field and _doc is proving effective, memory consumption is lighter and I'm not seeing duplicates . However if using the default sort field of _score in conjunction with _doc, there might be some scope to miss or duplicate events in the paginated searches.
Well, _score sort is rarely the good sort in kibana, especially, if you want to get a lot of hits, or worst, download all the data. It is the default in plugin, because it is elasticsearch default, and _score is generic (it exists for everybody) Using your specific unique field is a good idea.
Concerning _id, indeed, it will not change for a document. But between 2 search queries, you can have new ids that can generate misses. And as you explain, it can be dangerous for es memory. So I’m not convinced by the idea of making it a plugin option. For most users, this setting is weird or understandable, and it is not very useful, given that _doc and _shard_doc make the job and are recommended by Elastic.
Sounds good, thanks. Bear in mind that if the documents are frequently updated (the cluster I was testing can have several document updates shortly after creation), their _doc values will change.
Hi @gplechuck,
I have good news! Enhanced Table plugin v1.13.3 has been release with use of '_doc' for sort instead of '_id': https://github.com/fbaligand/kibana-enhanced-table/releases/tag/v1.13.3
Excellent, thanks!
Discussed in https://github.com/fbaligand/kibana-enhanced-table/discussions/271