Sorting by '_id' in Document Table visualization potentially causes large increase in heap usage (field data)

gplechuck commented 2 years ago

Discussed in https://github.com/fbaligand/kibana-enhanced-table/discussions/271

^{Originally posted by **gplechuck** May 17, 2022} Hey @fbaligand , Very nice plugin, have been using the 'Document Table' visualization for some time and am very happy with it as a lightweight alternative to the 'Data Table' visualization. Have noticed a problem when querying large datasets however! It seems that when the 'Max Hits' is set higher than 10000 , a sort by '_id' field is introduced into the resulting query. Sorting by '_id' is not recommended by Elasticsearch and can lead to a large amount of cached _id field data occupying the heap (I think use of _id field data will actually be disabled as default in future versions , see https://github.com/elastic/elasticsearch/issues/64511 ). I recently ran a query over a couple years worth of our data, and the overall heap usage spiked by 91GB, almost exclusively _id field data. Use of the '_id' field does not appear to be configurable in the Document Table 'Query Parameters' I had a very quick look at the source files and can see where the _id sort is introduced - ``` # from file .data_load/enhanced_table_request_handler.js , line 50, plugin, enhanced-table-1.13.0_7.6.1 if ((visParams.hitsSize !== undefined && visParams.hitsSize > MAX_HITS_SIZE) || visParams.csvFullExport) { searchSourceFields.sort.push({'_id': {'order': 'asc','unmapped_type': 'keyword'}}); } ``` Is this behaviour essential for anything and would there be any adverse effects in removing that snippet of code from the plugin as a workaround, so we could continue to use the table for large datasets ? Cheers

gplechuck commented 2 years ago

From linked discussion -

Hi @gplechuck, Thanks to have opened an issue about this problem and thanks for your detailed explanation about “why it is important to fix it”. To answer your question, there is no real need to sort by id specifically. The real need is to be able to scroll the data among several queries, when requested size is over 10000 hits, and avoid pagination problems (duplicates or misses). So to do that, an efficient way should be to sort by “_doc”, that seems to be the recommendation up to Elastic documentation. What do you think about this change?

gplechuck commented 2 years ago

Hi @fbaligand ,

The suggestion to sort by _doc sounds very promising. I've done a small bit of testing - modified the plugin to use _doc instead of _id -

# from file .data_load/enhanced_table_request_handler.js , line 50, plugin, enhanced-table-1.13.0_7.6.1

  if ((visParams.hitsSize !== undefined && visParams.hitsSize > MAX_HITS_SIZE) || visParams.csvFullExport) {
      searchSourceFields.sort.push({'_doc': 'asc' });
    }

The resulting sort portion of the queries now looks like this (when we exceed 10000 hits) -

  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "_doc": {
        "order": "asc",
        "unmapped_type": "boolean"
      }
    }
  ],

I can confirm that queries that previously resulted in increases in large amounts of _id fielddata , no longer do so, so that's a step in the right direction I think .

I see that the plugin uses the 'search after' feature, and that Elasticsearch recommend the use of _doc for sorting on here - https://www.elastic.co/guide/en/elasticsearch/reference/current/sort-search-results.html

I have a small concern though, might be nothing , but worth testing..... I didn't personally see duplicates or missing records in tests I did so far (the count of docs in the table matched the expected count of events, no duplications etc... ) , but I can also see that the _doc value is different depending on whether the result is returned from a primary or a replica shard. To test that, use the Kibana dev tools console and search for a single doc , in an index with one or more replicas , and sort by _doc . Retry the query several times , you can see in the results that the _doc value differs . I have no idea offhand if this could be a potential issue when using the search after feature , or if its a solved problem, perhaps you would have a better idea .

Cheers

gplechuck commented 2 years ago

Further to previous comment ... have seen some duplicate entries in a recent test unfortunately. Tested loading a visualization from 1 hour ago to 'now' in a cluster where documents are often updated multiple times shortly after creation to enrich them with further information. Subsequent document updates change the _doc number. I suspect if a doc is updated while a long running visualization is loading (paginating through hits ) that it could potentially cause a problem with the sorting - again, not sure , but worth keeping in mind and testing further .

gplechuck commented 2 years ago

A new field called '_shard_doc' has been added to recent versions of Elasticsearch, which could potentially be used in the sort - but it may have been added after version 7.10 which might be an issue if you want to remain compatible with OSS Elasticsearch and earlier versions. Perhaps making the sort tiebreaker default to '_id', but be a configurable option, could be useful to allow people to set their own unique sort fields.

fbaligand commented 2 years ago

Thanks a lot for your tests. Well, the plugin has a git branch for every Kibana breaking version. There is especially a 7.9 branch and 7.11 branch (and more). We can imagine to use "_doc" for branches <= 7.9 and "_shard_doc" for branches >= 7.11.

Do you have tested "_shard_doc"? No memory problem? No duplicates or misses?

gplechuck commented 2 years ago

Hey, no problem. I didn't test _shard_doc (I'm running an earlier version of Elasticsearch) , it seems it was added in version 7.12. I mentioned version 7.10 as that's the last common version before the Opensearch fork.

https://www.elastic.co/guide/en/elasticsearch/reference/7.12/paginate-search-results.html#paginate-search-results

I'm not sure "_doc" will be 100% reliable for sorting in earlier versions (having seen duplicates in the limited testing I did) , but I could be wrong.

Perhaps making this sort field configurable (but default to _id) would be a compromise, a different unique field could be selected by users to ensure sort order without loading field data in the heap.

fbaligand commented 2 years ago

Well, I don’t think that sort by _id protects better against duplicates and misses, especially if id is not incremental.

fbaligand commented 2 years ago

The only safe way to avoid duplicates or misses would be to use pit feature. But it’s complicated to use this as it requires an additional specific http request, and I don’t know how to do that in my front end plugin.

gplechuck commented 2 years ago

Hey @fbaligand

The_id field would be unique and remain unchanged whereas the value of the _doc field can change in some conditions.

For my own use case, there is a unique field that can be selected in the visualization as the sort field, and the combination of sorting on this unique field and _doc is proving effective, memory consumption is lighter and I'm not seeing duplicates . However if using the default sort field of _score in conjunction with _doc, there might be some scope to miss or duplicate events in the paginated searches.

fbaligand commented 2 years ago

Well, _score sort is rarely the good sort in kibana, especially, if you want to get a lot of hits, or worst, download all the data. It is the default in plugin, because it is elasticsearch default, and _score is generic (it exists for everybody) Using your specific unique field is a good idea.

Concerning _id, indeed, it will not change for a document. But between 2 search queries, you can have new ids that can generate misses. And as you explain, it can be dangerous for es memory. So I’m not convinced by the idea of making it a plugin option. For most users, this setting is weird or understandable, and it is not very useful, given that _doc and _shard_doc make the job and are recommended by Elastic.

gplechuck commented 2 years ago

Sounds good, thanks. Bear in mind that if the documents are frequently updated (the cluster I was testing can have several document updates shortly after creation), their _doc values will change.

fbaligand commented 2 years ago

Hi @gplechuck,

I have good news! Enhanced Table plugin v1.13.3 has been release with use of '_doc' for sort instead of '_id': https://github.com/fbaligand/kibana-enhanced-table/releases/tag/v1.13.3

gplechuck commented 1 year ago

Excellent, thanks!

fbaligand / kibana-enhanced-table

Sorting by '_id' in Document Table visualization potentially causes large increase in heap usage (field data) #274

Discussed in https://github.com/fbaligand/kibana-enhanced-table/discussions/271