elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Potentially large data held in multiple formats by ML datafeed #89769

Open droberts195 opened 2 years ago

droberts195 commented 2 years ago

A user saw an OOM on an ML node while a categorization job was running on an index with very large message fields. This stack trace from the heap dump shows why:

[Screenshot 2022-09-01 at 10 04 21: stack trace from the heap dump]

Therefore, we are storing all the large message fields at least twice, and one of them three times. We should aim to release the original search hits one by one as they are converted to the post data input, instead of releasing them all at the end of the conversion.
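
To make the duplication concrete, here is an illustrative sketch, not the actual datafeed code, of how one large `_source` can be resident in three forms at once while a hit is converted (the conversion itself is simplified to `getSourceAsString`):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.elasticsearch.common.bytes.BytesReference;
    import org.elasticsearch.search.SearchHit;

    class DuplicationSketch {
        // Illustrative only: shows the three simultaneous copies, not the
        // real conversion logic.
        static void convert(SearchHit hit, ByteArrayOutputStream postData) throws IOException {
            BytesReference raw = hit.getSourceRef();           // 1. raw _source bytes held by the hit
            Map<String, Object> parsed = hit.getSourceAsMap(); // 2. parsed map (the SearchHit may cache this)
            postData.write(hit.getSourceAsString().getBytes(StandardCharsets.UTF_8)); // 3. copy in the post data buffer
        }
    }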

It is important that the datafeed doesn't use substantially more memory than the size of the original search hits, because any extra memory the datafeed uses is not protected by any circuit breaker. The datafeed needs to restrict its usage to roughly the size of the original search hits.

elasticsearchmachine commented 2 years ago

Pinging @elastic/ml-core (Team:ML)

afn7081 commented 1 year ago

Hi, can I take up this issue?

droberts195 commented 1 year ago

Hi @afn7081. If you are able to create a PR to fix this then that would be great. If you don’t know the ML code then it won’t necessarily be as trivial as the good first issue label might suggest though.

i-plusplus commented 1 year ago

I am new to contributing to open source and to this project. I noticed this issue was last updated 7 weeks ago. Is anyone working on it? If not, can I take it up?

droberts195 commented 1 year ago

@i-plusplus if you think you know the ML code well enough to fix it then that would be great. But if you've never looked at the ML code you won't necessarily find it as trivial as the good first issue label might suggest.

i-plusplus commented 1 year ago

Thanks @droberts195, I would love to give it a try. I will create a PR, and if I fail I will report back.

i-plusplus commented 1 year ago

@droberts195 do you have steps to reproduce this issue?

i-plusplus commented 1 year ago

@droberts195, can you please look at this change: https://github.com/elastic/elasticsearch/commit/ec0bb3fbc82e01f14b0e895a48dd05000cf4666f

This is not the final change; I haven't yet tested it or added unit tests. I just want to check with you that I am heading in the right direction.

I didn't find the Map<String, Object> storage of the data. This change should solve the first and third memory-hogging problems you mention in the bug report.

i-plusplus commented 1 year ago

Also, my current code change is not clean and makes the method stateless. Please suggest a better way of doing it.

droberts195 commented 1 year ago

> I just want to check with you that I am heading in the right direction.

I think you are heading in the right direction.

> I didn't find the Map<String, Object> storage of the data.

It's here: https://github.com/elastic/elasticsearch/blob/6ccaa94c9fedf257922e71fc6040aa97814ca7c4/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/extractor/SourceField.java#L31

But the map is owned by the SearchHit. So I think you have already covered this by setting all references to the SearchHit to null so that it can be garbage collected if necessary. Then its member variables that aren't referenced elsewhere can also be garbage collected.

> Also, my current code change is not clean and makes the method stateless. Please suggest a better way of doing it.

I think that is inevitable to some extent with this change. It was always going to involve setting some references to null before the objects that originally contained them were completely finished with.

One possibility for reducing this would be to move the extraction of scrollId out of the method, so that the only part of the response it uses is the hits. So rename the method to processAndConsumeSearchHits, and change the places it's called from to look like this:

        // Pull out the scroll ID first: after this, the hits array is the
        // only part of the response that is still needed.
        scrollId = searchResponse.getScrollId();
        SearchHit[] hits = searchResponse.getHits().getHits();
        // Drop the reference to the response so the whole SearchResponse
        // can be garbage collected while the hits are consumed.
        searchResponse = null;
        return processAndConsumeSearchHits(hits);

We'll still be setting array elements to null as soon as they're used, but maybe it's not as bad as modifying a deeply nested part of a SearchResponse.

Also, setting the SearchResponse to null early may be helpful in other ways, if the complex inner structure of the SearchResponse already somehow has multiple references to the array of hits.
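
Putting those pieces together, a minimal sketch of what the consuming method could look like, with the per-hit serialization simplified to `getSourceAsString` (the real datafeed extraction is more involved, and the eventual fix may differ in detail):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    import org.elasticsearch.search.SearchHit;

    class ConsumeSketch {
        private InputStream processAndConsumeSearchHits(SearchHit[] hits) throws IOException {
            ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
            for (int i = 0; i < hits.length; i++) {
                outputStream.write(hits[i].getSourceAsString().getBytes(StandardCharsets.UTF_8));
                outputStream.write('\n');
                // Consume the hit: with the array slot nulled (and the
                // SearchResponse reference already dropped by the caller),
                // the hit and its source become eligible for garbage
                // collection before the remaining hits are processed.
                hits[i] = null;
            }
            return new ByteArrayInputStream(outputStream.toByteArray());
        }
    }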

i-plusplus commented 1 year ago

Thanks @droberts195,

I have created PR https://github.com/elastic/elasticsearch/pull/98324.

droberts195 commented 1 year ago

#98324 solved the worst aspects of this.

There are still a couple more changes that could be made:

  1. Something similar to #98324 but for when the datafeed is using aggregations rather than scroll.
  2. Use something more advanced than ByteArrayOutputStream for building up the input to the post data request. ByteArrayOutputStream doubles the size of its buffer each time it needs more space, and as the buffer gets larger this can waste significant memory, for example doubling from 16MB to 32MB. If only a few more documents remain to be processed when that happens, it is even more wasteful. BytesReference has much better functionality for incrementally adding more space without duplicating the buffer during expansion. A small demo of the doubling wastage follows below.
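
To illustrate point 2, here is a small self-contained demo, not Elasticsearch code, that observes the doubling: `buf` is a protected field of ByteArrayOutputStream, so a subclass can report allocated capacity against the bytes actually written.

    import java.io.ByteArrayOutputStream;

    // Demo only: after writing 17 x 1MB, the backing buffer has doubled
    // from 16MB to 32MB, so almost 15MB of the allocation sits unused.
    public class DoublingWasteDemo extends ByteArrayOutputStream {
        public static void main(String[] args) {
            DoublingWasteDemo out = new DoublingWasteDemo();
            byte[] oneMegabyte = new byte[1024 * 1024];
            for (int i = 0; i < 17; i++) {
                out.write(oneMegabyte, 0, oneMegabyte.length);
            }
            System.out.printf("written=%dMB capacity=%dMB%n",
                out.size() / (1024 * 1024), out.buf.length / (1024 * 1024));
            // Prints: written=17MB capacity=32MB
        }
    }
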
i-plusplus commented 1 year ago

Great catch @droberts195. Can I take it up?

droberts195 commented 1 year ago

@i-plusplus sure, if you'd like to submit a new PR to make one or the other of these improvements, that would be great.

i-plusplus commented 1 year ago

@droberts195, I have created this PR.