Snowstorm induced memory leak in Elasticsearch?

IHTSDO / snowstorm

Scalable SNOMED CT Terminology Server using Elasticsearch


Snowstorm induced memory leak in Elasticsearch? #31

Closed nhnicwaller closed 5 years ago

nhnicwaller commented 5 years ago

I'm running Snowstorm server in Amazon Web Services, in combination with a hosted Elasticsearch service provided by Amazon Elasticsearch Service. This has been running for about two months now, and today I noticed some undesirable trends in metrics corresponding to our Elasticsearch instance for Snowstorm. In the last 63 days...

JVMMemoryPressure increased from 28.5% to 65.9% in a stepwise fashion, with steps occurring approximately every 4 hours. This correlates with a brief spike in DiskQueueDepth, which normally holds at 0.
JVMGCYoungCollectionCount and JVMGCYoungCollectionTime are increasing linearly over time, with no apparent connection to the steps shown in JVMMemoryPressure.
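
For a sense of scale, the figures above work out to roughly 0.1 percentage points of JVMMemoryPressure per step. The sketch below is only that arithmetic, assuming the steps arrive exactly every 4 hours and that growth stays linear; `days_until` and the threshold passed to it are illustrative and do not come from the report.

```python
# Figures quoted in the report above.
START_PCT = 28.5           # JVMMemoryPressure 63 days ago
END_PCT = 65.9             # JVMMemoryPressure today
DAYS = 63
STEP_INTERVAL_HOURS = 4    # "approximately every 4 hours" (treated as exact here)

steps = DAYS * 24 // STEP_INTERVAL_HOURS   # number of steps in the window
rise = END_PCT - START_PCT                 # total rise in percentage points
per_step = rise / steps
per_day = rise / DAYS


def days_until(threshold_pct, current_pct=END_PCT, rate_per_day=per_day):
    """Days until threshold_pct is reached, assuming straight-line growth.

    Hypothetical helper: real heap usage may plateau or jump instead.
    """
    return max(0.0, (threshold_pct - current_pct) / rate_per_day)


print(f"{steps} steps, {per_step:.3f} points per step, {per_day:.2f} points per day")
print(f"days until 80% (illustrative threshold): {days_until(80.0):.1f}")
```

A linear extrapolation like this only helps decide how soon to act; it says nothing about the cause.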

Could Snowstorm be performing some regular, routine process that is leading to the buildup of objects in Elasticsearch and causing a memory leak?

I'm using Snowstorm 2.1.0 and Elasticsearch 6.3.

nhnicwaller commented 5 years ago

For what it's worth, restarting Elasticsearch (by re-composing the cluster) caused JVMMemoryPressure to drop and become nominal again, for now.

kaicode commented 5 years ago

Hi @nhnicwaller, thanks for raising this.

Have you seen the AWS Elasticsearch instance run out of memory? I wonder if the memory pressure will level out once Elasticsearch is using what it needs.

At this time Snowstorm has no routine processes, and I don't think there are any in the spring-data-elasticsearch framework that we are using either. To test this, I started an instance of Snowstorm 2.1.0, then immediately shut down Elasticsearch and left it for 5 hours. So far Snowstorm has not noticed that Elasticsearch is down, and I don't expect it ever to notice, because it doesn't hit the Elasticsearch REST API until content is requested.

What is the usage pattern? Is content being requested from Snowstorm at a fairly steady rate while the AWS Elasticsearch instance memory usage climbs? I'm wondering if Snowstorm has left an Elasticsearch scroll request open, although these should time out on their own.
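
One way to check for lingering scroll requests is the Elasticsearch node stats API, whose search section reports `open_contexts` and `scroll_current` per node. The following is a minimal sketch, assuming the domain endpoint accepts unsigned HTTP requests (many AWS setups require signed requests instead); `fetch_search_stats`, `open_contexts_by_node`, and the sample payload are illustrative and are not part of Snowstorm.

```python
import json
import urllib.request


def fetch_search_stats(base_url):
    """Fetch search stats for every node (not called here; needs a live cluster)."""
    url = f"{base_url}/_nodes/stats/indices/search"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def open_contexts_by_node(stats):
    """Map node name -> (open search contexts, currently open scrolls)."""
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        search = node["indices"]["search"]
        result[node.get("name", node_id)] = (
            search["open_contexts"],
            search.get("scroll_current", 0),
        )
    return result


# Made-up payload in the shape the node stats API returns.
sample = {
    "nodes": {
        "abc123": {
            "name": "node-a",
            "indices": {"search": {"open_contexts": 2, "scroll_current": 2}},
        },
        "def456": {
            "name": "node-b",
            "indices": {"search": {"open_contexts": 0, "scroll_current": 0}},
        },
    }
}

print(open_contexts_by_node(sample))
```

Counts that stay above zero while no searches are running would point towards scrolls being left open.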

Elasticsearch does have native housekeeping which runs on a scheduled basis. I wonder if the Elasticsearch instance would behave the same way without any requests coming in.

Keep us posted.

nhnicwaller commented 5 years ago

I did not see the AWS Elasticsearch run out of memory to the point of failure, nor did I see memory usage begin to plateau. It's possible that it would have plateaued if I had left it alone, but this is the memory threshold at which I've started to see reliability issues in other ES clusters so I felt that some action was required.

Usage pattern:

A constant level of HTTP 2XX responses (with no 3XX, 4XX, or 5XX), between 60 and 90 per hour. This might just be a healthcheck on the cluster.
All of these hold steady at 0: DiskQueueDepth, ElasticsearchRequests, IndexingRate
ReadIOPS and ReadThroughput have occasional spikes, perhaps daily
SearchableDocuments holds steady at 11.8M
SearchRate holds steady at 0

I guess the next thing for me to do is stop our Snowstorm process and see if I can still observe the same behaviour from AWS Elasticsearch. If so, we could rule out Snowstorm itself.

nhnicwaller commented 5 years ago

I halted my Snowstorm process and observed that JVMMemoryPressure in AWS-ES continued climbing as before. That seems to rule out the possibility that Snowstorm itself is involved, so I'll close this issue.
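
A comparison like this can be made concrete by fitting a slope to JVMMemoryPressure samples taken before and after stopping Snowstorm. The sketch below uses made-up sample values, not the metrics from this cluster, and the tolerance is an arbitrary illustrative choice.

```python
def slope(samples):
    """Least-squares slope of (hours, pressure_pct) pairs, in points per hour."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    return num / den


# Hypothetical readings, one per 4-hour step.
with_snowstorm = [(0, 40.0), (4, 40.1), (8, 40.2), (12, 40.3)]
without_snowstorm = [(0, 41.0), (4, 41.1), (8, 41.2), (12, 41.3)]

before = slope(with_snowstorm)
after = slope(without_snowstorm)

# If the growth rate is essentially unchanged, the client is unlikely to be the cause.
TOLERANCE = 0.005  # points per hour, arbitrary
unchanged = abs(before - after) < TOLERANCE
print(f"before={before:.4f}/h after={after:.4f}/h unchanged={unchanged}")
```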

kaicode commented 5 years ago

@nhnicwaller thank you for giving us a definitive answer on the role of Snowstorm in this case. If you gain any insights into your AWS-ES hosting or configuration that you would like to share here for the benefit of others, please feel free.