IHTSDO / snowstorm

Scalable SNOMED CT Terminology Server using Elasticsearch

RF2 Import exception with expired ES Scrolls #258

Open rowanvanbeckhoven opened 3 years ago

rowanvanbeckhoven commented 3 years ago

During the initial import of the RF2 snapshot, the Elasticsearch cluster fails to return a scroll to Snowstorm because the scroll has already expired, which causes an exception in Snowstorm.

2021-04-21 13:47:31.923  WARN 1 --- [ool-9-thread-15] o.s.d.e.core.ElasticsearchRestTemplate   : Could not clear scroll: Unable to parse response body; nested exception is ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [DELETE], host [XXX], URI [/_search/scroll], status line [HTTP/1.1 404 Not Found]
{"succeeded":true,"num_freed":0}];
2021-04-21 13:47:31.927 ERROR 1 --- [ool-9-thread-15] o.ihtsdo.otf.snomedboot.ReleaseImporter  : Failed to read or process lines.

org.springframework.data.elasticsearch.UncategorizedElasticsearchException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]; nested: ElasticsearchException[Elasticsearch exception [type=search_context_missing_exception, reason=No search context found for id [323524]]];
        at org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:67)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.translateException(ElasticsearchRestTemplate.java:398)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:381)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.searchScrollContinue(ElasticsearchRestTemplate.java:318)
        at org.springframework.data.elasticsearch.core.AbstractElasticsearchTemplate.lambda$searchForStream$2(AbstractElasticsearchTemplate.java:267)
        at org.springframework.data.elasticsearch.core.StreamQueries$1.hasNext(StreamQueries.java:104)
        at java.util.Iterator.forEachRemaining(Iterator.java:115)
        at org.snomed.snowstorm.core.data.services.ReferenceSetMemberService.doSaveBatchMembers(ReferenceSetMemberService.java:286)
        at org.snomed.snowstorm.core.rf2.rf2import.ImportComponentFactoryImpl$4.persistCollection(ImportComponentFactoryImpl.java:114)
        at org.snomed.snowstorm.core.rf2.rf2import.ImportComponentFactoryImpl$PersistBuffer.flush(ImportComponentFactoryImpl.java:327)
        at org.snomed.snowstorm.core.rf2.rf2import.ImportComponentFactoryImpl$PersistBuffer.save(ImportComponentFactoryImpl.java:322)
        at org.snomed.snowstorm.core.rf2.rf2import.ImportComponentFactoryImpl.newReferenceSetMemberState(ImportComponentFactoryImpl.java:292)
        at org.ihtsdo.otf.snomedboot.ReleaseImporter$ImportRun.lambda$loadRefsets$4(ReleaseImporter.java:508)
        at org.ihtsdo.otf.snomedboot.ReleaseImporter$ImportRun.readLines(ReleaseImporter.java:615)
        at org.ihtsdo.otf.snomedboot.ReleaseImporter$ImportRun.lambda$readLinesCallable$5(ReleaseImporter.java:527)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
        at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177)
        at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1888)
        at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1865)
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1622)
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1579)
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1549)
        at org.elasticsearch.client.RestHighLevelClient.scroll(RestHighLevelClient.java:1174)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.lambda$searchScrollContinue$13(ElasticsearchRestTemplate.java:318)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:379)
        ... 16 common frames omitted
        Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [XXX], URI [/_search/scroll], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [323524]"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [323524]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [323524]"}},"status":404}
                at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283)
                at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261)
                at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
                at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1609)
                ... 21 common frames omitted
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=search_context_missing_exception, reason=No search context found for id [323524]]
        at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
        at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
        at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
        at org.elasticsearch.ElasticsearchException.failureFromXContent(ElasticsearchException.java:603)
        at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:169)
        ... 24 common frames omitted

I've tried setting elasticvc.batch.size to a smaller value, but this does not seem to have any effect. The Elasticsearch cluster I'm using is an AWS Elasticsearch Service cluster with 3 nodes. I've tried both version 7.7 and 7.9 of ES, but neither seems to make a difference either.

This exception mostly occurs during the final phase of the import, but it also occurs during the initial load of the data (saving batch of x).

Following the suggestions made in https://github.com/elastic/elasticsearch/issues/65381, the scroll keep-alive should be extended so that the search context is kept for a longer duration. I can't seem to find how to configure this, and I'm not entirely sure whether this is the right way to fix the issue.
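For context: at the raw Elasticsearch API level, the keep-alive is the scroll parameter passed on the initial search and renewed on every follow-up scroll request. A minimal sketch of that API (host, index name, and batch size are placeholders; Snowstorm drives this through Spring Data Elasticsearch rather than calling it directly):

```shell
# Open a scrolled search, keeping the search context alive for 5 minutes
curl -X POST "localhost:9200/my-index/_search?scroll=5m" \
  -H 'Content-Type: application/json' \
  -d '{"size": 1000, "query": {"match_all": {}}}'

# Continue the scroll; each call renews the keep-alive for another 5 minutes
curl -X POST "localhost:9200/_search/scroll" \
  -H 'Content-Type: application/json' \
  -d '{"scroll": "5m", "scroll_id": "<scroll id from the previous response>"}'
```

If a scroll request arrives after the keep-alive has elapsed, Elasticsearch discards the search context and returns the search_context_missing_exception seen in the stack trace above.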

kaicode commented 3 years ago

Hi @rowanvanbeckhoven,

I'm sorry to hear you are having issues getting your initial data load to complete. Thanks for reaching out with your problem and including lots of detail.

This happens when the Elasticsearch cluster cannot index documents as fast as Snowstorm's default configuration expects. There is documentation online about performance considerations for Elasticsearch, but the initial data load is a very heavy one-off process, so it's best to simply work around it.

Please set the elasticvc.save.batch-size configuration option. The default is 5000, so you could try something like 1000. For example:

java -Xms4g -jar target/snowstorm-6.2.1.jar --elasticvc.save.batch-size=1000 --delete-indices --import=/opt/releases/SnomedCT_InternationalRF2_PRODUCTION_20210131T120000Z.zip

You will see this option take effect as soon as the import starts and the concept documents are saved:

2021-04-22 11:51:09.956  INFO 28098 --- [           main] io.kaicode.elasticvc.api.BranchService   : Open commit on MAIN at 1619088669956
2021-04-22 11:51:09.957  INFO 28098 --- [           main] o.ihtsdo.otf.snomedboot.ReleaseImporter  : Reading concepts 
2021-04-22 11:51:10.141  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:10.575  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:10.825  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.002  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.195  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.620  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.810  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts

This will also be applied to the documents which are saved into the semantic index right at the end of the import, which sounds like where your import is failing.

I hope this helps! Kind regards, Kai Kewley

rowanvanbeckhoven commented 3 years ago

Ah, this is exactly what I meant by the 'elasticvc.batch.size' I mentioned earlier. I've tried it with exactly those parameters, but unfortunately this doesn't fix the issue.

kaicode commented 3 years ago

In your logs, do you see the size of the saved batches change to the size you set, as in the example I gave? I'm wondering whether the configuration has taken effect properly.

rowanvanbeckhoven commented 3 years ago

Yeah, it works properly:

2021-04-21 13:44:59.606  INFO 1 --- [pool-9-thread-8] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 ReferenceSetMembers
2021-04-21 13:44:59.735  INFO 1 --- [pool-9-thread-8] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 ReferenceSetMembers
2021-04-21 13:44:59.883  INFO 1 --- [pool-9-thread-8] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 ReferenceSetMembers
2021-04-21 13:47:31.923  WARN 1 --- [ool-9-thread-15] o.s.d.e.core.ElasticsearchRestTemplate   : Could not clear scroll: Unable to parse response body; nested exception is ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [DELETE], host [XXX], URI [/_search/scroll], status line [HTTP/1.1 404 Not Found]
{"succeeded":true,"num_freed":0}];
2021-04-21 13:47:31.927 ERROR 1 --- [ool-9-thread-15] o.ihtsdo.otf.snomedboot.ReleaseImporter  : Failed to read or process lines.

org.springframework.data.elasticsearch.UncategorizedElasticsearchException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]; nested: ElasticsearchException[Elasticsearch exception [type=search_context_missing_exception, reason=No search context found for id [323524]]];
        at org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:67)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.translateException(ElasticsearchRestTemplate.java:398)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:381)
        at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.searchScrollContinue(ElasticsearchRestTemplate.java:318)
        at org.springframework.data.elasticsearch.core.AbstractElasticsearchTemplate.lambda$searchForStream$2(AbstractElasticsearchTemplate.java:267)
        at org.springframework.data.elasticsearch.core.StreamQueries$1.hasNext(StreamQueries.java:104)
        at java.util.Iterator.forEachRemaining(Iterator.java:115)

kaicode commented 3 years ago

You could set the batch size even lower, say 500, but I'm wondering what else is going on to cause this. I would take a look at your Elasticsearch logs to see what they say; the cluster doesn't seem very healthy. You could also try the Cluster Health API.
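The Cluster Health API mentioned above can be queried directly; a quick sketch (the host is a placeholder for your AWS Elasticsearch endpoint):

```shell
# Overall cluster status (green/yellow/red), node count, unassigned shards
curl -s "localhost:9200/_cluster/health?pretty"

# Per-index health can help pinpoint which indices are degraded
curl -s "localhost:9200/_cluster/health?level=indices&pretty"
```

A yellow status usually means replica shards are unassigned; red means primary shards are missing, which would explain failed scroll continuations during the import.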

rowanvanbeckhoven commented 3 years ago

Yeah, I assumed that as well. There were some short moments where one of the nodes was down, causing a yellow status on the cluster. These downtimes were roughly 1 minute each and did not involve the master node (according to AWS). Do you happen to know if there's a config parameter I can use to override the default scroll lifespan?

Besides, it seems probable that a node could be unreachable at some point in time. It would be nice if the import had some fallback that tolerated such timeouts without relying on the continued existence of a scroll context. I'm not fully aware of how Snowstorm uses the ES Scroll API though, so it could be more difficult than I imagine at this point.

kaicode commented 3 years ago

Snowstorm does not interface with the Elasticsearch Scroll API directly, but rather uses the Spring Data Elasticsearch framework. Overriding the scroll lifespan may be possible, but it shouldn't be necessary if Elasticsearch is stable. If you have the option of provisioning a more powerful cluster for the initial import, and then downgrading once complete, that could be a good strategy.