rowanvanbeckhoven opened this issue 3 years ago
Hi @rowanvanbeckhoven,
I'm sorry to hear you are having issues getting your initial data load to complete. Thanks for reaching out with your problem and including lots of detail.
This happens when the Elasticsearch cluster cannot index documents as fast as Snowstorm's default configuration expects. There is documentation online about performance considerations for Elasticsearch, but the initial data load is a very heavy one-off process, so it's best to just work around it.
Please set the `elasticvc.save.batch-size` configuration option. The default is 5000, so you could try something like 1000.
For example:
```shell
java -Xms4g -jar target/snowstorm-6.2.1.jar --elasticvc.save.batch-size=1000 --delete-indices --import=/opt/releases/SnomedCT_InternationalRF2_PRODUCTION_20210131T120000Z.zip
```
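Alternatively, since Snowstorm is a Spring Boot application, the same property can be picked up from an `application.properties` file in the working directory instead of being passed on the command line (a sketch relying on Spring Boot's standard externalised-configuration mechanism; paths and jar version as in the command above):

```shell
# Equivalent to passing --elasticvc.save.batch-size=1000 on the command line:
echo "elasticvc.save.batch-size=1000" >> application.properties
java -Xms4g -jar target/snowstorm-6.2.1.jar --delete-indices \
  --import=/opt/releases/SnomedCT_InternationalRF2_PRODUCTION_20210131T120000Z.zip
```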
You will see this option take effect as soon as the import starts and the concept documents are saved:
```
2021-04-22 11:51:09.956  INFO 28098 --- [           main] io.kaicode.elasticvc.api.BranchService   : Open commit on MAIN at 1619088669956
2021-04-22 11:51:09.957  INFO 28098 --- [           main] o.ihtsdo.otf.snomedboot.ReleaseImporter  : Reading concepts
2021-04-22 11:51:10.141  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:10.575  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:10.825  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.002  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.195  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.620  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
2021-04-22 11:51:11.810  INFO 28098 --- [           main] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 Concepts
```
This will also be applied to the documents which are saved into the semantic index right at the end of the import, which sounds like where your import is failing.
I hope this helps! Kind regards, Kai Kewley
Ah, this is exactly what I meant by the `elasticvc.batch.size` option I mentioned earlier. I've tried it with exactly those parameters, but unfortunately it doesn't fix the issue.
In your logs, do you see the size of the batches being saved change to the size you set, as in the example I gave? I'm wondering whether the configuration has taken effect properly.
Yeah, it works properly:
```
2021-04-21 13:44:59.606  INFO 1 --- [pool-9-thread-8] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 ReferenceSetMembers
2021-04-21 13:44:59.735  INFO 1 --- [pool-9-thread-8] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 ReferenceSetMembers
2021-04-21 13:44:59.883  INFO 1 --- [pool-9-thread-8] i.k.elasticvc.api.ComponentService       : Saving batch of 1000 ReferenceSetMembers
2021-04-21 13:47:31.923  WARN 1 --- [ool-9-thread-15] o.s.d.e.core.ElasticsearchRestTemplate   : Could not clear scroll: Unable to parse response body; nested exception is ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [DELETE], host [XXX], URI [/_search/scroll], status line [HTTP/1.1 404 Not Found]
{"succeeded":true,"num_freed":0}];
2021-04-21 13:47:31.927 ERROR 1 --- [ool-9-thread-15] o.ihtsdo.otf.snomedboot.ReleaseImporter  : Failed to read or process lines.
org.springframework.data.elasticsearch.UncategorizedElasticsearchException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]; nested: ElasticsearchException[Elasticsearch exception [type=search_context_missing_exception, reason=No search context found for id [323524]]];
	at org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:67)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.translateException(ElasticsearchRestTemplate.java:398)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:381)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.searchScrollContinue(ElasticsearchRestTemplate.java:318)
	at org.springframework.data.elasticsearch.core.AbstractElasticsearchTemplate.lambda$searchForStream$2(AbstractElasticsearchTemplate.java:267)
	at org.springframework.data.elasticsearch.core.StreamQueries$1.hasNext(StreamQueries.java:104)
	at java.util.Iterator.forEachRemaining(Iterator.java:115)
```
You could set the batch size even lower, say 500, but I'm wondering what else is going on to cause this. I would take a look at your Elasticsearch logs to see what they say; the cluster doesn't seem very healthy. You could also check the Cluster Health API.
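For example, the Cluster Health API can be queried directly with curl (the hostname here is a placeholder for your AWS Elasticsearch Service domain endpoint):

```shell
# Placeholder host; substitute your AWS Elasticsearch Service domain endpoint.
curl -s "https://your-es-domain.es.amazonaws.com/_cluster/health?pretty"
# In the JSON response, look at "status" (green/yellow/red),
# "number_of_nodes" (should stay at 3 for your cluster) and
# "unassigned_shards" (should be 0 on a healthy cluster).
```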
Yeah, I assumed that as well. There were some short moments where one of the nodes was down, causing a yellow status on the cluster. These downtimes lasted roughly a minute each and did not affect the master node (according to AWS). Do you happen to know if there's a config parameter I can use to override the default scroll lifespan?
Besides, it seems plausible that a node could be unreachable at some point in time, so it would be nice if the import had some fallback that tolerates such timeouts without relying on the scroll context still existing. I'm not fully aware of how Snowstorm uses the Scroll API of ES though, so it could be more difficult than I imagine at this point.
Snowstorm does not interface with the Elasticsearch Scroll API directly; it uses the Spring Data Elasticsearch framework. Overriding the scroll lifespan may be possible, but it shouldn't be necessary if Elasticsearch is stable. If you have the option of provisioning a more powerful cluster for the initial import and downgrading once it completes, that could be a good strategy.
During the initial import of the RF2 snapshot, the Elasticsearch cluster is unable to return a scroll to Snowstorm because the scroll context has already expired, which causes an exception in Snowstorm.
I've tried setting `elasticvc.batch.size` lower, but this does not seem to have any effect. The Elasticsearch cluster I'm using is an AWS Elasticsearch Service with 3 nodes. I've tried both version 7.7 and 7.9 of ES, but that does not seem to have any effect either.
The exception mostly occurs during the final phase of the import, but it also occurs during the initial load of the data (`Saving batch of x`).
Following the suggestions made in https://github.com/elastic/elasticsearch/issues/65381, the scroll keep-alive should be extended so that the context is kept for a longer duration. I can't seem to find how to configure this and am not entirely sure if this is the right way to fix the issue.
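For context, the scroll keep-alive in Elasticsearch is not a server-side default but a per-request parameter set by the client (Spring Data Elasticsearch, in Snowstorm's case), with the dynamic cluster setting `search.max_keep_alive` acting as the server-side ceiling. A sketch of both sides, using placeholder host and index names:

```shell
# Per-request keep-alive: the client asks for the scroll context to live 5 minutes.
curl -s "https://your-es-domain.es.amazonaws.com/your-index/_search?scroll=5m" \
  -H 'Content-Type: application/json' \
  -d '{"size": 1000, "query": {"match_all": {}}}'

# Server-side ceiling on any requested keep-alive (dynamic cluster setting,
# default 24h), shown here purely as an illustration:
curl -s -X PUT "https://your-es-domain.es.amazonaws.com/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"search.max_keep_alive": "2h"}}'
```

Note that scroll contexts live on the data nodes holding the shards, so a node restart discards them regardless of the keep-alive, which would match the brief node outages observed here.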