IHTSDO / snowstorm

Scalable SNOMED CT Terminology Server using Elasticsearch

Missing relationships after importing extension #45

Closed dkincaid closed 5 years ago

dkincaid commented 5 years ago

High-level summary: after importing the SNOMED International edition followed by the SNOMED Veterinary Extension, some relationships are missing from the child branch created for the extension. This happens when the extension contains inactive relationships with earlier effective times than the matching relationships in the International edition.

Here are the steps I followed:

  1. Start up a clean instance of Elasticsearch 6.4.2
  2. Start up Snowstorm 2.2.3 using java -Xms2g -Xmx2g -jar target/snowstorm*.jar
  3. Follow the "Loading SNOMED into Snowstorm" guide to import the SNOMED International edition into the MAIN branch
  4. Modify the Veterinary Extension RF2 release files to reformat the effectiveTime in all files to YYYYMMDD format and rezip the release files
  5. Follow the "Loading & updating SNOMED CT with local Extensions or Editions" guide to import the Veterinary Extension into the MAIN/SNOMED-VET branch
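
Step 4 above could be scripted along these lines. This is only a minimal sketch: it assumes the original effectiveTime values start with the eight-digit date and carry a time suffix (the example value `20160130T000000Z` is hypothetical, since the thread does not show the original format), and the function names are made up for illustration.

```python
import re

# Assumption: the date is the first eight digits of the effectiveTime value.
DATE_PREFIX = re.compile(r"^(\d{8})")


def simplify_effective_time(value: str) -> str:
    """Return just the YYYYMMDD prefix, or the value unchanged if there is none."""
    match = DATE_PREFIX.match(value)
    return match.group(1) if match else value


def simplify_rf2_line(line: str) -> str:
    """Rewrite the second tab-separated column (effectiveTime) of one RF2 line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) > 1:
        columns[1] = simplify_effective_time(columns[1])
    return "\t".join(columns)
```

The header row passes through untouched because the literal text `effectiveTime` does not start with eight digits. Each release file would be rewritten line by line with `simplify_rf2_line` before rezipping.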

After finishing this process the issue is seen by calling the findConceptParents endpoint and specifying the following parameters:

branch = MAIN/SNOMED-VET
conceptId = 81260002
form = inferred
Accept-Language = en-US;q=0.8,en-GB;q=0.6

The response code is 200 and the response body is an empty array ([]).

If I make the same endpoint call but change the branch to MAIN I get one parent returned, conceptId 321351000009104.
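
For anyone wanting to reproduce the comparison outside Swagger, the two calls can be sketched in Python. The base URL `http://localhost:8080` and the `/browser/{branch}/concepts/{conceptId}/parents` path are assumptions about how findConceptParents is exposed in this version; check the Swagger page of your own instance before relying on them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def parents_url(base_url, branch, concept_id, form="inferred"):
    """Build the request URL (the path layout is an assumption, see above)."""
    path = f"/browser/{quote(branch, safe='/')}/concepts/{concept_id}/parents"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'form': form})}"


def find_concept_parents(base_url, branch, concept_id, form="inferred"):
    """Call the endpoint and return the decoded JSON array of parents."""
    request = Request(
        parents_url(base_url, branch, concept_id, form),
        headers={
            "Accept": "application/json",
            "Accept-Language": "en-US;q=0.8,en-GB;q=0.6",
        },
    )
    with urlopen(request) as response:
        return json.load(response)
```

Calling `find_concept_parents("http://localhost:8080", branch, "81260002")` once with `"MAIN"` and once with `"MAIN/SNOMED-VET"` and comparing the two lists would show the difference described here.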

In the SNOMED International edition Relationship file this relationship is present and active with effectiveTime = 20160131:

6412388027  20160131    1   900000000000207008  81260002    321351000009104 0   116680003   900000000000011006  900000000000451002

In the Veterinary Extension Relationship file the relationship also exists, but it is inactive with effectiveTime = 20160130:

739111000009126 20160130    0   332351000009108 81260002    321351000009104 0   116680003   900000000000011006  900000000000451002
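
To make the overlap between the two rows explicit, here is a small Python sketch that parses them using the standard RF2 relationship column order and compares their (source, type, destination) triples. The class and helper names are illustrative only.

```python
from typing import NamedTuple, Tuple


class Relationship(NamedTuple):
    # Column order of an RF2 relationship file.
    id: str
    effective_time: str
    active: str
    module_id: str
    source_id: str
    destination_id: str
    relationship_group: str
    type_id: str
    characteristic_type_id: str
    modifier_id: str


def parse_row(row: str) -> Relationship:
    """Split one whitespace- or tab-separated relationship row into fields."""
    return Relationship(*row.split())


def triple(rel: Relationship) -> Tuple[str, str, str]:
    """The logical relationship, independent of relationship id and module."""
    return (rel.source_id, rel.type_id, rel.destination_id)


international = parse_row(
    "6412388027 20160131 1 900000000000207008 81260002 321351000009104 "
    "0 116680003 900000000000011006 900000000000451002"
)
vet = parse_row(
    "739111000009126 20160130 0 332351000009108 81260002 321351000009104 "
    "0 116680003 900000000000011006 900000000000451002"
)

# Same triple, different relationship ids, opposite active flags.
print(triple(international) == triple(vet), international.id != vet.id)
```

The two rows describe the same triple under different relationship ids and modules: one active, and one inactive with an earlier effectiveTime.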

I am also attaching the log output from the import of the extension file.

vetext-snowstorm-import-log.txt

Please let me know if there is any other information I can provide or troubleshooting I can help with.

kaicode commented 5 years ago

The Snowstorm "semantic index" is an index of all concept parents, ancestors, attributes and attribute groups. This is used to answer the findConceptParents call and other hierarchy and ECL queries.

Initial thoughts:

There is a workaround for this scenario. Could you try rebuilding the semantic index please? This can be done using the rebuildBranchTransitiveClosure function under the Concepts area of Swagger. (This will move to the new Admin area of Swagger in v3.x)

kaicode commented 5 years ago

There are a few duplicate triples like this in the International Snapshot. When building the semantic index we sort by effectiveTime and active to get the most effective relationships in the right order for processing, but it looks like the handling of duplicate triples with different relationship ids is not working when importing a delta. I would be interested to hear if rebuilding the semantic index solves this.

dkincaid commented 5 years ago

I just ran the rebuild. Now I do get back that parent when I query the MAIN/SNOMED-VET endpoint, but it is very slow to return (like 6-7 seconds). Before it was pretty much instantaneous. I also see this log message output when I call that endpoint now:

2019-04-30 12:04:32.967  WARN 2794 --- [/O dispatcher 1] org.elasticsearch.client.RestClient      : request [GET http://localhost:9200/es-query/query-concept/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&search_type=dfs_query_then_fetch&batched_reduce_size=512] returned 1 warnings: 
[299 Elasticsearch-6.4.2-04711c2 "Deprecated: the number of terms [699096] used in the Terms Query
 request has exceeded the allowed maximum of [65536]. This maximum can be set by changing the 
[index.max_terms_count] index level setting." "Tue, 30 Apr 2019 17:04:25 GMT"]

That seems surprising for just a single concept parent query.

kaicode commented 5 years ago

Thanks for trying that. I'm glad you are getting the desired parents back now. The semantic index rebuild on this branch had quite a performance impact didn't it!

It's slower now because we now have two full semantic indexes sitting on top of each other: one on MAIN and the other on the MAIN/SNOMED-VET branch. The large query and slower query time are because the query clause is excluding all the concepts in the MAIN semantic index after they were all replaced when it was rebuilt on SNOMED-VET. Just about the only weakness of Snowstorm is that if you replace tens of thousands of components on branches other than MAIN, things will start to slow down.

I've marked this down as a bug. It's going to take some thought to solve this without impacting the performance of the incremental semantic index update. Thanks for reporting the issue.

kaicode commented 5 years ago

@dkincaid If you would like this working now, another workaround you could try is to import the vet extension into MAIN, then rebuild the semantic index on MAIN and just not use the SNOMED-VET branch. That should give you fast, consistent results until this bug can be fixed.

kaicode commented 5 years ago

Hi @dkincaid,

In version 4.1.0 of Snowstorm we have updated the semantic index update function to use all active triples (source, type and destination concept) when processing each relationship change. This was necessary because in the US Edition there are over one hundred cases of triples being made inactive in the US module straight after the same triple is made active in the International module. The inactivation in the US module is done using a different relationship id but Snowstorm was making the triple inactive until this fix.
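
The difference between the old and new behaviour can be illustrated with a toy model. This is not Snowstorm's actual code, only a sketch of the idea: the naive update treats every inactive row in a delta as a removal of its triple, while the triple-aware update keeps a triple as long as any relationship id carrying it is still active. All names here are hypothetical.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]


class Rel(NamedTuple):
    id: str
    effective_time: str
    active: bool
    source_id: str
    type_id: str
    destination_id: str

    @property
    def triple(self) -> Triple:
        return (self.source_id, self.type_id, self.destination_id)


def naive_delta_update(index: Set[Triple], delta: Iterable[Rel]) -> Set[Triple]:
    """Apply each delta row on its own: active adds, inactive removes."""
    result = set(index)
    for rel in delta:
        if rel.active:
            result.add(rel.triple)
        else:
            result.discard(rel.triple)
    return result


def triple_aware_update(
    index: Set[Triple], existing: Iterable[Rel], delta: Iterable[Rel]
) -> Set[Triple]:
    """Keep a changed triple if any relationship id for it is still active."""
    delta = list(delta)
    latest = {}
    for rel in sorted([*existing, *delta], key=lambda r: r.effective_time):
        latest[rel.id] = rel  # newest version of each relationship id wins
    result = set(index)
    for changed in {rel.triple for rel in delta}:
        if any(r.active for r in latest.values() if r.triple == changed):
            result.add(changed)
        else:
            result.discard(changed)
    return result


intl = Rel("6412388027", "20160131", True, "81260002", "116680003", "321351000009104")
vet = Rel("739111000009126", "20160130", False, "81260002", "116680003", "321351000009104")

print(naive_delta_update({intl.triple}, [vet]))           # set(): parent lost
print(triple_aware_update({intl.triple}, [intl], [vet]))  # parent retained
```

With the two rows from this issue, the naive update drops the parent of 81260002 when the extension delta is applied, whereas the triple-aware update retains it because the International relationship is still active.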

This should also fix the issue you were seeing where relationships were going missing because I believe this was happening for the very same reason. This fix should give you accurate child/parent/ECL results straight after the RF2 import. The workaround we tried before gives me confidence that v4.1.0 (or later) will work for you without wrecking your performance.

I just thought I should let you know in case you have time to try it again. I recommend deleting all your Snowstorm Elasticsearch indexes and starting afresh, because some of the index mappings have changed to support better non-English search and other features. We still require just the date in the effectiveTime field, so remember to simplify those if you do import the Vet Extension.

I hope you are tempted to try! 😄

Kind regards, Kai

kaicode commented 5 years ago

Closing this ticket because I believe it's fixed in 4.1.0. Please add comments or reopen the ticket as required.