IHTSDO / snomed-database-loader

Represent SNOMED CT in a different types of databases
Other
107 stars 74 forks source link

[MySQL] Error if wanting to generate the Transitive Closure File #54

Open rjalexa opened 2 years ago

rjalexa commented 2 years ago
MySQL 8
SnomedCT_InternationalRF2_PRODUCTION_20221031T120000Z.zip input file

script invoked as: ./load_release.sh ../../../SNOMEDCT_INT_202210/SnomedCT_InternationalRF2_PRODUCTION_20221031T120000Z.zip snomedct20221031 FULL

If I reply Yes to the Calculate and store inferred transitive closure? [Y/N]: I see the following happening:

Including transitive closure table - transclos
Archive:  ../../../SNOMEDCT_INT_202210/SnomedCT_InternationalRF2_PRODUCTION_20221031T120000Z.zip
  inflating: tmp_extracted/._Full    
  inflating: tmp_extracted/._Refset  
  inflating: tmp_extracted/._Terminology  
  inflating: tmp_extracted/._Content  
  inflating: tmp_extracted/._Language  
  inflating: tmp_extracted/._Map     
  inflating: tmp_extracted/._Metadata  
  inflating: tmp_extracted/sct2_RelationshipConcreteValues_Full_INT_20221031.txt  
  inflating: tmp_extracted/sct2_sRefset_OWLExpressionFull_INT_20221031.txt  
  inflating: tmp_extracted/sct2_TextDefinition_Full-en_INT_20221031.txt  
  inflating: tmp_extracted/sct2_StatedRelationship_Full_INT_20221031.txt  
  inflating: tmp_extracted/sct2_Description_Full-en_INT_20221031.txt  
  inflating: tmp_extracted/sct2_Relationship_Full_INT_20221031.txt  
  inflating: tmp_extracted/sct2_Identifier_Full_INT_20221031.txt  
  inflating: tmp_extracted/sct2_Concept_Full_INT_20221031.txt  
  inflating: tmp_extracted/der2_cRefset_AssociationFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_cRefset_AttributeValueFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_Refset_SimpleFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_cRefset_LanguageFull-en_INT_20221031.txt  
  inflating: tmp_extracted/der2_sRefset_SimpleMapFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_iisssccRefset_ExtendedMapFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_cissccRefset_MRCMAttributeDomainFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_ciRefset_DescriptionTypeFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_cciRefset_RefsetDescriptorFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_ssRefset_ModuleDependencyFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_cRefset_MRCMModuleScopeFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_ssccRefset_MRCMAttributeRangeFull_INT_20221031.txt  
  inflating: tmp_extracted/der2_sssssssRefset_MRCMDomainFull_INT_20221031.txt  

Generating Environment script for FULL type(s)

Generating loading script for 20221031
Unable to find der2_iissscRefset_ComplexMapFull_INT_20221031.txt or beta version, skipping...
Skipping
mysql: [Warning] Using a password on the command line interface can be insecure.

Ensuring schema snomedct20221031 exists

(re)Creating Schema using tmp_environment-mysql.sql
Generating Transitive Closure file...
can't open tmp_extracted/xsct2_Relationship_Snapshot_INT_20221031.txt at ./transitiveClosureRf2Snap_dbCompatible.pl line 65.
pgwilliams commented 2 years ago

Oh yes, I think I see that. You've chosen to only import the Full files, but the transitive closure routine works with the Snapshot file. I've always imported 'A' for all files, and then the Snapshot file will be found. The reason that the xsct variant is being complained about is because it checks for sct2 first, and then xsct2 so the error message is being a bit lazy there.

I should add in a check that refuses to generate the transitive closure if the Snapshot files are not extracted, but in the meantime, please either select Snapshot or All if you want a transitive closure generated.

rjalexa commented 2 years ago

Bear with my ignorance please. I can only download the file mentioned in the first post following the SNOMED page links (https://uts.nlm.nih.gov/uts/). So what is the meaning of your suggestion on A or Snapshot? Not sure I know how to get the latter. Thanks a lot for your help.

pgwilliams commented 2 years ago

Oh sorry, I said select 'A' but I see that this code wants you to write out 'ALL' for your command line argument, instead of 'FULL' as you've done above here. Alternatively SNAPSHOT will be quicker if you don't need to know about recent or historical changes.

The zip file should contain 3 sections, one for Delta (which is just the changes since the last release), the Snapshot (which is the current state of the ontology) and the Full files (which contain the complete history of the product dating back to 2002). Unless you're doing something very specific with historical records, or tracking changes to the product over time, you're more likely to want the Snapshot. That's SNOMED CT as it exists today.

"Full" is perhaps a misleading term for what those files are - they're not just everything that is, they're also everything that was.

rjalexa commented 2 years ago

Very clear. Thanks. I therefore invoked the script as follows: ./load_release.sh ../../../SNOMEDCT_INT_202210/SnomedCT_InternationalRF2_PRODUCTION_20221031T120000Z.zip snomedct20221031 SNAP and works perfectly. Also thanks to your indication I now have well understood I really want the snapshot since I'm not interested in the historical dimension. Thanks a lot !!!! PS Oh how I wish SNOMED was available in RDF/OWL !!!! :)

pgwilliams commented 2 years ago

Oh but it is! Or can be, with a transformation. See https://github.com/IHTSDO/snomed-owl-toolkit

Enjoy!

rjalexa commented 2 years ago

One last question if I may. The loading went smoothly and have SNOMED's tables in the database. Using DBVisualizer I tried generating the ER view which shows the tables but no relationships between them. Maybe the load process should generate foreign keys and doesn't?

pgwilliams commented 2 years ago

Yes that would be a nice feature. We use Elasticsearch internally, so such improvements wouldn't have a direct business benefit for us I'm afraid. Hopefully someone will contribute such a thing as a pull request.

Here's a rough ER diagram showing the links between the core component types : https://confluence.ihtsdotools.org/display/DOCSTART/5.+SNOMED+CT+Logical+Model but also various refset members link to referenced concepts. If you have any specific questions I can answer those, or if it's just a more general "what's going on", then we have a ton of free education material available.

rjalexa commented 2 years ago

The OWL transformation tool worked flawlessly and personally loading it into Protegè to browse it gives me su much more insight to the monumental work than perusing the relational tables which at this point I'll probably drop. I wonder how come RDF is not THE main format you use to disseminate this. So much more expressive than a relational schema IMHO. Thanks a lot for your help. PS I kind of sense the presence of the ghost :) of Barry Smith the ontologist behind BFO ... am I right? :) PPS Stumbling on the concept of "Hang Glider" made my day. SNOMED almost describes all possible reality :) :)

pgwilliams commented 2 years ago

Hi Robert. Yes there's no doubt that OWL is much more expressive (especially in terms of nesting), and a number of compromises were made in order to fit the simplified distribution standard of RF2. However, as I said in my email, RF2 does also give us a number of advantages around tracking changes over time and information that cannot be expressed in OWL and file size efficiency was more of a concern back in the day. That said, we do now use OWL (albeit chopped up into a Reference Set) for our Stated View.

We have had meetings with the BFO crowd and put substantial thought into aligning SNOMED CT with BFO. I can point you in the direction of this discussion if you're interested in reading further. But no, Barry Smith was not involved directly with the design of SNOMED CT as far as I'm aware.

Yes I often post a "Concept of the Week" on a Friday for my colleagues. My pick for Halloween was 111946000 |Burial alive (finding)|

rjalexa commented 2 years ago

So SNOMED aims to exist under a closed world assumption: what is not there does not exist :) :) (joking of course) Thanks a lot.

pgwilliams commented 2 years ago

No, it's an important point. It's one of the main distinctions with a classification like ICD where everything must exist in exactly one bucket, and all the buckets add up to 100%.