Open filak opened 3 weeks ago
Hello, I am looking into the problems you reported with the 2024 MeSH RDF and will update this issue once I have investigated them. For future reference, please submit issues and bugs to https://support.nlm.nih.gov/, as we will be retiring the GitHub Issues section and directing all customer requests to our Help Desk for faster responses.
Firstly, thank you for all the hard work on making the transformation.
I recently tried to run the script on MeSH 2024 in preparation for the 2025 version, and I observed some issues:
I had to disable the message "Warning: literal value that has leading or trailing whitespace" - as it was polluting the output
I have also replaced the related whitespace removal code with
I have updated all the XSL files to UTF-8 encoding
When running the script on the data, there were some errors while converting the supp2024.xml file. It seems that suppl.xsl is outdated. The Transformation rule: SCRClass does not have options for values 5 and 6, so I added SCR_Population and SCR_Anatomy, though these are not yet defined in vocabulary.owl
I do not know what the OASIS_CATALOG variable is used for. I have not found any mention of it in the docs, though the script works fine without setting it
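To see which SCRClass values a supplemental file actually contains before touching suppl.xsl, a small check like the following may help. It is only a sketch: it assumes SCRClass is an attribute on each SupplementalRecord element, and that the existing rule covers values 1 to 4 (inferred from the report that 5 and 6 are the ones without options). The function names and the sample XML are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the current suppl.xsl rule handles SCRClass 1-4 only.
HANDLED_SCR_CLASSES = {"1", "2", "3", "4"}


def count_scr_classes(xml_source):
    """Count SCRClass attribute values on SupplementalRecord elements.

    xml_source may be a file path or a binary file-like object.
    """
    counts = Counter()
    # iterparse streams the document, so a large supp file is not loaded at once.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "SupplementalRecord":
            counts[elem.get("SCRClass", "<missing>")] += 1
            elem.clear()  # release the children of the finished record
    return counts


def unhandled_scr_classes(counts, handled=HANDLED_SCR_CLASSES):
    """Return only the values the stylesheet is assumed not to cover."""
    return {value: n for value, n in counts.items() if value not in handled}


if __name__ == "__main__":
    # Tiny illustrative document, not real MeSH data.
    sample = io.BytesIO(
        b"<SupplementalRecordSet>"
        b'<SupplementalRecord SCRClass="1"/>'
        b'<SupplementalRecord SCRClass="5"/>'
        b'<SupplementalRecord SCRClass="6"/>'
        b"</SupplementalRecordSet>"
    )
    print(unhandled_scr_classes(count_scr_classes(sample)))
```

Passing the path to supp2024.xml instead of the in-memory sample would report how many records fall outside the assumed handled set.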
Finally, I got the final mesh.nt.gz, compared it to the official 2024 version, and found some differences:
Whitespace in values: the official dataset contains values with leading or trailing whitespace, while the new dataset has none, which is OK
The http://id.nlm.nih.gov/mesh/vocab#active triples are missing in the new dataset - this is not OK
The "inactive" items (http://id.nlm.nih.gov/mesh/vocab#active false) are missing completely, for example http://id.nlm.nih.gov/mesh/D013749. These records are missing in the source XML. How can I get these records into the dataset?
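The comparison above can be reproduced roughly by counting predicates in both N-Triples files and listing those that appear in the official file but never in the rebuilt one. This is a minimal sketch, not the method used for the report: the helper names are hypothetical, the parsing relies on subjects and predicates in N-Triples containing no whitespace, and the sample triples in the usage part are illustrative only.

```python
import gzip
from collections import Counter

ACTIVE = "<http://id.nlm.nih.gov/mesh/vocab#active>"


def read_nt_gz(path):
    """Yield the lines of a gzip-compressed N-Triples file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def predicate_counts(lines):
    """Count how often each predicate occurs in an iterable of N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace, so two splits suffice.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


def missing_predicates(official, rebuilt):
    """Predicates present in the official counts but absent from the rebuilt ones."""
    return {p: n for p, n in official.items() if rebuilt.get(p, 0) == 0}


if __name__ == "__main__":
    # Illustrative lines; real use would pass read_nt_gz("mesh.nt.gz") instead.
    official = predicate_counts([
        '<http://id.nlm.nih.gov/mesh/D013749> ' + ACTIVE + ' "false" .',
        '<http://example.org/s> <http://example.org/p> "x" .',
    ])
    rebuilt = predicate_counts([
        '<http://example.org/s> <http://example.org/p> "x" .',
    ])
    print(missing_predicates(official, rebuilt))
```

Counting by predicate only shows that a whole class of triples is gone; finding individual missing records such as D013749 would additionally need a comparison of subjects.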
I have pushed all the updates into the forked repo - https://github.com/filak/meshrdf/tree/fix-2024
Could you please give some advice on these issues, at least 6. and 7. (the missing vocab#active triples and the missing inactive records), because I definitely need this data in the dataset.