NCI-Semantic-Infrastructure / shared-si-issues

Umbrella repo for Shared SI Service issues
BSD 3-Clause "New" or "Revised" License

"unknown" prefixes in caDSR RDF #17

Open gilaim07 opened 8 months ago

gilaim07 commented 8 months ago

[Noted sometime in Dec 26-29] The source of some vocabulary codes in caDSR is "unknown", although for most of those codes it's easy to figure out which vocab they belong to. Will review the code to see whether some regex or source name is funky.

gilaim07 commented 8 months ago

Turns out I had been looking at an old file where the prefix work was unfinished. However, it seems that the caDSR folks have started using all-caps for the source names. There are about 170 referenced concepts with the source in all-caps referencing the NCIt. The prefix assignment fails in this case. The source comparison for prefix assignment needs to be done case-insensitively.
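For illustration, a minimal sketch of what a case-insensitive source comparison could look like. This is not the de2x code: the mapping, the source name strings, the "ncit" prefix and the function name prefix_for_source are all assumptions made for the example.

```python
# Hypothetical mapping from caDSR source names to RDF prefixes.
# The actual names and prefixes used by de2x may differ.
SOURCE_PREFIXES = {
    "NCI Thesaurus": "ncit",
    "NCI": "ncit",  # assumption: a plain "NCI" source is treated as NCIt
}

# Normalize the keys once so that lookups ignore case and stray whitespace.
_NORMALIZED = {
    name.strip().casefold(): prefix for name, prefix in SOURCE_PREFIXES.items()
}


def prefix_for_source(source: str, default: str = "unknown") -> str:
    """Return the prefix for a source name, comparing case-insensitively."""
    return _NORMALIZED.get(source.strip().casefold(), default)


if __name__ == "__main__":
    for name in ("NCI Thesaurus", "NCI THESAURUS", "Some Other Source"):
        print(name, "->", prefix_for_source(name))
```

Normalizing both sides with casefold() means an all-caps source name resolves to the same prefix as the mixed-case one, and only sources that are genuinely unrecognized fall through to "unknown".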

gilaim07 commented 8 months ago

Per DW, the XML export was reading the source name from the wrong DB column: a free-text column that appears to recapitulate the entry in the correct column, which is created using drop-downs. The XML export will be fixed on the caDSR side, ETA around the end of the month.

PS: The plain "NCI" source in the XML, which always seemed to refer to the NCIt, won't need fixing in the future, as the XML will contain the correct source name rather than "NCI".

gilaim07 commented 8 months ago

A fixed version of de2x with case-insensitive source comparison has been pushed to the repo. Per above, it turned out not to be needed, as the error was in the XML export, but it doesn't hurt. It's tagged as 1.0.1. This tag also includes a change to the makefile.template file in the src directory. A new makefile needs to be generated in the src directory: simply copy makefile.template to makefile before running make in the parent directory.

In dev (ping for details if needed), run make from the parent directory:

    make clean && make

For a quick sanity check:

    make test

(de2x outputs quite a few details where things don't quite match the expectations; this is harmless.)

To install in the directory called by the scripts in the dev, qa, stage, and prod NCI contexts:

    make install-system

If you wish to run the entire caDSR download/conversion and TS data load process from start to end, delete the existing downloaded zip file from the download directory that the script uses, and then run ./master.sh from the scripts directory.