biopragmatics / bioversions

🪝 What's the latest version for each database?
https://biopragmatics.github.io/bioversions
MIT License

Add NCI Thesaurus #10

Closed jsstevenson closed 2 years ago

jsstevenson commented 2 years ago

Howdy! We're interested in potentially making use of this in a few of our projects. If you're receptive to PRs, I have a small handful of other sources that we draw from, in addition to NCIt (and let me know if I'm missing anything here).

cthoyt commented 2 years ago

@jsstevenson absolutely, I would love to accept external contributions! I would also like to write a manuscript about the current state of versioning in the biomedical database and ontology world, and how bioversions could be useful for the community. If you're thinking about this stuff too, I'd be keen to learn more and see if you'd want to help write that paper.

jsstevenson commented 2 years ago

Unfortunately, the NCIt FTP archives follow a folder structure that is a little hard to capture in a single f-string -- they place the current year's releases one level up from prior years (which are all housed in subdirectories for each year), e.g.:

2020/20.11e Release/
2020/20.12d Release/
21.08e Release/
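
For illustration only, the layout above could be handled with a small parser instead of a single f-string. This is a minimal sketch, not the code that was merged: the function names, the regular expression, and the idea that a later letter suffix within the same month means a later release are all assumptions inferred from the three example folders.

```python
import re

# Assumed naming pattern, inferred from the example folders above:
# two-digit year, two-digit month, one lowercase letter, then " Release".
RELEASE_RE = re.compile(r"^(\d{2})\.(\d{2})([a-z]) Release/?$")


def parse_release(name):
    """Return (year, month, letter) for a folder such as '21.08e Release/', else None."""
    match = RELEASE_RE.match(name.strip())
    if match is None:
        return None
    year, month, letter = match.groups()
    return 2000 + int(year), int(month), letter


def latest_version(top_level_entries):
    """Pick the newest release among the top-level folder names.

    Year subfolders such as '2020/' do not match the pattern and are skipped,
    which works because the current year's releases sit at the top level.
    """
    parsed = []
    for entry in top_level_entries:
        key = parse_release(entry)
        if key is not None:
            parsed.append((key, entry))
    if not parsed:
        raise ValueError("no release folders found")
    _, name = max(parsed)
    return name.strip().split(" ")[0]


def archive_path(version, current_year):
    """Build the relative folder path for a version string such as '20.11e'."""
    year = 2000 + int(version[:2])
    folder = f"{version} Release/"
    # Current-year releases live one level up; older ones sit under their year.
    return folder if year == current_year else f"{year}/{folder}"
```

With the folders listed above and 2021 taken as the current year, `latest_version` would return `21.08e`, and `archive_path("20.11e", 2021)` would give `2020/20.11e Release/`.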

I'd definitely be interested in getting in touch -- one of our group's broader projects focuses on knowledgebase integration in the cancer variant interpretation space (https://cancervariants.org/projects/integration/), so we have a vested interest in things like data provenance and reproducibility.

cthoyt commented 2 years ago

I'm going to merge now, but if you could send a link to that FTP address, I would appreciate it.

jsstevenson commented 2 years ago

> I'm going to merge now but if you could send a link to that FTP address I would appreciate it

👍

https://evs.nci.nih.gov/ftp1/NCI_Thesaurus/archive/