Handle Versions - Githubissues

cbizon commented 4 years ago

NodeNormalization depends on getting data from somewhere. That data currently comes from Babel. Wherever it comes from, it should be versioned and those versions should be exposed. In some cases, those versions will themselves depend on Biolink versions. It's not 100% clear to me how to manage that chain of versions.

gaurav commented 2 years ago

For Babel in general, we currently have four levels of possible versioning going on:

1. Now and in the future, I plan to tag the Babel code used to generate a particular Babel release -- for instance, this was the code used to generate the 2022sep6 release: https://github.com/TranslatorSRI/Babel/releases/tag/2022sep6
   - Some input data files are included in the repository and so are marked with this tag, e.g. https://github.com/TranslatorSRI/Babel/tree/2022sep6/input_data
2. For many input files, the Babel source code explicitly downloads a particular version of an input -- for example, pantherpathways.py for 2022sep6 downloaded its source data from http://data.pantherdb.org/ftp/pathway/current_release/SequenceAssociationPathway3.6.6.txt. Even better, this file will be removed from the current_release directory when it is superseded, so that step in Babel will fail until someone updates the version number to the current one.
3. For most input files, we always download the latest version of a file from a fixed place (e.g. ncbitaxon.py always downloads its source data from http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz). Since we wipe out babel_downloads every time we start a new Babel run, we should always get the latest version of this file; however, we don't currently have any way of recording which version of this file we have downloaded, although we know it was downloaded sometime around the Babel release date.
   - I would argue that being able to say "we downloaded NCBI Taxonomy around Sep 6, 2022" is probably good enough for our needs.
   - If not, we could log information about the source files in the Babel output -- perhaps in the babel_outputs/reports/sources/[datahandler] directory. This could be the URL retrieved, the last-modified date on the downloaded file(s), or the version number read from a README that was present alongside or within the input file. One neat thing about doing that is that we could publish an overall Sources summary report with each Babel release, sorted by last-modified date, which would be a way to catch sources that had not been updated in a while or to solicit feedback on which sources are known to have problems.
   - We could make this information public by keeping a copy of the Sources summary report in the GitHub repository and updating it after every release. That way we can also look at historical copies of this document as tagged with a release.
4. Finally, we incorporate some data from online databases like Ubergraph that are updated on their own schedule.
   - As above, we could log sources information from these databases as well -- either by determining an overall version number for that resource, or by getting the last-modified date for a particular subresource that we use (such as a particular ontology within Ubergraph).
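
A minimal sketch of the source-logging idea, to make it concrete. Everything here is an assumption rather than existing Babel code: the function names, the record layout, and the `source.json` file name are hypothetical, and the caller is assumed to pass in the HTTP response headers it already has from the download. Only the standard library is used.

```python
import json
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path


def build_source_record(datahandler, url, headers, retrieved_at=None):
    """Describe one downloaded input file (hypothetical record layout)."""
    last_modified = headers.get("Last-Modified")
    if last_modified is not None:
        # HTTP dates look like "Tue, 06 Sep 2022 10:00:00 GMT".
        last_modified = parsedate_to_datetime(last_modified).isoformat()
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    return {
        "datahandler": datahandler,
        "url": url,
        "last_modified": last_modified,  # None if the server reported no date
        "retrieved_at": retrieved_at.isoformat(),
    }


def write_source_record(record, reports_dir):
    """Write the record to <reports_dir>/<datahandler>/source.json."""
    out_dir = Path(reports_dir) / record["datahandler"]
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "source.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path


def summarize_sources(records):
    """Sort records oldest-first; sources with no date sort to the top.

    Assumes every last_modified value is an ISO 8601 string in UTC,
    so plain string comparison matches chronological order.
    """
    return sorted(records, key=lambda r: r["last_modified"] or "")
```

Calling `write_source_record(record, "babel_outputs/reports/sources")` for each datahandler and then passing all records to `summarize_sources` would give the raw material for the Sources summary report, with the stalest sources listed first.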

If the above is good enough, I think we can return the Babel version number (e.g. 2022sep6) with every NodeNorm request so users know where their data came from.
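
As a rough illustration of returning the version with every request, the sketch below wraps a result payload in an envelope that carries the Babel version. The `babel_version` and `results` keys, the `with_babel_version` helper, and the idea of a module-level constant are all assumptions made for this example; none of them describe NodeNorm's actual response format.

```python
# Assumed to be set once, when the compendia for a Babel release are loaded.
BABEL_VERSION = "2022sep6"


def with_babel_version(results, babel_version=BABEL_VERSION):
    """Wrap normalization results in an envelope that names the Babel release."""
    return {"babel_version": babel_version, "results": results}
```

An envelope keeps the version out of the per-identifier results, so clients that only read the results do not need to change how they iterate over them.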

If, however, we want to maintain provenance at the clique level, this will require changing how we generate the glom files so that provenance can be tracked and included. However, the proposal above re: recording source information for each datahandler would still be useful in (1) giving us an overall picture of what's included in a particular Babel release, and (2) producing the source versioning information we would need to actually generate that provenance.

cbizon commented 2 years ago

I don't think that we need per-clique provenance.

I'm not sure that just the Babel version number is enough, though, because the same version of the code could be run multiple times and pull different data sets, as you note.

So I think that there's a version (could be a date, could just be a number) of the overall collection. The Babel version is associated with that overall version, along with the versions of all the inputs. I could imagine wanting versions of the individual compendia files themselves, much as individual chromosome assemblies each have a version and the collection also has a version, so that you can tell that the next version of the collection is the same as the previous one but with a new compendium.
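
One way to picture that layering is a small manifest per collection. This is only a sketch under assumptions: the `CollectionManifest` class, its field names, the version strings, and the compendium file names are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CollectionManifest:
    """Versions that together identify one release of the whole collection."""
    collection_version: str  # a date or a plain counter
    babel_version: str       # tag of the Babel code that was run
    input_versions: dict = field(default_factory=dict)       # input name -> version
    compendium_versions: dict = field(default_factory=dict)  # file name -> version


def changed_compendia(old, new):
    """Return the compendium files whose version differs between two manifests."""
    names = set(old.compendium_versions) | set(new.compendium_versions)
    return sorted(
        name
        for name in names
        if old.compendium_versions.get(name) != new.compendium_versions.get(name)
    )


# Illustrative values only: same Babel code, one input refreshed, one compendium rebuilt.
v1 = CollectionManifest(
    collection_version="1",
    babel_version="2022sep6",
    input_versions={"pantherpathways": "3.6.6"},
    compendium_versions={"Gene.txt": "1", "Pathway.txt": "1"},
)
v2 = CollectionManifest(
    collection_version="2",
    babel_version="2022sep6",
    input_versions={"pantherpathways": "3.6.7"},
    compendium_versions={"Gene.txt": "1", "Pathway.txt": "2"},
)
print(changed_compendia(v1, v2))
```

Here the Babel code version is identical across both manifests, yet the collection version still advances, which is the case the Babel version number alone cannot distinguish.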

TranslatorSRI / NodeNormalization

Handle Versions #15