CI/CD Pipeline: Ensuring Builds Use Most Current Data

callahantiff commented 3 years ago

TASK

Currently, the build downloads data using builds/data_to_download.txt, which is a list of URLs. While this works for roughly 90% of the existing data, a few data providers include explicit versions in their URLs. As of now, this means that unless we update this text file, we are not guaranteed to get the most current data. Additionally, some of the downloads rely on running a query against a data provider's API. This should always return the most up-to-date data, but we should verify this as well.

The following resources include explicit versions in the URLs and will need updates to resolve the aforementioned problem:

ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz ➞ Ensembl
ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.uniprot.tsv.gz ➞ Ensembl
ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.entrez.tsv.gz ➞ Ensembl
ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/2021/mesh2021.nt ➞ MeSH
GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct ➞ GTEx
9606.protein.links.v11.0.txt.gz ➞ STRING
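
One way to keep this list from going stale is to scan the download list for version-like tokens automatically. The sketch below is a heuristic only: the three patterns and the function name find_version_tokens are assumptions chosen to match the entries above, not part of the existing build code, and they will miss other versioning schemes (for example, the "v1.1.9" fused onto "RNASeQC" is deliberately not matched).

```python
import re
from typing import List

# Heuristic patterns for version-like tokens. These are assumptions based on
# the URLs listed in this issue, not an exhaustive catalogue.
VERSION_PATTERNS = [
    re.compile(r"release-\d+"),                      # e.g. Ensembl "release-102"
    re.compile(r"(?<![A-Za-z0-9])v\d+(?:\.\d+)*"),   # e.g. "v11.0" or "_v8_"
    re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)"),      # four-digit years, e.g. "2021"
]


def find_version_tokens(url: str) -> List[str]:
    """Return every version-like token found in the URL, grouped by pattern."""
    tokens: List[str] = []
    for pattern in VERSION_PATTERNS:
        tokens.extend(match.group(0) for match in pattern.finditer(url))
    return tokens


if __name__ == "__main__":
    urls = [
        "ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz",
        "ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/2021/mesh2021.nt",
        "9606.protein.links.v11.0.txt.gz",
    ]
    for url in urls:
        print(url, "->", find_version_tokens(url))
```

Any URL that yields a non-empty list is a candidate for manual review or for replacement with a version-templated entry.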

The following resources are generated from querying an API:

UniProt Identifier Mapping ➞ API Query
Human Protein Atlas ➞ API Query
Human Proteins from PRO Consortium ➞ API Query
UniProt Cofactor-Catalyst Data ➞ API Query

TODO

[ ] Modify the download code for explicitly versioned URLs to ensure that we are always getting the most updated data
[ ] Verify that resources downloaded via API queries will also return the most updated results

cthoyt commented 3 years ago

Check out the bioversions project. I'm working on similar stuff to solve this problem... unfortunately, the state of versioned biomedical data is just as lacking as most other things 🤡

callahantiff commented 3 years ago

@cthoyt - brilliant, yes! Will definitely work on this for upcoming releases. Thanks for pointing this out!

cthoyt commented 3 years ago

@callahantiff please let me know if there are any resources you're using that aren't supported by bioversions already and I will add them. The syntax to get the current version for one is:

import bioversions
version_string = bioversions.get_version('resource name')
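
Building on that call, here is a minimal sketch of how the versioned Ensembl URLs from the original post might be filled in at build time. The "{version}" template convention, the names ENSEMBL_TEMPLATES and resolve_urls, and the resource key "ensembl" are all assumptions for illustration. It also assumes the assembly name (GRCh38) stays fixed between releases, which would need checking.

```python
from typing import Callable, List

# Hypothetical templates derived from the Ensembl URLs in this issue;
# "{version}" marks where the release number is substituted.
ENSEMBL_TEMPLATES = [
    "ftp://ftp.ensembl.org/pub/release-{version}/gtf/homo_sapiens/Homo_sapiens.GRCh38.{version}.gtf.gz",
    "ftp://ftp.ensembl.org/pub/release-{version}/tsv/homo_sapiens/Homo_sapiens.GRCh38.{version}.uniprot.tsv.gz",
    "ftp://ftp.ensembl.org/pub/release-{version}/tsv/homo_sapiens/Homo_sapiens.GRCh38.{version}.entrez.tsv.gz",
]


def _bioversions_lookup(resource: str) -> str:
    """Look up the current version via bioversions (requires network access)."""
    import bioversions  # imported lazily so the templating logic works without it

    return bioversions.get_version(resource)


def resolve_urls(
    templates: List[str],
    resource: str,
    get_version: Callable[[str], str] = _bioversions_lookup,
) -> List[str]:
    """Fill each URL template with the resource's current version string."""
    version = get_version(resource)
    return [template.format(version=version) for template in templates]


if __name__ == "__main__":
    # "ensembl" is an assumed resource name; check what bioversions supports.
    for url in resolve_urls(ENSEMBL_TEMPLATES, "ensembl"):
        print(url)
```

Passing the lookup in as a parameter keeps the templating step testable offline, for example with get_version=lambda name: "102" to reproduce the URLs currently in the text file.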

callahantiff / PheKnowLator

CI/CD Pipeline: Ensuring Builds Use Most Current Data #90
