Open callahantiff opened 3 years ago
check out the bioversions project, I'm working on similar stuff for solving this problem... unfortunately the state of versioned biomedical data is just as lacking as most other things 🤡
@cthoyt - brilliant, yes! Will definitely work on this for upcoming releases. Thanks for pointing this out!
@callahantiff please let me know if there are any resources you're using that aren't supported by bioversions already and I will add them. The syntax to get the current version for one is:
import bioversions
version_string = bioversions.get_version('resource name')
TASK
Currently, the build downloads data using builds/data_to_download.txt, which is a list of URLs. While this works for 90% of the existing data, a few data providers include explicit versions in their URLs. As of now, this means that unless we update this text file, we are not guaranteed to get the most current data. Additionally, some of the downloads rely on running a query against a data provider's API. This should always return the most up-to-date data, but we should verify this as well.

The following resources include explicit versions in their URLs and will need updates to resolve this problem:
ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.gtf.gz ➞ Ensembl
ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.uniprot.tsv.gz ➞ Ensembl
ftp://ftp.ensembl.org/pub/release-102/tsv/homo_sapiens/Homo_sapiens.GRCh38.102.entrez.tsv.gz ➞ Ensembl
ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/2021/mesh2021.nt ➞ MeSH
GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct ➞ GTEx
9606.protein.links.v11.0.txt.gz ➞ STRING

The following resources are generated from querying an API:
TODO
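
One possible way to keep the versioned URLs listed above current is to store them as templates and fill in the release at build time. The sketch below is only an illustration, not the project's implementation: `URL_TEMPLATES`, `resolve_urls`, and the `{version}` placeholder are hypothetical names, and the resource names are assumptions that may not match what bioversions expects. The version lookup is passed in as a function so that `bioversions.get_version` (the call quoted earlier in this thread) or any other source can be plugged in.

```python
from typing import Callable, Dict, List, Tuple

# (resource name, URL template) pairs; "{version}" marks where the release goes.
# Hypothetical structure: the resource names are assumptions and may need to be
# adjusted to whatever the version lookup actually accepts.
URL_TEMPLATES: List[Tuple[str, str]] = [
    ("Ensembl", "ftp://ftp.ensembl.org/pub/release-{version}/gtf/homo_sapiens/"
                "Homo_sapiens.GRCh38.{version}.gtf.gz"),
    ("MeSH", "ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/{version}/mesh{version}.nt"),
]


def resolve_urls(
    templates: List[Tuple[str, str]],
    get_version: Callable[[str], str],
) -> List[str]:
    """Fill each URL template with the version reported for its resource."""
    versions: Dict[str, str] = {}  # look each resource up only once
    urls = []
    for resource, template in templates:
        if resource not in versions:
            versions[resource] = get_version(resource)
        urls.append(template.format(version=versions[resource]))
    return urls
```

With bioversions installed, something like `resolve_urls(URL_TEMPLATES, bioversions.get_version)` could then regenerate the URL list before each build, assuming the returned version strings have the form the URLs expect (for example `102` rather than `release-102`). That would need to be checked per resource.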