BgeeDB / bgee_pipeline

Source code of the Bgee pipeline used to build the Bgee database
https://www.bgee.org/
Creative Commons Zero v1.0 Universal

Automatic check of links in download files #1

Open fbastian opened 6 years ago

fbastian commented 6 years ago

Implement an automatic check of all links provided in the download files. We have had problems with outdated URLs and missing files that we only discovered after the files were released.
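A minimal sketch of what such a check could look like, in Python and using only the standard library. This is not code from the Bgee pipeline: the function names (`extract_urls`, `url_is_reachable`, `find_broken_links`) are hypothetical, and the sketch assumes a download file has already been decompressed and read as text. The reachability probe is passed in as a parameter so that the extraction logic can be tested without network access.

```python
import re
import urllib.request

# Matches http(s) and ftp URLs up to the next whitespace, quote or angle bracket.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s"'<>]+""")


def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    seen = {}
    for match in URL_PATTERN.finditer(text):
        # Strip trailing punctuation that usually belongs to the sentence,
        # not to the URL.
        seen.setdefault(match.group(0).rstrip(".,;)"), None)
    return list(seen)


def url_is_reachable(url, timeout=10):
    """Return True if the URL can be opened. Requires network access.

    HEAD is used for HTTP(S) to avoid downloading the body; some servers
    reject HEAD, in which case a GET fallback would be needed.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError.
        return False


def find_broken_links(text, is_reachable=url_is_reachable):
    """Return the URLs in text for which the probe reports failure."""
    return [url for url in extract_urls(text) if not is_reachable(url)]
```

Calling `find_broken_links(file_content)` before a release would list the URLs that cannot be opened. Note that a successful response only shows that the URL resolves, not that the target page holds the expected record.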

fbastian commented 6 years ago

Also, I see that we use SRA IDs to link to GEO in the download files, but this doesn't work. See e.g. the link http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=ERX012344 in ftp://ftp.bgee.org/bgee_v14_0/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_experiments_libraries.tar.gz, which, from what I understand, should actually be https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30617.

Do we have the information needed to fix this problem during download file generation, @smoretti?

smoretti commented 6 years ago

We use SRX, ERX and DRX identifiers to ease download with the SRA Toolkit, so the direct link for those identifiers should be https://www.ncbi.nlm.nih.gov/sra/?term=ERX012344
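As an illustration of the URL correction described above, here is a small Python sketch that builds the SRA link from an experiment identifier. The function name `sra_link` and the validation pattern are assumptions made for this example, not part of the pipeline; only the URL form comes from this comment.

```python
import re

# SRX, ERX and DRX experiment identifiers: prefix followed by digits.
SRA_EXPERIMENT_ID = re.compile(r"[SED]RX\d+")
SRA_URL_TEMPLATE = "https://www.ncbi.nlm.nih.gov/sra/?term={accession}"


def sra_link(accession):
    """Return the SRA link for an SRX/ERX/DRX experiment identifier."""
    if not SRA_EXPERIMENT_ID.fullmatch(accession):
        raise ValueError(f"not an SRX/ERX/DRX identifier: {accession!r}")
    return SRA_URL_TEMPLATE.format(accession=accession)
```

Rejecting anything that is not an SRX, ERX or DRX identifier keeps a GEO accession such as GSE30617 from silently ending up in an SRA URL.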

smoretti commented 6 years ago

The column with the GEO link should be removed to simplify the file.

Only the SRA link (after URL correction) should remain.
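A rough Python sketch of dropping one column from a tab-separated download file. The helper name `drop_column` and the column header `"GEO link"` used in the usage line are hypothetical; the real header in the Bgee files may differ.

```python
import csv
import io


def drop_column(tsv_text, column_name):
    """Return tsv_text without the column whose header is column_name."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    # Raises ValueError if the header does not contain column_name.
    index = rows[0].index(column_name)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row[:index] + row[index + 1:])
    return out.getvalue()
```

For example, `drop_column(content, "GEO link")` would keep every other column, including the SRA link, in its original order.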