MicroB3-IS / osd-analysis

Repository for all Ocean Sampling Day related source code with information on how-to acquire OSD data
Apache License 2.0

Bulk download of EBI-metagenomics processed files #15

Open rec3141 opened 9 years ago

rec3141 commented 9 years ago

I couldn't find a way to batch download the processed files, so I made my own Excel spreadsheet to generate the proper URLs and wget commands. If someone has a better way (e.g. an EBI API), please share it. Otherwise I hope you find this useful. Change the yellow cell to the files you would like to download, and it will automatically populate the blue cells with the URLs and wget calls.

on github: https://github.com/rec3141/OSD/blob/master/osd-ebi-samples.xlsx

cheers, Eric Collins

mscheremetjew commented 9 years ago

Dear Eric, Unfortunately there is no function in place yet (like an API) that supports bulk download of project result files. We know that this would be a very sensible feature for our users, and it is on our long to-do list.

Yes, in the meantime you would have to script something together on your own using tools like wget, curl, etc.

e.g.

curl -o OSD1_2014-06-21_0m_NPL022_reads.fasta "https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667668/runs/ERR771106/results/sequences/versions/2.0/export?contentType=text&exportValue=processedReads"

Thanks for sharing this Excel spreadsheet with the community.

Best, Maxim Senior Software Developer - EMBL-EBI

rec3141 commented 9 years ago

Thanks Maxim. Is there a way to request the larger files in compressed format?

mscheremetjew commented 9 years ago

It depends on how urgently you need them. We could either upload the compressed files to our EBI FTP server (the quick option), or you would have to wait another 2 to 3 weeks until we have put them on our website, chunked and compressed. Let me know which you prefer. Best, Maxim

mscheremetjew commented 9 years ago

Hi Eric, Quick update on that. The InterProScan result files are now available as compressed files. If they are bigger than 2 gigabytes, we chunk them before compression, but I think that is never the case for OSD. The URLs for the InterProScan result files have changed. To request the number of chunks, call: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667660/runs/ERR770988/results/versions/2.0/function/InterProScan/chunks

Then you have to iterate over the number of chunks: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667660/runs/ERR770988/results/versions/2.0/function/InterProScan/chunks/{1...n}
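
The two-step pattern described above (ask for the chunk count, then request each chunk) could be sketched in Python roughly as follows. This is an illustration only: the helper names are hypothetical, and it assumes the `/chunks` endpoint returns the count as a bare integer in the response body, which is not confirmed anywhere in this thread.

```python
from typing import List
from urllib.request import urlopen

# Chunk-count URL quoted in the comment above.
CHUNKS_URL = (
    "https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667660"
    "/runs/ERR770988/results/versions/2.0/function/InterProScan/chunks"
)


def parse_chunk_count(body: str) -> int:
    """Parse the chunk count. ASSUMPTION: the body is a bare integer."""
    return int(body.strip())


def chunk_urls(chunks_url: str, n_chunks: int) -> List[str]:
    """Build one URL per chunk, numbered 1..n as in the comment above."""
    return [f"{chunks_url}/{i}" for i in range(1, n_chunks + 1)]


def fetch_chunk_count(chunks_url: str) -> int:
    """Ask the server how many chunks exist (performs a network request)."""
    with urlopen(chunks_url) as response:
        return parse_chunk_count(response.read().decode("utf-8"))
```

Calling `chunk_urls(CHUNKS_URL, fetch_chunk_count(CHUNKS_URL))` would then give the list of URLs to download. Keeping the URL construction separate from the network call makes it easy to check the generated URLs before fetching anything.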

We hope to get the FASTA formatted files chunked and compressed as well in the near future.

Best, Maxim

mscheremetjew commented 8 years ago

All larger result files are now available as compressed files (gzip): https://www.ebi.ac.uk/metagenomics/projects/ERP009703
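
Since the files are now served gzip-compressed, a downloaded file needs unpacking before most tools can read it. A minimal sketch using only the Python standard library is shown here; the function name and file paths are hypothetical, and it assumes the file has already been saved to disk.

```python
import gzip
import shutil


def gunzip_file(src_path: str, dest_path: str) -> None:
    """Decompress a gzip file to dest_path.

    The data is streamed in blocks, so large result files are not
    loaded into memory all at once.
    """
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

For example, `gunzip_file("reads.fasta.gz", "reads.fasta")` would write the uncompressed FASTA file next to the download.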

noaagoodwink commented 8 years ago

Hi Maxim,

Are there new URLs? Do I need to change the attached script?

thank you, -Kelly

Kelly D. Goodwin, Ph.D. National Oceanic and Atmospheric Administration AOML & SWFSC

8901 La Jolla Shores Drive La Jolla, CA 92037 858 546 7142 FAX: 858 546-7003 http://www.aoml.noaa.gov/ocd/people/goodwin/

mscheremetjew commented 8 years ago

Hi Kelly,

Which script are you referring to? The one from Eric? I just looked into Eric's Excel sheet. The URLs for the sequence section have changed since the summer.

As the OSD result files are relatively small, we kept them unchunked. The URLs need changing to: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/samples/ERS667478/runs/ERR771028/results/versions/2.0/sequences/ProcessedReads/chunks/1

The template URL is: https://www.ebi.ac.uk/metagenomics/projects/{project_id}/samples/{sample_id}/runs/{run_id}/results/versions/{version_number}/{domain}/{result_file_type}/chunks/1

Here is a list of the supported domains and result file types so far:

Domain       Result file type
-----------  -------------------------------
sequences    ProcessedReads
             ReadsWithPredictedCDS
             ReadsWithMatches
             ReadsWithoutMatches
             PredictedCDS
             PredictedORFWithoutAnnotation
             PredicatedCDSWithoutAnnotation
-----------  -------------------------------
function     InterProScan

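As a rough illustration of how the template URL and the domain/result-file-type list fit together, the following Python sketch fills in the template and rejects combinations that are not in the list. The function and constant names are hypothetical; the template string and the allowed values are copied from the comment above.

```python
# Template URL as given in the comment above (unchunked files use chunk 1).
TEMPLATE = (
    "https://www.ebi.ac.uk/metagenomics/projects/{project_id}/samples/{sample_id}"
    "/runs/{run_id}/results/versions/{version_number}/{domain}/{result_file_type}"
    "/chunks/1"
)

# Supported domains and result file types, spelled as listed above.
SUPPORTED = {
    "sequences": {
        "ProcessedReads",
        "ReadsWithPredictedCDS",
        "ReadsWithMatches",
        "ReadsWithoutMatches",
        "PredictedCDS",
        "PredictedORFWithoutAnnotation",
        "PredicatedCDSWithoutAnnotation",
    },
    "function": {"InterProScan"},
}


def build_result_url(project_id, sample_id, run_id, version_number,
                     domain, result_file_type):
    """Fill in the template after checking the domain/file-type pair."""
    if result_file_type not in SUPPORTED.get(domain, set()):
        raise ValueError(
            f"unsupported combination: {domain}/{result_file_type}"
        )
    return TEMPLATE.format(
        project_id=project_id,
        sample_id=sample_id,
        run_id=run_id,
        version_number=version_number,
        domain=domain,
        result_file_type=result_file_type,
    )
```

Calling `build_result_url("ERP009703", "ERS667478", "ERR771028", "2.0", "sequences", "ProcessedReads")` reproduces the example URL quoted above.
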
mscheremetjew commented 8 years ago

I have quickly put a Python script together to support project bulk download for individual unchunked result file types. I believe all of the OSD result files are unchunked. Documentation, including a link to the script, can be found here: https://github.com/ProteinsWebTeam/ebi-metagenomics/wiki/Downloading-results-programmatically

The script won't work for the taxonomy section. That part needs to be integrated. Of course the script needs further improvement.

Any feedback will be appreciated. I am happy to answer more questions if needed. Best, Maxim

noaagoodwink commented 8 years ago

Thank you, Maxim. Could you please supply us with the input file (the mapping file) to ensure that the script runs without error?

thank you, -kelly

mscheremetjew commented 8 years ago

Sure. Please download the input file here: https://github.com/mscheremetjew/osd-analysis/blob/3a1c21c920b5522d108cc6c9c7f0445eff81b464/input_osd_bulk_download.tsv

To download all result files of one type into one directory (ProcessedReads in this example; the -t option selects the result file type), your Python call needs to look something like this:

python mgportal_bulk_download.py -i input_osd_bulk_download.tsv -o ~/blah -v 2.0 -t ProcessedReads

mscheremetjew commented 8 years ago

Documentation has been updated and the script improved: https://github.com/ProteinsWebTeam/ebi-metagenomics/wiki/Downloading-results-programmatically

Best, Maxim