bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0

Downloading all the cBioPortal experiments failing #20

Closed ypriverol closed 3 years ago

ypriverol commented 5 years ago

We need to review this: the pipeline fails when all the studies from cBioPortal are downloaded.

ypriverol commented 5 years ago

@husensofteng I have reviewed the cBioPortal code and made some minor changes, including multithreading. Now we can download multiple databases at the same time. Please have a look and close this issue if everything works on your side.

I notice multiple studies are now available. In total, I downloaded 252 studies on my side.

husensofteng commented 5 years ago

It does not work here! I still get only a few studies downloaded and then it stalls.

ypriverol commented 5 years ago

It takes a while: some of the experiments are big and the download is not fast. For me, it downloads the first 100 in 2:30 h. Can you enable --multithreading, like this:

cbioportal-downloader --output_directory /your_folder/ -d all --multithreading
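For readers wondering what a multithreaded download might look like, here is a minimal sketch using Python's standard `concurrent.futures`. It is not the py-pgatk implementation: `download_studies` and the `download_one` callback are hypothetical names introduced for illustration, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_studies(study_ids, download_one, max_workers=4):
    """Run download_one(study_id) for every study in a thread pool.

    download_one is a hypothetical callback that fetches a single study and
    raises an exception on failure. Returns (succeeded, failed), where failed
    maps each study id to the exception it raised.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its study so failures can be reported by name.
        futures = {pool.submit(download_one, study): study for study in study_ids}
        for future in as_completed(futures):
            study = futures[future]
            try:
                future.result()
                succeeded.append(study)
            except Exception as err:  # one failed study must not stop the rest
                failed[study] = err
    return sorted(succeeded), failed
```

Collecting failures per study, instead of letting the first exception abort the run, means one slow or broken archive does not block the remaining downloads.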

husensofteng commented 5 years ago

With --multithreading enabled it did download more studies (56 in total), but then it got stuck at the following line:

ERROR:root:Error downloading -- Incorrect URL or file not found: http://download.cbioportal.org/sarc_tcga_pub.tar.gz on trial no: 0

But when I open the URL above in a web browser, the file downloads fine!

After restarting the script, it does download the study above but then gets stuck on some other studies, e.g.:

ERROR:root:Error downloading -- Incorrect URL or file not found: http://download.cbioportal.org/esca_tcga_pan_can_atlas_2018.tar.gz on trial no: 0
ERROR:root:Error code: <urlopen error retrieval incomplete: got only 19411096 out of 59095674 bytes>
ERROR:root:Error downloading -- Incorrect URL or file not found: http://download.cbioportal.org/gbm_tcga.tar.gz on trial no: 0
ERROR:root:Error code: <urlopen error retrieval incomplete: got only 22929735 out of 189434790 bytes>
ERROR:root:Error downloading -- Incorrect URL or file not found: http://download.cbioportal.org/gbm_tcga_pub2013.tar.gz on trial no: 0
ERROR:root:Error code: <urlopen error retrieval incomplete: got only 12316156 out of 37068654 bytes>

I will try to run it on other internet connections, to make sure it is not the university firewall that is somehow causing the trouble.
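The "retrieval incomplete" messages above report how many bytes arrived out of how many were expected, which suggests truncated transfers and not wrong URLs. As a rough aid for triage, the sketch below pulls those numbers out of a log and computes the fraction received. The function name `incomplete_fractions` is hypothetical, and the pattern assumes the log wording shown in this thread.

```python
import re

# Matches the byte counts in messages such as
# "retrieval incomplete: got only 19411096 out of 59095674 bytes".
INCOMPLETE = re.compile(r"got only (\d+) out of (\d+) bytes")


def incomplete_fractions(log_text):
    """Return (received, expected, fraction) for each truncated download in the log."""
    results = []
    for received, expected in INCOMPLETE.findall(log_text):
        received, expected = int(received), int(expected)
        results.append((received, expected, received / expected))
    return results
```

If every failed file stops at a different fraction of its size, that points toward dropped connections, which is consistent with the same URL working in a browser.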

ypriverol commented 5 years ago

I added the retries because of that. I think their server is really unreliable. We can do 3 retries for each file.
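A per-file retry of the kind described here could be sketched as follows. This is an illustration under stated assumptions, not the code in py-pgatk: `download_with_retries`, the `delay` back-off, and the injectable `fetch` parameter are hypothetical, and only `urllib.request.urlretrieve` and the `urllib.error` exceptions come from the standard library.

```python
import time
from urllib.error import ContentTooShortError, URLError
from urllib.request import urlretrieve


def download_with_retries(url, dest, retries=3, delay=1.0, fetch=urlretrieve):
    """Try to download url to dest, making at most `retries` attempts.

    Returns the number of attempts used. If every attempt fails, the last
    error is re-raised so the caller can log it or skip the study.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            fetch(url, dest)
            return attempt
        except (ContentTooShortError, URLError, OSError) as err:
            last_error = err
            if attempt < retries:
                # Wait a little longer after each failure (linear back-off).
                time.sleep(delay * attempt)
    raise last_error
```

Passing `fetch` as a parameter is only there so the retry logic can be exercised without touching the network; in normal use the default `urlretrieve` would apply.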

husensofteng commented 5 years ago

Consider downloading the studies from cBioPortal's datahub on GitHub.

husensofteng commented 3 years ago

This issue has been fixed by using git-lfs.