StuntsPT / NCBI_Mass_Downloader

A program to download large amounts of sequences from NCBI databases.
GNU General Public License v3.0

Control point to resume the download manually #27

Open Phismil opened 1 year ago

Phismil commented 1 year ago

Dear Developer, Thank you for maintaining and updating the repository. I am trying to download all COI sequences from NCBI, which is around four million entries. The download consistently fails after approximately 600,000 sequences; adding my API key and changing the source code to sleep for longer than 8 s between retries did not help. Is there any trick that might solve the issue, or a way to resume the download later from exactly the last entry in the interrupted FASTA file (e.g., similar to a control point in the web history)? Thank you

StuntsPT commented 1 year ago

Dear @Phismil, First of all, thank you for reaching out. If you run the same command again, the download should resume from where it left off. It might take a while to restart, as the program will download the accession number list and compare it with the accessions already in the FASTA file, but it should resume the download. Also, I am curious: how exactly is the program failing? Does the program crash? Is there any error message? Or does it just freeze and stop downloading data? Thank you.

Francisco
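
The resume step described above (fetch the full accession list, compare it with what is already in the FASTA file, download only the rest) can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the function names are hypothetical, and it assumes each FASTA header starts with the accession as its first whitespace-delimited token.

```python
from typing import Iterable, List, Set


def fasta_accessions(lines: Iterable[str]) -> Set[str]:
    """Collect the first whitespace-delimited token of every FASTA header."""
    found: Set[str] = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip a bare ">" line with no identifier
                found.add(tokens[0])
    return found


def missing_accessions(wanted: Iterable[str], have: Set[str]) -> List[str]:
    """Return the accessions still to fetch, keeping the order of `wanted`."""
    return [acc for acc in wanted if acc not in have]


# Made-up records, for illustration only.
partial_fasta = [
    ">AB000001.1 example record one",
    "ACGT",
    ">AB000002.1 example record two",
    "GGCC",
]
full_list = ["AB000001.1", "AB000002.1", "AB000003.1"]
print(missing_accessions(full_list, fasta_accessions(partial_fasta)))
```

With about four million accessions, holding the downloaded ones in a set keeps each membership check constant-time, which is why the comparison is slow to start but feasible.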

Phismil commented 1 year ago

Dear Francisco, Thank you for your response. I replicated the error. Below is the error I receive, which usually happens after downloading ~500-600K records. It might be directly linked to our local server/proxy setting. I will try it on an AWS or GC engine and update you.

Downloading records 692401 to 692600 of 4122811
Downloading record 692601 to 692800 of 4122811
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.

Cheers

StuntsPT commented 1 year ago

And then it just stops there? Or does it actually crash? Can you please also post the exact command you are using, program version, and the last few sequence names (just the > lines) from the resulting FASTA file?

What happens if you just Ctrl+C the program, and run the same exact command again?

Best, Francisco
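
One way to collect the last few sequence names requested above is a short helper like the one below. It is only a sketch; the function name and the file name in the usage note are placeholders, not part of the program.

```python
from collections import deque
from typing import Iterable, List


def last_headers(lines: Iterable[str], count: int = 5) -> List[str]:
    """Return the final `count` FASTA header lines, in file order."""
    headers = (line.rstrip("\n") for line in lines if line.startswith(">"))
    # A bounded deque keeps only the most recent `count` headers in memory.
    return list(deque(headers, maxlen=count))
```

For a real file this could be called as `with open("sequences.fasta") as handle: print(last_headers(handle))`, where `sequences.fasta` stands in for the actual output file. The file is streamed line by line, so a multi-gigabyte FASTA does not need to fit in memory.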

Phismil commented 1 year ago

Dear Francisco, I apologize for the delay; I wanted to spin up a new computing engine before updating you. The pipeline downloaded all COI sequences (~4 million) in three to four attempts on both Amazon AWS and Google Cloud Engine when there was no proxy setting. On the university's local server, with typical proxy settings, the pipeline occasionally needs more than 10 attempts to download all sequences. I checked the tail of the generated .fasta files, and there was nothing unusual such as an error or a warning from the NCBI server. It was just a normal ending, and when I restarted the pipeline to resume the download of the missing records, the new records were appended to the generated .fasta file. Thank you for your time, and please let me know if more information is needed.

StuntsPT commented 1 year ago

Dear @Phismil,

Thank you for the follow-up. I'm happy to read that you managed to get your sequences. You are not the first person to have issues when behind a proxy server, but I wouldn't even know where to start debugging. More likely than not, it is an issue between NCBI and the proxy server. The problem is that the program is neither getting an error response, nor is the requests library issuing a timeout (which means it somehow still thinks it's receiving data). I will leave the issue open for now, and try to get to the bottom of it during summertime. I may then request your help again in running the program and reporting on whether or not it worked. =-) Best,

Francisco
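
One possible explanation for the missing timeout, offered as an assumption rather than a diagnosis: the `timeout` argument in requests limits how long the client waits for the next bytes on the socket, not the total download time, so a connection that keeps trickling data never triggers it. The sketch below adds a wall-clock deadline on top of a chunked read. The names `read_with_deadline` and `DownloadStalled` are hypothetical and not part of the program or of requests.

```python
import time
from typing import Callable, Iterable


class DownloadStalled(Exception):
    """Raised when a chunked read takes longer overall than the deadline."""


def read_with_deadline(
    chunks: Iterable[bytes],
    deadline_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Join `chunks`, giving up once total elapsed time exceeds `deadline_s`."""
    start = clock()
    received = bytearray()
    for piece in chunks:
        received.extend(piece)
        # The check runs only when a piece arrives; complete silence on the
        # socket is left to the ordinary per-read timeout.
        if clock() - start > deadline_s:
            raise DownloadStalled(
                f"gave up after {deadline_s} s with {len(received)} bytes"
            )
    return bytes(received)
```

With requests, `chunks` could be `response.iter_content(chunk_size=8192)` from a call such as `requests.get(url, stream=True, timeout=(10, 30))`. A caller catching `DownloadStalled` could then retry the same chunk, much like the existing 8-second retry message does.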

Phismil commented 1 year ago

I will do this with pleasure. Yes, the problem is exactly what you mentioned, and it consistently happens after downloading 500K to 600K sequences. Cheers