KoslickiLab / KEGG_sketching_annotation

Scripts to sketch KEGG and explore using FracMinHash as a way to functionally annotate a metagenome
MIT License

Downloading genomes from GenBank using FTP runs into ftplib.error_perm #21

Open mahmudhera opened 2 years ago

mahmudhera commented 2 years ago

Command: get_reference_genomes.py -n 600 -s data -u

Script where error occurs: get_reference_genomes.py

Traceback:

  File "../../scripts/get_reference_genomes.py", line 242, in <module>
    main()
  File "../../scripts/get_reference_genomes.py", line 194, in main
    helper.go_to_direct()
  File "../../scripts/get_reference_genomes.py", line 45, in go_to_direct
    ftp.cwd(ftp.nlst()[0])
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 621, in cwd
    return self.voidcmd(cmd)
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 282, in voidcmd
    return self.voidresp()
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 255, in voidresp
    resp = self.getresp()
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 250, in getresp
    raise error_perm(resp)
ftplib.error_perm: 550 GCF_000022645.1_ASM2264v1_assembly_report.txt: No such file or directory

As @Omar-HeshamR reported verbally, this error is sporadic, and does not repeat deterministically.

mahmudhera commented 2 years ago

After removing everything and re-running, we get the following error:

Traceback (most recent call last):
  File "../../scripts/get_reference_genomes.py", line 242, in <module>
    main()
  File "../../scripts/get_reference_genomes.py", line 204, in main
    helper.download_FNA_file(path, current_directory_name)
  File "../../scripts/get_reference_genomes.py", line 76, in download_FNA_file
    ftp.retrbinary(f"RETR {filename}", file.write)
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 441, in retrbinary
    return self.voidresp()
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 255, in voidresp
    resp = self.getresp()
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 240, in getresp
    resp = self.getmultiline()
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 226, in getmultiline
    line = self.getline()
  File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 214, in getline
    raise EOFError
EOFError

Looks like it will be hard to track down why this is happening.

mahmudhera commented 2 years ago

Running the same command three more times results in the EOFError when downloading the same genome: Coprobacillus_sp._AF13-4LB.

Perhaps there is a pattern after all. Now trying to track down why this happens.

mahmudhera commented 2 years ago

It looks like I was mistaken: the error is not for Coprobacillus_sp._AF13-4LB, which is downloaded correctly without any issues. The problem is with the next genome in the list, Streptomyces_sp._SID7817, whose genomic.fna file is not downloaded.

@Omar-HeshamR if you are investigating Coprobacillus_sp._AF13-4LB, you may want to skip that for now.

Probably the correct way to go about it is to investigate the error message itself first. The ftplib.py getline() function documents that an EOFError occurs when the connection is closed. I don't think there is much to do about a closed connection, other than simply skipping this genome.
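The "skip this genome" idea could look roughly like the sketch below. It is not the code in get_reference_genomes.py: `download_or_skip` and its parameters are hypothetical names, and `ftp` is assumed to be an already connected `ftplib.FTP` object. `ftplib.all_errors` is the standard tuple of exceptions the library can raise, and it already includes `EOFError` and `error_perm`.

```python
import ftplib
import os


def download_or_skip(ftp, filename, local_path):
    """Fetch one file over FTP; return True on success, False if skipped.

    Hypothetical helper: `ftp` is assumed to be a connected ftplib.FTP object.
    """
    try:
        with open(local_path, "wb") as out:
            ftp.retrbinary(f"RETR {filename}", out.write)
        return True
    except ftplib.all_errors as exc:  # covers EOFError and ftplib.error_perm
        print(f"Skipping {filename}: {exc!r}")
        # Remove the partial file so it is not mistaken for a complete genome.
        if os.path.exists(local_path):
            os.remove(local_path)
        return False
```

One caveat: if the EOFError really means the server closed the connection, the same `ftp` object will most likely fail on the next genome too, so skipping alone may not be enough without reconnecting.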

Omar-HeshamR commented 2 years ago

I tested with Streptomyces_sp._SID7817 and it worked completely fine using the same code, so again I think the error is independent of the genome itself and instead has to do with the ftplib connection. Still investigating.

mahmudhera commented 2 years ago

I think you are correct. I am leaving this for tonight and will look into it again tomorrow, but I guess we were too optimistic in assuming that the FTP connection will stay alive for 500/1000 genomes. Probably the connection just closes itself after a number of downloads. We could try to reestablish the connection periodically, or use multiple threads, each handling a limited number of genomes. I’m still not sure which would be the best way to go about it. Using multiple threads may make it faster, but it also introduces the added complexity of dividing the work and coordinating among the threads.
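A minimal sketch of the periodic-reconnect option, under stated assumptions: `download_with_reconnect`, `connect`, `fetch`, and `reconnect_every` are hypothetical names, not part of the existing script. `connect` is assumed to be a zero-argument function that returns a logged-in `ftplib.FTP` object, and `fetch(ftp, name)` is assumed to download a single genome.

```python
import ftplib


def _close_quietly(ftp):
    """Close a connection even if the server has already dropped it."""
    try:
        ftp.quit()
    except ftplib.all_errors:
        ftp.close()


def download_with_reconnect(connect, names, fetch, reconnect_every=10):
    """Download every name, opening a fresh connection every few files.

    Returns the names that still failed after one retry on a new connection.
    """
    failed = []
    ftp = connect()
    for i, name in enumerate(names):
        if i > 0 and i % reconnect_every == 0:
            # Proactively replace the connection before the server drops it.
            _close_quietly(ftp)
            ftp = connect()
        try:
            fetch(ftp, name)
        except ftplib.all_errors:
            # The connection may be dead: reconnect and retry this file once.
            _close_quietly(ftp)
            ftp = connect()
            try:
                fetch(ftp, name)
            except ftplib.all_errors:
                failed.append(name)
    _close_quietly(ftp)
    return failed
```

Passing the connection factory in as an argument keeps the loop independent of the host and login details, and lets it be exercised with a fake connection object instead of the real server.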

Omar-HeshamR commented 2 years ago

Yes, I am going to start by resetting the connection periodically to see if that addresses the root cause of the problem, then I will look into using multiple threads to make it faster.

mahmudhera commented 2 years ago

I added a quick fix in this script. This is just invoking the same script a bunch of times with different random seeds. Every invocation downloads 10 genomes. There are 51 invocations, so there should have been 510 genomes. Naturally, some genomes were repeated. In the end, after running this script, we have 472 genomes downloaded without any errors. I think that a batch of 10 genomes is small enough that the server did not interrupt the FTP connection.
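The batching idea can be sketched as below. This is not the actual quick-fix script. The function names are hypothetical, `-n` is assumed to set the number of genomes per run, and the `--seed` flag in the usage note is purely illustrative, since the thread does not say how get_reference_genomes.py takes its random seed.

```python
import subprocess


def build_batch_commands(base_cmd, count_flag, seed_flag, n_batches, per_batch):
    """Build one command line per batch, each with its own random seed.

    `count_flag` and `seed_flag` are supplied by the caller because the real
    option names of the download script are not known here (assumption).
    """
    return [
        list(base_cmd) + [count_flag, str(per_batch), seed_flag, str(seed)]
        for seed in range(n_batches)
    ]


def run_batches(commands):
    """Run each batch in turn and return the commands that exited non-zero."""
    failed = []
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd)
    return failed
```

For the numbers in this comment, a call might look like `build_batch_commands(["python", "get_reference_genomes.py", "-s", "data", "-u"], "-n", "--seed", 51, 10)`, which yields 51 commands requesting 510 genomes in total. Because each batch samples independently, duplicates across batches are expected.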

This is not the cleanest solution, but at least now we have a large number of genomes to start experimenting.

Omar-HeshamR commented 2 years ago

Yes, after running the experiment 100s of times, I think the crash happens on average after ~40 genomes, so I agree that 10 should rarely crash. I think your approach is most likely faster. Let me know if we are definitely going with it, so that I know whether I should keep testing mine.