davised / get_assemblies

Download assembly files from NCBI

mixed genomes #3

Open pavlo888 opened 2 years ago

pavlo888 commented 2 years ago

Hi @davised

I tried downloading the genomes classified as "Agrobacterium rhizogenes", but genomes from other genera were downloaded as well, including Bacillus, Enterococcus, Leptospira, Salmonella, and Staphylococcus.

It would be easy to remove these genomes from the collection afterwards, but I assume the mislabeling originates in the database, right? Or is it caused by the get_assemblies package?

I ran the following command:

$ cat metadata.tab | get_assemblies assembly_ids - --function genomes -o fna

It seems that all 96 genomes were downloaded; however, I got the following error:

`Traceback (most recent call last):
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2449, in retrfile
    self.ftp.cwd(file)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
    return self.voidcmd(cmd)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
    return self.voidresp()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
    resp = self.getresp()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
    raise error_perm(resp)
ftplib.error_perm: 550 GCA_001367915.1_10493_1: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1572, in ftp_open
    fp, retrlen = fw.retrfile(file, type)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2451, in retrfile
    raise URLError('ftp error: %r' % reason) from reason
urllib.error.URLError: <urlopen error ftp error: error_perm('550 GCA_001367915.1_10493_1: No such file or directory')>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 121, in dl_gzip
    copy_url(pbar, task_id, uri, filename)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 142, in copy_url
    with urlopen(uri) as response:
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1583, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1572, in ftp_open
    fp, retrlen = fw.retrfile(file, type)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2451, in retrfile
    raise URLError('ftp error: %r' % reason) from reason
urllib.error.URLError: <urlopen error ftp error: URLError("ftp error: error_perm('550 GCA_001367915.1_10493_1: No such file or directory')")>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1565, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1586, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2407, in __init__
    self.init()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2419, in init
    self.ftp.cwd(_target)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
    return self.voidcmd(cmd)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
    return self.voidresp()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
    resp = self.getresp()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
    raise error_perm(resp)
ftplib.error_perm: 550 genomes/all/GCA/001/367/915/GCF_001367915.1_10493_1_6: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/bin/get_assemblies", line 8, in <module>
    sys.exit(main())
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1101, in main
    download_genomes(args.o, dl_mapping, args.threads)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1058, in download_genomes
    output = future.result()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 126, in dl_gzip
    copy_url(uri, task_id, uri, filename)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 142, in copy_url
    with urlopen(uri) as response:
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1583, in ftp_open
    raise exc.with_traceback(sys.exc_info()[2])
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1565, in ftp_open
    fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1586, in connect_ftp
    return ftpwrapper(user, passwd, host, port, dirs, timeout,
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2407, in __init__
    self.init()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2419, in init
    self.ftp.cwd(_target)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
    return self.voidcmd(cmd)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
    return self.voidresp()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
    resp = self.getresp()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
    raise error_perm(resp)
urllib.error.URLError: <urlopen error ftp error: error_perm('550 genomes/all/GCA/001/367/915/GCF_001367915.1_10493_1_6: No such file or directory')>`

Is the error shown at the end important? I am attaching the log file as well.

Cheers,
Pablo

get_assemblies.log

davised commented 2 years ago

Hi Pablo,

You'll need to put the assembly ids you want to download in a file first.

Check the metadata.tab file that is created by your search. Generally you will want to select a subset of the results. One way to do this is to select the lines for the genomes of interest and save their assembly accessions to a file: either delete the lines you don't want, or use grep to pull out the lines you want to keep. Then use cut -f 14 > accs.txt to write the assembly accessions to a file.
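The same grep-then-cut selection can be sketched in Python. This is only an illustration, assuming metadata.tab is tab-separated and that the accession sits in column 14, as the cut command above implies. The function name `extract_accessions`, the demo rows, and the accession values are made up for the example.

```python
import io


def extract_accessions(lines, keyword, column=14):
    """Mimic `grep keyword | cut -f column` on tab-separated lines.

    `column` is 1-based, like the field number given to cut.
    """
    accessions = []
    for line in lines:
        # Like grep: keep only lines that contain the keyword anywhere.
        if keyword not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip rows too short to have the requested column.
        if len(fields) >= column:
            accessions.append(fields[column - 1])
    return accessions


# Made-up two-row table with 14 columns; the accession is the last field.
demo = io.StringIO(
    "\t".join(["Agrobacterium rhizogenes"] + ["x"] * 12 + ["GCF_000000001.1"]) + "\n"
    + "\t".join(["Bacillus subtilis"] + ["x"] * 12 + ["GCF_000000002.1"]) + "\n"
)
accs = extract_accessions(demo, "Agrobacterium rhizogenes")
print("\n".join(accs))
```

With a real file, pass `open("metadata.tab")` instead of the demo buffer and write the result, one accession per line, to accs.txt.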

$ cat accs.txt | get_assemblies assembly_ids - --function genomes -o fna

Putting the entire metadata file in as input is not currently supported.

davised commented 2 years ago

Also, check the logs carefully: K599, for example, has a note in RefSeq that excludes it from the database, so you'll need to use --force to make the program download it.

By default, the program does not download a sequence that is excluded from RefSeq.
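The default-skip behaviour with a force override can be pictured roughly as follows. This is a hypothetical sketch and not the code in get_assemblies: the record layout, the `excluded_from_refseq` key, and the function name `select_for_download` are assumptions made for illustration.

```python
def select_for_download(records, force=False):
    """Keep records with no exclusion note; keep everything when force is set."""
    return [
        rec for rec in records
        if force or not rec.get("excluded_from_refseq")
    ]


# Made-up records: the second one carries an exclusion note.
records = [
    {"accession": "GCF_000000001.1", "excluded_from_refseq": ""},
    {"accession": "GCA_000000002.1", "excluded_from_refseq": "some exclusion note"},
]

default_selection = select_for_download(records)
forced_selection = select_for_download(records, force=True)
print(len(default_selection), len(forced_selection))
```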

pavlo888 commented 2 years ago

This time I made a list containing only the accession numbers (see attached file), but I still get an error.

This is the command I ran:

$ cat metadata_acc.tsv | get_assemblies assembly_ids - --function genomes -o fna

This is the error message:

Traceback (most recent call last):
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/bin/get_assemblies", line 8, in <module>
    sys.exit(main())
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1101, in main
    download_genomes(args.o, dl_mapping, args.threads)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1058, in download_genomes
    output = future.result()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 130, in dl_gzip
    shutil.copyfileobj(gzfh, outfh, 65536)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/shutil.py", line 195, in copyfileobj
    buf = fsrc_read(length)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/gzip.py", line 301, in read
    return self._buffer.read(size)
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/gzip.py", line 496, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid literal/length code

metadata_acc2.txt

davised commented 2 years ago

This is happening because there is a "#" sign in the assemblyname field:

GCF_001367915.1 -> 10493_1#6

I'll get this fixed shortly.
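The first traceback fails on a path that stops right before the "#", which is consistent with "#" being parsed as the start of a URL fragment. A minimal sketch of that effect, using Python's urllib.parse on a made-up example.org URL. The `ftp_safe_name` helper and its replacement rule are assumptions for illustration only, not the fix that was applied in get_assemblies.

```python
import re
from urllib.parse import urlsplit

# Made-up URL whose directory name contains the "#" from the assembly name.
url = "ftp://example.org/genomes/GCF_001367915.1_10493_1#6/genomic.fna.gz"
parts = urlsplit(url)

# Everything after the first "#" is treated as the fragment, so the path
# that would be requested is cut short.
print(parts.path)      # /genomes/GCF_001367915.1_10493_1
print(parts.fragment)  # 6/genomic.fna.gz


def ftp_safe_name(assembly_name):
    """Replace characters outside letters, digits, '.', '_' and '-' with '_'.

    Assumption: the server-side directory name uses underscores in place of
    such characters, as the "10493_1_6" path in the traceback suggests.
    """
    return re.sub(r"[^A-Za-z0-9._-]", "_", assembly_name)


print(ftp_safe_name("10493_1#6"))  # 10493_1_6
```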

davised commented 2 years ago

OK, please upgrade your program:

$ python3 -m pip install -U get-assemblies

and let me know if your problem is resolved.

davised commented 2 years ago

Also, to be clear, the idea is that you filter the genomes you don't want out of the metadata.tab file, extract the accession ids, and then send those to get_assemblies so that only the ones you need are downloaded.

That way, you don't spend bandwidth or disk space on genomes you know you don't want.