Open pavlo888 opened 2 years ago
Hi Pablo,
You'll need to get the assembly ids you want to download in a file, e.g.
Check out the metadata.tab file that is created after running this command. Generally you will want to select a subset from your search. One way to do this is to select the lines that include the genomes of interest and then save the assembly accessions to a file. You can either delete the lines that you don't want or use grep to pull out the lines that you want to keep. Then you can use cut -f 14 > accs.txt to get the assembly accessions in a file.
$ cat accs.txt | get_assemblies assembly_ids - --function genomes -o fna
Putting the entire metadata file in as input is not currently supported.
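For anyone who would rather do the filtering step in Python than with grep and cut, here is a minimal sketch. It assumes metadata.tab is tab-separated and that the assembly accession sits in column 14, as the `cut -f 14` suggestion implies. The function names and the keyword argument are made up for illustration and are not part of get_assemblies.

```python
from typing import Iterable, List


def extract_accessions(lines: Iterable[str], keyword: str, column: int = 14) -> List[str]:
    """Collect one tab-separated field from every line that contains `keyword`.

    `column` is 1-based, mirroring `cut -f 14`.
    """
    accessions = []
    for line in lines:
        if keyword not in line:
            continue  # same effect as keeping only the `grep keyword` matches
        fields = line.rstrip("\n").split("\t")
        # Skip short or malformed rows instead of raising IndexError.
        if len(fields) >= column and fields[column - 1]:
            accessions.append(fields[column - 1])
    return accessions


def write_accessions(metadata_path: str, out_path: str, keyword: str) -> int:
    """Filter `metadata_path` and write one accession per line to `out_path`."""
    with open(metadata_path, encoding="utf-8") as handle:
        accessions = extract_accessions(handle, keyword)
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(acc + "\n" for acc in accessions)
    return len(accessions)
```

Calling `write_accessions("metadata.tab", "accs.txt", "rhizogenes")` would then produce a file that can be piped into the command above. Check the column number against your own metadata.tab before relying on it.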
Also, check the logs carefully: K599, for example, has a note in RefSeq that excludes it from the database, so you'll need to use --force to make the program download it.
By default, if a sequence is excluded from RefSeq, I don't download it.
This time I made a list of only the accession numbers (see attached file), but I still get an error.
This is the code I ran:
cat metadata_acc.tsv | get_assemblies assembly_ids - --function genomes -o fna
This is the error message
Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/bin/get_assemblies", line 8, in <module>
sys.exit(main())
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1101, in main
download_genomes(args.o, dl_mapping, args.threads)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1058, in download_genomes
output = future.result()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 130, in dl_gzip
shutil.copyfileobj(gzfh, outfh, 65536)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/shutil.py", line 195, in copyfileobj
buf = fsrc_read(length)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/gzip.py", line 301, in read
return self._buffer.read(size)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/gzip.py", line 496, in read
uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid literal/length code
This is happening because there is a "#" sign in the assemblyname field:
GCF_001367915.1 -> 10493_1#6
I'll get this fixed shortly.
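To illustrate why a "#" in a name can break a download URL, here is a small sketch using only the Python standard library. The host `ftp.example.org` and the directory layout are placeholders, and `sanitize_name` is a hypothetical helper: it mirrors the underscore form (`GCF_001367915.1_10493_1_6`) that appears in a traceback elsewhere in this thread, but it is not necessarily how get_assemblies handles the name.

```python
from urllib.parse import quote, urlsplit

# Placeholder host and layout; only the assembly name comes from this thread.
BASE = "ftp://ftp.example.org/genomes/all"
NAME = "GCF_001367915.1_10493_1#6"


def url_path(url: str) -> str:
    """Return the path component that a URL parser sees."""
    return urlsplit(url).path


def sanitize_name(name: str) -> str:
    """Hypothetical helper: swap '#' for '_' before building a path."""
    return name.replace("#", "_")


raw = f"{BASE}/{NAME}"
# Everything after '#' is parsed as a fragment, so the path is cut short.
print(url_path(raw), "| fragment:", urlsplit(raw).fragment)

# Percent-encoding keeps the '#' inside the path ...
print(url_path(f"{BASE}/{quote(NAME)}"))

# ... while substituting it avoids the special character altogether.
print(url_path(f"{BASE}/{sanitize_name(NAME)}"))
```

The truncated path in the first print matches the shape of the `550 GCA_001367915.1_10493_1: No such file or directory` message, where the name stops right before the "#".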
Ok please upgrade your program:
python3 -m pip install -U get-assemblies
and let me know if your problem is resolved.
Also, to be clear, the idea is that you filter the genomes you don't want out of the metadata.tab file, extract the accession ids, and then send those to get_assemblies so that it downloads only the ones you need.
That way, you don't have to spend bandwidth/filespace to get genomes you know you don't want.
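The last step of that workflow, sending the filtered accessions to get_assemblies on standard input, could be scripted roughly as below. This is a sketch under assumptions: the arguments are copied from the commands quoted in this thread, the placement of `--force` at the end of the command line is a guess, and the helper names are invented for the example.

```python
import subprocess
from typing import List, Sequence


def build_command(force: bool = False) -> List[str]:
    """Assemble the command line used in this thread ('-' means read ids from stdin)."""
    cmd = ["get_assemblies", "assembly_ids", "-", "--function", "genomes", "-o", "fna"]
    if force:
        # Assumption: --force can simply be appended to the other arguments.
        cmd.append("--force")
    return cmd


def stdin_payload(accessions: Sequence[str]) -> str:
    """One accession per line, skipping blanks and stray whitespace."""
    return "".join(acc.strip() + "\n" for acc in accessions if acc.strip())


def download(accessions: Sequence[str], force: bool = False) -> subprocess.CompletedProcess:
    """Run get_assemblies with the accessions piped in, like `cat accs.txt | ...`."""
    return subprocess.run(
        build_command(force=force),
        input=stdin_payload(accessions),
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
```

`download(["GCF_001367915.1"])` would then behave like piping a one-line accs.txt into the command, provided get_assemblies is installed and on the PATH.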
Hi @davised
So I tried downloading some genomes classified as "Agrobacterium rhizogenes". However, I see that other genomes are also downloaded, including those of Bacillus, Enterococcus, Leptospira, Salmonella, Staphylococcus.
It would be easy to remove these genomes from the collection, but I assume this error originates in the database, right? Or is it caused by the get_assemblies package?
I ran the following command:
cat metadata.tab | get_assemblies assembly_ids - --function genomes -o fna
It seems that all 96 genomes were downloaded; however, I got the following error:
`Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2449, in retrfile
self.ftp.cwd(file)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
return self.voidcmd(cmd)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
return self.voidresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
resp = self.getresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
raise error_perm(resp)
ftplib.error_perm: 550 GCA_001367915.1_10493_1: No such file or directory
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1572, in ftp_open
fp, retrlen = fw.retrfile(file, type)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2451, in retrfile
raise URLError('ftp error: %r' % reason) from reason
urllib.error.URLError: <urlopen error ftp error: error_perm('550 GCA_001367915.1_10493_1: No such file or directory')>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 121, in dl_gzip
copy_url(pbar, task_id, uri, filename)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 142, in copy_url
with urlopen(uri) as response:
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1583, in ftp_open
raise exc.with_traceback(sys.exc_info()[2])
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1572, in ftp_open
fp, retrlen = fw.retrfile(file, type)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2451, in retrfile
raise URLError('ftp error: %r' % reason) from reason
urllib.error.URLError: <urlopen error ftp error: URLError("ftp error: error_perm('550 GCA_001367915.1_10493_1: No such file or directory')")>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1565, in ftp_open
fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1586, in connect_ftp
return ftpwrapper(user, passwd, host, port, dirs, timeout,
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2407, in __init__
self.init()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2419, in init
self.ftp.cwd(_target)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
return self.voidcmd(cmd)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
return self.voidresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
resp = self.getresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
raise error_perm(resp)
ftplib.error_perm: 550 genomes/all/GCA/001/367/915/GCF_001367915.1_10493_1_6: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/bin/get_assemblies", line 8, in <module>
sys.exit(main())
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1101, in main
download_genomes(args.o, dl_mapping, args.threads)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 1058, in download_genomes
output = future.result()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 126, in dl_gzip
copy_url(uri, task_id, uri, filename)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/site-packages/get_assemblies/__main__.py", line 142, in copy_url
with urlopen(uri) as response:
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1583, in ftp_open
raise exc.with_traceback(sys.exc_info()[2])
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1565, in ftp_open
fw = self.connect_ftp(user, passwd, host, port, dirs, req.timeout)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 1586, in connect_ftp
return ftpwrapper(user, passwd, host, port, dirs, timeout,
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2407, in __init__
self.init()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/urllib/request.py", line 2419, in init
self.ftp.cwd(_target)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 625, in cwd
return self.voidcmd(cmd)
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 286, in voidcmd
return self.voidresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 259, in voidresp
resp = self.getresp()
File "/Users/pablo/opt/anaconda3/envs/get-assemblies2/lib/python3.10/ftplib.py", line 254, in getresp
raise error_perm(resp)
urllib.error.URLError: <urlopen error ftp error: error_perm('550 genomes/all/GCA/001/367/915/GCF_001367915.1_10493_1_6: No such file or directory')>
`
Is the error showing at the end important? I am attaching the log file as well.
Cheers, Pablo
get_assemblies.log