Open ManuelPH96 opened 1 week ago
Hello @ManuelPH96,
I hope you are doing great and thank you for raising this issue !
I have never tried running such a large query on NCBI, but off the top of my head haystac
itself does not have a hardcoded upper limit of records that it can handle. I will try to reproduce your issue this week and then get back to you.
I hope this helps and thank you for your patience !
Best, Antony
Hello @ManuelPH96,
I hope you are well !
For your first question:
haystac
uses NCBI's E-utilities to fetch sequence records from the nuccore
and taxonomy
databases. Indeed there seems to be an upper limit on the records (10K) that can be retrieved using Esearch
. From the E-utilities documentation (https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch):
retmax Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
We will try to see if we can overcome this limitation for a future version of haystac
though.
Next thing I did was to try to create a database of all the insect mt genomes. I simplified your query into the following:
"Insecta"[Organism] AND "complete genome"[All Fields] AND mitochondrion[filter]
. This query retrieves all the complete insect mitochondrial genomes. This way you exclude incomplete genome records or records that correspond to individual gene records only. Your database will then include full mt genomes (approximately) within the same order of magnitude in terms of genome size. Coincidentally this query also returns 9289 records, which is below the upper limit of 10000 records set by E-utilities.
I ran quick test using the following command haystac database --mode fetch --query '"Insecta"[Organism] AND "complete genome"[All Fields] AND mitochondrion[filter]' --mtDNA --output all_mt_insects_test --cores 4
and the records were downloaded successfully. These records in turn correspond to 4196 unique insect species.
I am attaching the entrez files in case they are helpful.
I hope this helps ! Best, Antony
Hi, sorry for the delay in my answer. Thanks a lot for the clarification and for giving me an alternative. I have also successfully built the database you suggest, and I have been able to analyse a number of samples.
Nevertheless, I notice that your query does not work for some taxons which definitely should be included:
"Insecta"[Organism] AND "complete genome"[All Fields] AND mitochondrion[filter]
## has 9289 entries
I test this variant of your query
"Insecta"[Organism] AND "mitochondrion"[All Fields]
[From 5000 to 100000](https://www.ncbi.nlm.nih.gov/nuccore#facet_range_divseqlen)(29,128)
This less stringent query retrieves sequences that also should follow your condition, such as this one, which is not part of the 9289 entries from your suggested query:
Goliathus goliatus mitochondrion, complete genome 18,699 bp circular DNA Accession: OK484303.1 GI: 2159367802
Could I download sequences belonging to "my" query to build a haystac database, in a way that haystac also performs the filtering of sequences belonging to the same species? In this way I could have more species represented...
Thanks in advance, ManuelPH
@ManuelPH96 Thank you to you and your team for bringing the limited entries issue to light. This turned out to our lab's barrier in getting an extensive database successfully built.
We're following this thread to see if there are any changes that we can make that work with the scope of our project.
Hello, @Jacq-Elope
Let's see if we can find a way... If your group manages to circumvent this issue please let us know. I will try too.
Have a nice day,
Hi @ManuelPH96 and @Jacq-Elope,
If you want to build a database with more than 10,000 sequences then I suggest the following:
--accessions-file sequence.seq
and --resolve-accessions
This will allow you to have more than 10,000 accessions whilst only keeping the largest sequence for each taxon.
Hi @ekirving @Jacq-Elope @antonisdim
I'm thankful for @ekirving's idea, and I think I'm close to implementing it, but I get error messages in haystac when trying to download the sequences. I will explain my workflow, based on @ekirving and haystac instructions
With length filtering and eliminating entries having the "whole genome" phrases I filter out sequences that are not mitochondrion, and I also eliminate UNVERIFIED entries.
"Insecta"[Organism] AND "mitochondrion"[All Fields] AND ("5000"[SLEN] : "50000"[SLEN]) NOT "UNVERIFIED"[All Fields] NOT "whole genome"[All Fields]'
24337 entries
" Providing custom accessions It is also possible to provide your own accessions for a selected species/taxon. For that you will need to prepare a tab delimited file with two columns. The first column is the name of the taxon, that cannot contain any special characters, other than an underscore (‘_’), and the second column is a valid NCBI accession code.
Here is an example of the contents of such a file::
Yersinia_ruckeri NZ_CP025800.1 "
To get this info from the NT database of NCBI I downloaded the summary file of my query (sorting by lenght of sequence), and then, with an awk script I coded, I retrieved said information from the summary file per taxon:
1. 1_Tps_b3v08
49,493 bp linear DNA
OD004731.1 GI:1946151791
2. Tenthredo livida genome assembly, organelle: mitochondrion
45,273 bp linear DNA
OZ023074.1 GI:2699132214
3. Lasioglossum calceatum genome assembly, organelle: mitochondrion
43,265 bp linear DNA
OZ020163.1 GI:2689480591
4. Dolichomitus mesocentrus genome assembly, organelle: mitochondrion
41,823 bp linear DNA
OZ187123.1 GI:2813315321
5. Trypoxylon attenuatum genome assembly, organelle: mitochondrion
41,351 bp linear DNA
OZ022497.1 GI:2698077353
1_Tps_b3v08_ OD004731.1
Tenthredo_livida OZ023074.1
Lasioglossum_calceatum OZ020163.1
Dolichomitus_mesocentrus OZ187123.1
Trypoxylon_attenuatum OZ022497.1
I checked with R the number of unique taxons in the accessions file and it is 10763.
nice haystac database --mode build --accessions-file ../../results/29-10-2024/formatted_results.tsv --resolve-accessions --output ../../data/29-10-2024_all_mt_insects_accessions --cores 50
the same command but with fetch
instead of build did not work either
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 50
Rules claiming more threads will be scaled down.
Provided resources: entrez_api=3, mem_mb=1031922
Job counts:
count jobs
1 entrez_db_list
9598 entrez_download_sequence
9599
[Wed Oct 30 15:19:47 2024]
Job 6814: Downloading accession OQ873427.1 for taxon Batracomorphus_rinkihonis.
[Wed Oct 30 15:19:47 2024]
Job 6813: Downloading accession OX388053.1 for taxon Lacanobia_wlatinum.
[Wed Oct 30 15:19:47 2024]
Job 8043: Downloading accession ON165248.1 for taxon Bassarona_dunya.
[Wed Oct 30 15:19:47 2024]
[Wed Oct 30 15:19:47 2024]
Error in rule entrez_download_sequence:
[Wed Oct 30 15:19:47 2024]
Error in rule entrez_download_sequence:
jobid: 6813
Error in rule entrez_download_sequence:
jobid: 8043
output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Lacanobia_wlatinum/OX388053.1.fasta.gz
jobid: 6814
output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Bassarona_dunya/ON165248.1.fasta.gz
conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Batracomorphus_rinkihonis/OQ873427.1.fasta.gz
conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
RuleException:
CreateCondaEnvironmentException in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
The 'conda' command is not available in the shell /usr/bin/bash that will be used by Snakemake. You have to ensure that it is in your PATH, e.g., first activating the conda base environment with `conda activate base`.
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 425, in __new__
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 474, in _check
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
Trying to restart job 6813.
Trying to restart job 6814.
Trying to restart job 8043.
[Wed Oct 30 15:19:47 2024]
Job 6814: Downloading accession OQ873427.1 for taxon Batracomorphus_rinkihonis.
[Wed Oct 30 15:19:47 2024]
Job 6813: Downloading accession OX388053.1 for taxon Lacanobia_wlatinum.
[Wed Oct 30 15:19:47 2024]
Job 8043: Downloading accession ON165248.1 for taxon Bassarona_dunya.
[Wed Oct 30 15:19:48 2024]
[Wed Oct 30 15:19:48 2024]
Error in rule entrez_download_sequence:
Error in rule entrez_download_sequence:
jobid: 6814
jobid: 6813
output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Batracomorphus_rinkihonis/OQ873427.1.fasta.gz
output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Lacanobia_wlatinum/OX388053.1.fasta.gz
conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
[Wed Oct 30 15:19:48 2024]
conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
Error in rule entrez_download_sequence:
jobid: 8043
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Bassarona_dunya/ON165248.1.fasta.gz
Trying to restart job 6813.
conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
Trying to restart job 6814.
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
Trying to restart job 8043.
I stopped the job because no sequence appears to be successfully downloaded. These errors were the same as those found by @Jacq-Elope and me before my supervisor find the incompatibility problems in python libraries (see issue 23), but now this is fixed I don't really get what I am doing wrong.
I hope this helps someone, and if there is a hint on what I am missing please let me know,
Thanks for everything,
ManuelPH
Hi everyone,
I want to create a database with all sequences of mithocondria from insects, like this:
Said query of NCBI Nucleotide has around 3M entries. According to haystac bibliography, they performed some filtering to only retrieve one representative per species, which basically is choosing the longest sequences; I expected this 3M entries to be substantially reduced. My surprise is that the reduction was unexpectedly intense, and quite weird. The file entrez-nuccre.tsv has 9999 entries, making 10000 rows with header. After this, the selected sequences tsv only has 2701 sequencies. It looks like haystac only accepted 9999 entries of the query. I have spent some time reading the code and the tutorial and I can't find a way to know if my hypothesis is true or not, and I'd like to know the way of not having this restriction.
Thanks in advance,
ManuelPH