antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License
13 stars 4 forks source link

Limit on number of entries a HAYSTAC query can handle to produce database #24

Open ManuelPH96 opened 1 week ago

ManuelPH96 commented 1 week ago

Hi everyone,

I want to create a database with all sequences of mithocondria from insects, like this:

 haystac database \
    --mode build --mtDNA \
    --query '("Insecta"[Organism] AND "mitochondrial"[All Fields]) AND "insects"[porgn] AND (animals[filter] AND biomol_genomic[PROP] AND mitochondrion[filter])' \
    --output mitochondrial_all_insecta_db --cores 100

Said query of NCBI Nucleotide has around 3M entries. According to haystac bibliography, they performed some filtering to only retrieve one representative per species, which basically is choosing the longest sequences; I expected this 3M entries to be substantially reduced. My surprise is that the reduction was unexpectedly intense, and quite weird. The file entrez-nuccre.tsv has 9999 entries, making 10000 rows with header. After this, the selected sequences tsv only has 2701 sequencies. It looks like haystac only accepted 9999 entries of the query. I have spent some time reading the code and the tutorial and I can't find a way to know if my hypothesis is true or not, and I'd like to know the way of not having this restriction.

Thanks in advance,

ManuelPH

antonisdim commented 1 week ago

Hello @ManuelPH96,

I hope you are doing great and thank you for raising this issue !

I have never tried running such a large query on NCBI, but off the top of my head haystac itself does not have a hardcoded upper limit of records that it can handle. I will try to reproduce your issue this week and then get back to you.

I hope this helps and thank you for your patience !

Best, Antony

antonisdim commented 1 week ago

Hello @ManuelPH96,

I hope you are well !

For your first question: haystac uses NCBI's E-utilities to fetch sequence records from the nuccore and taxonomy databases. Indeed there seems to be an upper limit on the records (10K) that can be retrieved using Esearch. From the E-utilities documentation (https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch):

retmax Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.

We will try to see if we can overcome this limitation for a future version of haystac though.

Next thing I did was to try to create a database of all the insect mt genomes. I simplified your query into the following: "Insecta"[Organism] AND "complete genome"[All Fields] AND mitochondrion[filter]. This query retrieves all the complete insect mitochondrial genomes. This way you exclude incomplete genome records or records that correspond to individual gene records only. Your database will then include full mt genomes (approximately) within the same order of magnitude in terms of genome size. Coincidentally this query also returns 9289 records, which is below the upper limit of 10000 records set by E-utilities.

I ran quick test using the following command haystac database --mode fetch --query '"Insecta"[Organism] AND "complete genome"[All Fields] AND mitochondrion[filter]' --mtDNA --output all_mt_insects_test --cores 4 and the records were downloaded successfully. These records in turn correspond to 4196 unique insect species.

I am attaching the entrez files in case they are helpful.

I hope this helps ! Best, Antony

entrez-nuccore.txt entrez-taxa.txt

ManuelPH96 commented 2 days ago

Hi, sorry for the delay in my answer. Thanks a lot for the clarification and for giving me an alternative. I have also successfully built the database you suggest, and I have been able to analyse a number of samples.

Nevertheless, I notice that your query does not work for some taxons which definitely should be included:

"Insecta"[Organism] AND "complete genome"[All Fields] AND mitochondrion[filter] 
## has 9289 entries

I test this variant of your query

"Insecta"[Organism] AND "mitochondrion"[All Fields] 
[From 5000 to 100000](https://www.ncbi.nlm.nih.gov/nuccore#facet_range_divseqlen)(29,128)

This less stringent query retrieves sequences that also should follow your condition, such as this one, which is not part of the 9289 entries from your suggested query:

Goliathus goliatus mitochondrion, complete genome 18,699 bp circular DNA Accession: OK484303.1 GI: 2159367802

Could I download sequences belonging to "my" query to build a haystac database, in a way that haystac also performs the filtering of sequences belonging to the same species? In this way I could have more species represented...

Thanks in advance, ManuelPH

Jacq-Elope commented 1 day ago

@ManuelPH96 Thank you to you and your team for bringing the limited entries issue to light. This turned out to our lab's barrier in getting an extensive database successfully built.

We're following this thread to see if there are any changes that we can make that work with the scope of our project.

ManuelPH96 commented 1 day ago

Hello, @Jacq-Elope

Let's see if we can find a way... If your group manages to circumvent this issue please let us know. I will try too.

Have a nice day,

ekirving commented 1 day ago

Hi @ManuelPH96 and @Jacq-Elope,

If you want to build a database with more than 10,000 sequences then I suggest the following:

  1. Run your query on NCBI
  2. Choose "Sort by Sequence Length" (this puts the longest sequences first)
  3. Choose "Send to" > "File" > Format: "Accession list"
  4. Build your database using --accessions-file sequence.seq and --resolve-accessions

This will allow you to have more than 10,000 accessions whilst only keeping the largest sequence for each taxon.

ManuelPH96 commented 49 minutes ago

Hi @ekirving @Jacq-Elope @antonisdim

I'm thankful for @ekirving's idea, and I think I'm close to implementing it, but I get error messages in haystac when trying to download the sequences. I will explain my workflow, based on @ekirving and haystac instructions

1) Query in NCBI

With length filtering and eliminating entries having the "whole genome" phrases I filter out sequences that are not mitochondrion, and I also eliminate UNVERIFIED entries.

"Insecta"[Organism] AND "mitochondrion"[All Fields] AND ("5000"[SLEN] : "50000"[SLEN]) NOT "UNVERIFIED"[All Fields] NOT "whole genome"[All Fields]'
 24337 entries

2) I had to make an accessions file following HAYSTAC intructions:

" Providing custom accessions It is also possible to provide your own accessions for a selected species/taxon. For that you will need to prepare a tab delimited file with two columns. The first column is the name of the taxon, that cannot contain any special characters, other than an underscore (‘_’), and the second column is a valid NCBI accession code.

Here is an example of the contents of such a file::

Yersinia_ruckeri NZ_CP025800.1 "

To get this info from the NT database of NCBI I downloaded the summary file of my query (sorting by lenght of sequence), and then, with an awk script I coded, I retrieved said information from the summary file per taxon:

Summary file example

1. 1_Tps_b3v08
49,493 bp linear DNA 
OD004731.1 GI:1946151791

2. Tenthredo livida genome assembly, organelle: mitochondrion
45,273 bp linear DNA 
OZ023074.1 GI:2699132214

3. Lasioglossum calceatum genome assembly, organelle: mitochondrion
43,265 bp linear DNA 
OZ020163.1 GI:2689480591

4. Dolichomitus mesocentrus genome assembly, organelle: mitochondrion
41,823 bp linear DNA 
OZ187123.1 GI:2813315321

5. Trypoxylon attenuatum genome assembly, organelle: mitochondrion
41,351 bp linear DNA 
OZ022497.1 GI:2698077353

Accession file example

1_Tps_b3v08_    OD004731.1
Tenthredo_livida    OZ023074.1
Lasioglossum_calceatum  OZ020163.1
Dolichomitus_mesocentrus    OZ187123.1
Trypoxylon_attenuatum   OZ022497.1

I checked with R the number of unique taxons in the accessions file and it is 10763.

3) Now, when trying to build the database with HAYSTAC

nice haystac database --mode build --accessions-file ../../results/29-10-2024/formatted_results.tsv --resolve-accessions --output ../../data/29-10-2024_all_mt_insects_accessions --cores 50

the same command but with fetch instead of build did not work either

Unknown errors when downloading sequences

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 50
Rules claiming more threads will be scaled down.
Provided resources: entrez_api=3, mem_mb=1031922
Job counts:
    count   jobs
    1   entrez_db_list
    9598    entrez_download_sequence
    9599

[Wed Oct 30 15:19:47 2024]
Job 6814: Downloading accession OQ873427.1 for taxon Batracomorphus_rinkihonis.

[Wed Oct 30 15:19:47 2024]
Job 6813: Downloading accession OX388053.1 for taxon Lacanobia_wlatinum.

[Wed Oct 30 15:19:47 2024]
Job 8043: Downloading accession ON165248.1 for taxon Bassarona_dunya.

[Wed Oct 30 15:19:47 2024]
[Wed Oct 30 15:19:47 2024]
Error in rule entrez_download_sequence:
[Wed Oct 30 15:19:47 2024]
Error in rule entrez_download_sequence:
    jobid: 6813
Error in rule entrez_download_sequence:
    jobid: 8043
    output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Lacanobia_wlatinum/OX388053.1.fasta.gz
    jobid: 6814
    output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Bassarona_dunya/ON165248.1.fasta.gz
    conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
    output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Batracomorphus_rinkihonis/OQ873427.1.fasta.gz
    conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b

    conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b

RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
RuleException:
CreateCondaEnvironmentException in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
The 'conda' command is not available in the shell /usr/bin/bash that will be used by Snakemake. You have to ensure that it is in your PATH, e.g., first activating the conda base environment with `conda activate base`.
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 425, in __new__
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 474, in _check
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
Trying to restart job 6813.
Trying to restart job 6814.
Trying to restart job 8043.

[Wed Oct 30 15:19:47 2024]
Job 6814: Downloading accession OQ873427.1 for taxon Batracomorphus_rinkihonis.

[Wed Oct 30 15:19:47 2024]
Job 6813: Downloading accession OX388053.1 for taxon Lacanobia_wlatinum.

[Wed Oct 30 15:19:47 2024]
Job 8043: Downloading accession ON165248.1 for taxon Bassarona_dunya.

[Wed Oct 30 15:19:48 2024]
[Wed Oct 30 15:19:48 2024]
Error in rule entrez_download_sequence:
Error in rule entrez_download_sequence:
    jobid: 6814
    jobid: 6813
    output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Batracomorphus_rinkihonis/OQ873427.1.fasta.gz
    output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Lacanobia_wlatinum/OX388053.1.fasta.gz
    conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b
[Wed Oct 30 15:19:48 2024]
    conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b

Error in rule entrez_download_sequence:

    jobid: 8043
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
    output: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/ncbi/Bassarona_dunya/ON165248.1.fasta.gz
Trying to restart job 6813.
    conda-env: /home/manuelpinero/haystac/cache/conda/beff389787b1b0d676f536d2f4bdaade/conda/54450b024a9f7d99e76428575843ff5b

Trying to restart job 6814.
RuleException:
AttributeError in line 66 of /home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk:
'Conda' object has no attribute 'info'
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/haystac/workflow/rules/entrez.smk", line 66, in __rule_entrez_download_sequence
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 519, in shellcmd
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 515, in bin_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/deployment/conda.py", line 512, in prefix_path
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/pablolibrado/miniconda3/envs/haystac/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
Trying to restart job 8043.

I stopped the job because no sequence appears to be successfully downloaded. These errors were the same as those found by @Jacq-Elope and me before my supervisor find the incompatibility problems in python libraries (see issue 23), but now this is fixed I don't really get what I am doing wrong.

I hope this helps someone, and if there is a hint on what I am missing please let me know,

Thanks for everything,

ManuelPH