JensUweUlrich / Taxor

Fast and space-efficient taxonomic classification of long reads
BSD 3-Clause "New" or "Revised" License
43 stars 2 forks source link

Creating a large custom database from NCBI refseq #10

Open humbleflowers opened 2 months ago

humbleflowers commented 2 months ago

Hello, I am trying to build a database for taxor using ncbi refseq sequences from Prokaryotes, Archaea, Bacteria, Virus and Fungii (almost 50k orgamisms in total). taxor build --input-file ../../taxor/taxor_input.tsv --input-sequence-dir . --output-filename refseq-PBFAV-VK --kmer-size 22 --syncmer-size 12 --threads 30

i get this error

64      1.00    4.58    1.00    4.58    296.9GiB 128     0.96    4.48    1.01    4.52    299.6GiB 256     1.20    3.14    0.92    2.89    273.5GiB 512     1.71    3.84    0.87    3.34    257.8GiB Best t_max (regarding expected query runtime): 256 write Layout header layout created terminate called after throwing an instance of 'std::invalid_argument' what():  The size of the shape cannot be greater than the window size. Aborted (core dumped)

VERSION
    Last update:
    taxor-build version: 0.1.3
    SeqAn version: 3.4.0-rc.1

Thank you for the tool, the initial results are really good. @JensUweUlrich

humbleflowers commented 2 months ago

adding top few lines from input tsv for your reference here


GCF_900128725.1 9       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/128/725/GCF_900128725.1_BCifornacula_v1.0      Buchnera aphidicola     d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Erwiniaceae;g__Buchnera;s__Buchnera aphidicola  131567;2;1224;1236;91347;1903409;32199;9
GCF_019599125.1 24      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/599/125/GCF_019599125.1_ASM1959912v1   Shewanella putrefaciens d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Alteromonadales;f__Shewanellaceae;g__Shewanella;s__Shewanella putrefaciens  131567;2;1224;1236;135622;267890;22;24
GCF_016698685.1 34      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/698/685/GCF_016698685.1_ASM1669868v1   Myxococcus xanthus      d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;f__Myxococcaceae;g__Myxococcus;s__Myxococcus xanthus       131567;2;2818505;32015;29;80811;31;32;34
GCF_000219105.1 35      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/219/105/GCF_000219105.1_ASM21910v1     Corallococcus macrosporus       d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;f__Myxococcaceae;g__Corallococcus;s__Corallococcus macrosporus     131567;2;2818505;32015;29;80811;31;83461;35
GCF_002305875.1 43      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/305/875/GCF_002305875.1_ASM230587v1    Cystobacter fuscus      d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;f__Archangiaceae;g__Cystobacter;s__Cystobacter fuscus      131567;2;2818505;32015;29;80811;39;42;43
GCF_001027285.1 48      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/027/285/GCF_001027285.1_ASM102728v1    Archangium gephyra      d__Bacteria;p__Myxococcota;c__Myxococcia;o__Myxococcales;f__Archangiaceae;g__Archangium;s__Archangium gephyra       131567;2;2818505;32015;29;80811;39;47;48
GCF_001189295.1 52      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/189/295/GCF_001189295.1_ASM118929v1    Chondromyces crocatus   d__Bacteria;p__Myxococcota;c__;o__Polyangiales;f__Polyangiaceae;g__Chondromyces;s__Chondromyces crocatus    131567;2;2818505;3031711;3031712;49;50;52
GCF_002950945.1 56      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/950/945/GCF_002950945.1_ASM295094v1    Sorangium cellulosum    d__Bacteria;p__Myxococcota;c__;o__Polyangiales;f__Polyangiaceae;g__Sorangium;s__Sorangium cellulosum        131567;2;2818505;3031711;3031712;49;39643;56
GCF_022870945.1 61      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/022/870/945/GCF_022870945.1_ASM2287094v1   Vitreoscilla stercoraria        d__Bacteria;p__Pseudomonadota;c__Betaproteobacteria;o__Neisseriales;f__Neisseriaceae;g__Vitreoscilla;s__Vitreoscilla stercoraria    131567;2;1224;28216;206351;481;59;61
GCF_002222655.1 63      https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/222/655/GCF_002222655.1_ASM222265v1    Vitreoscilla filiformis d__Bacteria;p__Pseudomonadota;c__Betaproteobacteria;o__Neisseriales;f__Neisseriaceae;g__Vitreoscilla;s__Vitreoscilla filiformis     131567;2;1224;28216;206351;481;59;63
humbleflowers commented 2 months ago

Hi @JensUweUlrich i identified the problem, i had duplicate taxids that why the build command failed. Now i fixed it.

After building the custom database successfully. search command runs fine but profile command fails

taxor search --index-file /taxor/refseq-PBFAV-VK-sub_k22_s12.hixf --query-file zymo.fastq.gz --error-rate 0.15 --threads 30 --output-file zymo.search.txt ; taxor profile --search-file zymo.search.txt --cami-report-file zy6322.cami_report_file15 --binning-file zy6322.binning15 --seq-abundance-file zy6322.seq_abund_file15 --sample-id zymo

the fail log


Starting EM iteration 2
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Starting EM iteration 3
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Starting EM iteration 4
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Starting EM iteration 5
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Starting EM iteration 6
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Starting EM iteration 7
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Starting EM iteration 8
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Starting EM iteration 9
Update prior probabilities ...Sum nts of matching reads per taxon...
done
final update...
done
done
Number of EM steps needed: 10
done
Calculate higher rank sequence abundances..Segmentation fault (core dumped)
JensUweUlrich commented 2 months ago

@humbleflowers Thanks for pointing me at the issue with duplicated taxIDs. That's something I need to fix. How large is the output file of the search command? Could you provide the file for debugging?

humbleflowers commented 2 months ago

taxor search --index-file /taxor/refseq-PBFAV-VK-sub_k22_s12 --query-file ../ont_zymo_samples/ZYMO_D6311_14.fastq.gz --error-rate 0.05 --threads 30 --output-file zymo_d6311_14.search.txt ; taxor profile --search-file zymo_d6311.search.txt --cami-report-file zy.cami_report_file --binning-file zy.binning --seq-abundance-file zy.seq_abund_file --sample-id zymo_d6311_14

the search file is 712MB in size. I can provide the file. How can i facilitate it?

JensUweUlrich commented 2 months ago

Could you upload the file to the following dropbox folder?: Research Data Exchange

humbleflowers commented 2 months ago

Hello @JensUweUlrich, DropBox is restricted in my organisation. I am exploring other possibilities with internal data sharing software. Can you share me your email?

humbleflowers commented 1 month ago

Hello @JensUweUlrich, Its not possible to share data using dropbox. I will have to use Microsoft One drive of my organisation. It will be great if you share email address to facilitate it. Sorry for the incovenience.

JensUweUlrich commented 1 month ago

Hi @humbleflowers

Thanks for your support. You can use my gmail address, which is jensenuk83@gmail.com

humbleflowers commented 1 month ago

Hello @JensUweUlrich

I could only share it to your official email jens-uwe.ulrich[at]hpi.de in the publication due to my organization policies. You should have recieved my mail and the file download link. I hope it works for you. My apologies for any incovenience.

Thank you.

JensUweUlrich commented 1 month ago

Thanks, I successfully downloaded the file and found the issue immediately.

9346179b-ad67-4cbd-8281-377177a1672c runid=bf06a9565fd155984403a792f5e0017f6b6c9114 read=37332 ch=2084 start_time=2024-07-25T09:05:44.446938+04:00 flow_cell_id=PAY75906 protocol_group_id=20240724_ONT_METAG_24_IN_1_RPB_VAL2 sample_id=VAL2_24_in_1_RPB barcode=barcode21 barcode_alias=barcode21 parent_read_id=9346179b-ad67-4cbd-8281-377177a1672c basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.2.0      GCF_000146045.2 Saccharomyces cerevisiae S288C  4932    12157105        3185    267     112     d__Eukaryota;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Saccharomycetaceae;g__Saccharomyces;s__Saccharomyces cerevisiae S288C  131567;2759;33154;4751;451864;4890;716545;147537;4891;4892;4893;4930;4932

At the very end of that line, you can see the taxids of the full lineage. Here, the number of taxids is greater than the number of taxon strings in the field before. The problem is that taxids for subphylum, clade, subkingdom etc. are included in the refseq accession file, which is used for building the index. Did you use one of the prebuilt indexes I provided or did you create your own one?

humbleflowers commented 1 month ago

Hello @JensUweUlrich I created my own database using the build command as i mentioned in the post above. 'taxor build --input-file ../../taxor/taxor_input.tsv --input-sequence-dir . --output-filename refseq-PBFAV-VK --kmer-size 22 --syncmer-size 12 --threads 30'

Do i need to clean input tsv file to only include taxids at species level?

JensUweUlrich commented 1 month ago

Yes, you need to have exactly the same number of taxon ids and taxon names (the last two rows) in the input file describing your reference database. For ease of use I would stick to the classical ones species, genus, family, order, class, phylum, kingdom

humbleflowers commented 1 month ago

Thank you @JensUweUlrich, I will give it a try and get back to you.

humbleflowers commented 1 month ago

Hello @JensUweUlrich

What is the correct way to add this line in input file? My concern is about subspecies. In that scenario how should i add species and sub species. I have a line like this in my input file for subspecies of Treponema pallidum subsp. pallidum. if you notice i denote subspecies with s__ and i have 8 taxonomy ids while 7 taxonomy levels.

GCF_023016425.1 160 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/016/425/GCF_023016425.1_ASM2301642v1 Treponema pallidum subsp. pallidum k__Bacteria;p__Spirochaetota;c__Spirochaetia;o__Spirochaetales;f__Treponemataceae;g__Treponema;s__Treponema pallidum subsp. pallidum 2;203691;203692;136;2845253;157;161;160

In nutshell, i want to know how to represent a subspecies like one in the example above.

JensUweUlrich commented 1 month ago

@humbleflowers In this scenario, the easiest way would be to just add the tax ID of the subspecies and not use the taxid of the species. The only drawback would be that the final abundance reports will summarize the abundance of the subspecies and not of the species level.