Open TakacsBertalan opened 2 years ago
Format 102 will do a simple LCA taxonomic annotation of each read, and it's the only way to directly do taxonomic classification with Diamond without using any postprocessing. If you merge query files the output will be dumped together into a single file, so you would have to separate that again by sample after the Diamond run.
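A minimal sketch of that separation step, assuming each read ID was prefixed with its sample name before merging (DIAMOND does not do this itself). Format 102 output is tab-separated with the query ID, the NCBI taxon ID (0 if unclassified), and the E-value of the best hit. The `__` separator, the sample names, and the function name below are hypothetical.

```python
from collections import Counter, defaultdict


def count_taxa_per_sample(lines, sep="__"):
    """Tally taxon IDs per sample from merged DIAMOND format 102 lines.

    Assumption (not something DIAMOND provides): every read ID looks like
    '<sample><sep><original read ID>'.
    """
    counts = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Columns: query ID, taxon ID, E-value; only the first two are needed.
        query_id, taxid = line.split("\t")[:2]
        # Everything before the first separator is taken as the sample name.
        sample = query_id.partition(sep)[0]
        counts[sample][taxid] += 1
    return counts


if __name__ == "__main__":
    demo = [
        "S01__read1\t562\t1e-30",
        "S01__read2\t562\t3e-12",
        "S01__read3\t0\t0",
        "S02__read1\t1280\t2e-20",
    ]
    for sample, taxa in count_taxa_per_sample(demo).items():
        print(sample, dict(taxa))
```

On a real run, the `lines` argument could simply be the open output file. The resulting counts are read counts per taxon, not normalized abundances.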
Hi! In short: I have 26 paired-end metagenomic samples, with read counts varying between 500,000 and 2,000,000, and I would like to classify them with DIAMOND. If I merge these queries to optimize the running time, how will that influence the metagenomic classification? Will I be able to extract the abundances for each sample if I use the 102 output format?
In more detail: I would like to use DIAMOND to classify metagenomic data. I am running DIAMOND on a PC with 12 threads and 64 GB of RAM, and the sample I tested the software on has 10 million reads. It took around 1 day to run the forward reads of this sample with the following command:
diamond blastx --verbose -p 10 --outfmt 102 -d DIAMOND/diamond_nr_database.dmnd -q datasets/sajat_100_faj_10m_1.fastq.gz -o output/diamond_sajat_100_faj_10m_1 --taxonmap DIAMOND/taxdmp/ --taxonnodes DIAMOND/taxdmp/nodes.dmp
The step that seems to take the most time is "Loading reference sequences", which often takes around 5000 seconds. Because I have 25 other similar samples (so 52 FASTQ files in total), I would like to minimize the running time. I found this issue, https://github.com/bbuchfink/diamond/issues/545, where you suggested merging the queries. I am wondering how merging the 52 files will influence the final result: will I still be able to extract the abundances for each sample, or will everything just be dumped together? I use the 102 output format because it seemed the most straightforward to me, but is it the correct solution?
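For reference, a rough sketch of how the merge could keep samples distinguishable: the sample name is written in front of every read ID while the files are concatenated. It assumes gzipped FASTQ files with strict 4-line records; the `__` separator, the function name, and the file names in the usage comment are hypothetical.

```python
import gzip


def merge_with_sample_tags(samples, out_path, sep="__"):
    """Concatenate gzipped FASTQ files, prefixing each read ID with a sample name.

    samples: dict mapping a sample name to the path of its gzipped FASTQ file.
    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    with gzip.open(out_path, "wt") as out:
        for name, path in samples.items():
            with gzip.open(path, "rt") as fastq:
                for i, line in enumerate(fastq):
                    if i % 4 == 0:
                        # Header line: '@read ...' becomes '@<sample><sep>read ...'
                        line = "@" + name + sep + line[1:]
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    out.write(line)


# Hypothetical usage:
# merge_with_sample_tags(
#     {"S01_R1": "datasets/S01_1.fastq.gz", "S02_R1": "datasets/S02_1.fastq.gz"},
#     "datasets/merged_1.fastq.gz",
# )
```

Forward and reverse files share read IDs, so giving them distinct tags (such as `S01_R1` and `S01_R2`) keeps them apart in the merged output.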