Question: Merging queries for metagenomic classification

Hi! In short: I have 26 paired-end metagenomic samples, with read counts varying between 500,000 and 2,000,000, and I would like to classify them with DIAMOND. If I merge these queries to optimize the running time, how will that influence the metagenomic classification? Will I be able to extract the abundances for each sample if I use the 102 output format?

In more detail: I would like to use DIAMOND to classify metagenomic data. I am running DIAMOND on a PC with 12 threads and 64 GB of RAM, and the sample I tested the software on has 10 million reads. It took around 1 day to run the forward reads of this sample with the following command:

diamond blastx --verbose -p 10 --outfmt 102 -d DIAMOND/diamond_nr_database.dmnd -q datasets/sajat_100_faj_10m_1.fastq.gz -o output/diamond_sajat_100_faj_10m_1 --taxonmap DIAMOND/taxdmp/ --taxonnodes DIAMOND/taxdmp/nodes.dmp

The step that seems to take the most time is "Loading reference sequences", which often takes around 5000 seconds. Because I have 25 other similar samples (52 FASTQ files in total), I would like to minimize the running time. I found this issue, https://github.com/bbuchfink/diamond/issues/545, where you suggested merging the queries. I am wondering how merging the 52 files will influence the final result: will I still be able to extract the abundances for each sample, or will everything just be dumped together? I use the 102 output format because it seemed the most straightforward to me, but is it the correct choice?
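To make the per-sample question concrete, here is a minimal sketch of one way the bookkeeping could work if the files were merged. It is not taken from the DIAMOND documentation. It assumes that each read ID is prefixed with its sample name before merging, that the 102 output is tab-delimited with the query ID in the first column and the NCBI taxon ID in the second, and that the FASTQ files use plain 4-line records. The separator `__` and the function names `tag_fastq` and `count_taxa` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical separator between sample name and read ID;
# it must not occur in any sample name.
SEP = "__"


def tag_fastq(lines, sample):
    """Yield FASTQ lines with the sample name prefixed to each read ID.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            # Header line: drop the leading '@', then prepend the sample tag.
            yield "@" + sample + SEP + line[1:]
        else:
            yield line


def count_taxa(rows):
    """Tally reads per taxon ID for each sample.

    Assumes tab-delimited rows: tagged query ID, taxon ID, e-value.
    """
    counts = defaultdict(Counter)
    for row in rows:
        row = row.rstrip("\n")
        if not row:
            continue
        query_id, taxid = row.split("\t")[:2]
        # Split only on the first separator to recover the sample name.
        sample, _read_id = query_id.split(SEP, 1)
        counts[sample][int(taxid)] += 1
    return counts


if __name__ == "__main__":
    record = ["@read1 1:N:0\n", "ACGT\n", "+\n", "IIII\n"]
    print("".join(tag_fastq(record, "sample01")), end="")

    demo_rows = ["sample01__read1\t562\t1e-10\n", "sample02__read1\t0\t0\n"]
    for sample, taxa in count_taxa(demo_rows).items():
        print(sample, dict(taxa))
```

In practice the input files could be opened with `gzip.open(path, "rt")` and the tagged lines written to a single merged file that is then passed to `-q`. Note that this counts every read independently, so the forward and reverse reads of a pair each contribute their own classification.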

bbuchfink/diamond, issue #559