NBISweden / aMeta

Ancient microbiome snakemake workflow
MIT License

aMeta is loading KU DB individually for each sample #138

Open julienfumey opened 11 months ago

julienfumey commented 11 months ago

Hi,

From the KrakenUniq (KU) rules, it looks like aMeta loads the KU DB individually for each sample.

However, KU has an option to load the DB once on the working node and then reuse the loaded DB for every sample in subsequent commands on the same node. Since loading the DB is very time-consuming, it might be useful to offer this option. The pro is that each sample runs much faster; the con is that the job occupies the same node for a very long time. This is why it should be an option to choose, not the default configuration.

For info, the command we use in the lab:

# Load the DB:
krakenuniq \
--db ku.full/ \
--preload --threads 30

# Process the FASTQ files, reusing the preloaded DB
# (samples.ids: column 1 = sample ID, column 2 = FASTQ file(s); wdir must be set beforehand)
for sample in $(cut -f 1 samples.ids); do
  files=$(awk -v s="$sample" '$1 == s' samples.ids | cut -f 2)
  krakenuniq \
    --db ku.full/ \
    --threads 30 \
    --report-file "${wdir}/03_krakenuniq/${sample}.tax.report.tsv.gz" \
    --gzip-compressed \
    --fastq-input ${files}
done

NikolayOskolkov commented 11 months ago

@julienfumey You are right, aMeta currently loads the KrakenUniq DB individually for each fastq-file. There were two reasons why we decided to go this way.

First, we wanted each sample to go its own way, i.e. to be placed on its own node, because this suited the structure of our Swedish computer cluster (Uppmax, a big bioinformatics-friendly cluster with a high number of compute nodes). This, however, might not be optimal for other labs / clusters where fewer nodes are available. In addition, as you mentioned, it might be faster to load the DB once and process all the samples on one node.

Here comes the second reason: in my testing at that time (around 2020), a loop that loads the KrakenUniq DB once and then processes all fastq-files ran on one HPC but not on another, for unclear reasons. My feeling is that KrakenUniq may be sensitive to certain HPC configurations.

However, I would be very interested in revisiting this with the latest and improved (low-memory) KrakenUniq development. Thank you for posting the command line! Let me test it and get back to you.
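
To make the trade-off discussed in this thread concrete, here is a minimal shell sketch of what an opt-in preload switch could look like. It is not aMeta's implementation: the names KU_PRELOAD, KU_DB, KU_THREADS, OUT_DIR, DRY_RUN, run and classify_all are all hypothetical, and the krakenuniq flags are only the ones from the commands posted above. It assumes a tab-separated sample table with the sample ID in column 1 and the FASTQ path(s) in column 2. By default it only prints the commands (DRY_RUN=true), so it can be inspected without KrakenUniq installed.

```shell
#!/usr/bin/env bash
# Sketch only: every variable and function name here is hypothetical, not aMeta config.

KU_DB="${KU_DB:-ku.full/}"            # path to the KrakenUniq database
KU_THREADS="${KU_THREADS:-30}"
KU_PRELOAD="${KU_PRELOAD:-false}"     # opt-in: preload is off by default
OUT_DIR="${OUT_DIR:-03_krakenuniq}"
DRY_RUN="${DRY_RUN:-true}"            # true: print commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# classify_all SAMPLE_TABLE
# Optionally preloads the DB once, then classifies every sample on this node.
classify_all() {
  local table="$1" sample files
  if [ "$KU_PRELOAD" = "true" ]; then
    run krakenuniq --db "$KU_DB" --preload --threads "$KU_THREADS"
  fi
  # Read the table on file descriptor 3 so the commands cannot consume it via stdin.
  while IFS="$(printf '\t')" read -r sample files <&3; do
    [ -n "$sample" ] || continue
    # $files is left unquoted on purpose so that several paths split into arguments.
    run krakenuniq --db "$KU_DB" --threads "$KU_THREADS" \
      --report-file "$OUT_DIR/$sample.tax.report.tsv.gz" \
      --gzip-compressed --fastq-input $files
  done 3< "$table"
}
```

After sourcing the file, `KU_PRELOAD=true classify_all samples.ids` would print one preload command followed by one classification command per sample. With KU_PRELOAD left at false, only the per-sample commands are printed, which mirrors the current one-load-per-sample behaviour.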