MathiasEskildsen / ONT-AmpSeq

Snakemake workflow to generate OTU tables from barcoded ONT data
MIT License

Long minimap2 run times #6

Open JiriHosekAUDK opened 1 month ago

JiriHosekAUDK commented 1 month ago

Hi Mathias, I started to use the pipeline and successfully analyzed one nanopore dataset. In the second dataset I am facing quite high CPU time consumption at the minimap2 step. Most of the barcodes need around 2000 CPU hours to complete, and that is a bit too much for us. Is that normal? I noticed the concatenated vsearch output enters the minimap2 step as well, and it is 13 GB (it was around 2 GB in my previous dataset). I am trying to split the dataset into smaller parts (it has over 90 barcodes). Do you think that would help? Does it affect the result? Thank you in advance for your comments. Best regards, Jiri

***** LOG FILE ****
[Tue Oct 1 13:14:09 2024]
rule mapping:
    input: output/vsearch/samples/concatenated_otus.fasta, output/vsearch/samples/barcode71_cluster.fasta
    output: output/mapping/samples/barcode71_aligned.sam
    log: logs/mapping/barcode71.log
    jobid: 541
    reason: Missing output files: output/mapping/samples/barcode71_aligned.sam
    wildcards: sample=barcode71
    threads: 192
    resources: tmpdir=/scratch/45198183, mem_mb=40960, runtime=2880

Activating conda environment: .snakemake/conda/b4ad49d05807d28c69fc83ae01fc9768_
[M::mm_idx_gen::2.288*1.16] collected minimizers
[M::mm_idx_gen::2.464*3.33] sorted minimizers
[M::main::2.471*3.33] loaded/built the index for 126454 target sequence(s)
[M::mm_mapopt_update::2.483*3.32] mid_occ = 33677
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 126454
[M::mm_idx_stat::2.493*3.31] distinct minimizers: 707933 (69.15% are singletons); average occurrences: 46.680; average spacing: 5.500; total length: 181748707
[M::worker_pipeline::1947.862*188.52] mapped 349972 sequences
[M::worker_pipeline::2833.056*188.74] mapped 348424 sequences
[M::worker_pipeline::3775.738*188.88] mapped 348993 sequences
[M::worker_pipeline::5032.959*189.29] mapped 349424 sequences
[M::worker_pipeline::6093.459*189.33] mapped 349777 sequences
[M::worker_pipeline::7483.769*189.39] mapped 350044 sequences
[M::worker_pipeline::8940.199*189.46] mapped 348921 sequences
[M::worker_pipeline::10210.864*189.49] mapped 350140 sequences
[M::worker_pipeline::11411.147*189.49] mapped 348067 sequences
[M::worker_pipeline::12884.919*189.52] mapped 349026 sequences
[M::worker_pipeline::13613.981*189.51] mapped 349228 sequences
[M::worker_pipeline::14742.081*189.50] mapped 349462 sequences
[M::worker_pipeline::16232.916*189.49] mapped 348060 sequences
[M::worker_pipeline::18184.171*189.54] mapped 348090 sequences
[M::worker_pipeline::21198.651*189.59] mapped 346896 sequences
[M::worker_pipeline::22663.811*189.62] mapped 349837 sequences
[M::worker_pipeline::23677.641*189.61] mapped 349724 sequences
[M::worker_pipeline::24808.674*189.62] mapped 349702 sequences
[M::worker_pipeline::26612.533*189.63] mapped 348792 sequences
[M::worker_pipeline::28407.175*189.65] mapped 348533 sequences
[M::worker_pipeline::30634.108*189.66] mapped 345479 sequences
[M::worker_pipeline::32194.239*189.66] mapped 349916 sequences
[M::worker_pipeline::33154.673*189.65] mapped 349490 sequences
[M::worker_pipeline::34403.342*189.65] mapped 349544 sequences
[M::worker_pipeline::35559.824*189.64] mapped 349904 sequences
[M::worker_pipeline::36836.694*189.63] mapped 344231 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -ax map-ont -K500M -t 192 --secondary=no output/vsearch/samples/barcode71_cluster.fasta output/vsearch/samples/concatenated_otus.fasta
[M::main] Real time: 36836.936 sec; CPU: 6985490.047 sec; Peak RSS: 175.398 GB


MathiasEskildsen commented 1 month ago

Hi Jiri, I'm glad to hear that you at least successfully processed one dataset using ONT-AmpSeq. It is normal for minimap2 to be the bottleneck when processing large datasets, as the clusters from the individual barcodes are mapped against the concatenated clusters.
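Just to make the mechanics concrete: judging from the minimap2 command in your log, each barcode's clusters are indexed as the target and the entire concatenated OTU file is streamed through as the query, so the runtime grows with the size of that 13 GB file for every single barcode. Below is a rough sketch of that step using mappy (minimap2's Python bindings); it is not the actual Snakemake rule, just an illustration using the paths from your log.

```python
# Not the actual ONT-AmpSeq rule -- just a mappy (minimap2 Python bindings)
# sketch of what the mapping step does, using the paths from your log.
import mappy as mp

# The per-barcode clusters are indexed as the target...
aligner = mp.Aligner("output/vsearch/samples/barcode71_cluster.fasta",
                     preset="map-ont")
if not aligner:
    raise RuntimeError("failed to load/build the minimap2 index")

# ...and every sequence in the 13 GB concatenated OTU file is mapped against
# that index, which is why the runtime scales with its size for each barcode.
n_query = n_primary = 0
for name, seq, _qual in mp.fastx_read("output/vsearch/samples/concatenated_otus.fasta"):
    n_query += 1
    n_primary += sum(1 for hit in aligner.map(seq) if hit.is_primary)
print(f"{n_primary} primary alignments from {n_query} query sequences")
```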

However, I have a few questions to better assist you in processing this dataset in a timely manner:

How have you set your filtering thresholds, specifically your q-score? Increasing it can lower the overall data size and ensure that only the highest-quality reads are used (see the sketch below for a quick way to check how much data a stricter cutoff would keep).

What kind of data are you working with? Is it from high- or low-diversity samples?
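If it helps, here is a quick way to check how many reads per barcode would survive different cutoffs before rerunning the whole workflow. This is a hypothetical helper script, not part of ONT-AmpSeq; it computes the per-read q-score from the mean per-base error probability (the way ONT tools usually report it), so the numbers may differ slightly from whatever filtering tool and defaults your run uses.

```python
# Hypothetical helper (not part of ONT-AmpSeq): count how many reads in a
# barcode would survive different q-score cutoffs. The per-read q-score is
# -10*log10 of the mean per-base error probability, not the arithmetic mean
# of the Phred values.
import gzip
import math
import sys

def read_qscores(fastq_path):
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            fh.readline()            # sequence
            fh.readline()            # '+' separator
            qual = fh.readline().strip()
            errs = [10 ** (-(ord(c) - 33) / 10) for c in qual]
            yield -10 * math.log10(sum(errs) / len(errs))

if __name__ == "__main__":
    qscores = list(read_qscores(sys.argv[1]))      # e.g. barcode71.fastq.gz
    for cutoff in (18, 20, 23, 25, 28):
        kept = sum(q >= cutoff for q in qscores)
        print(f"Q>={cutoff}: {kept}/{len(qscores)} reads "
              f"({100 * kept / len(qscores):.1f}%)")
```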

Best regards, Mathias

JiriHosekAUDK commented 1 month ago

Hi again, Thanks a lot for the quick answer. I have set the q-score threshold to 18. I know the default is 20, but I calculated the statistics and the median of the read q-score distribution is around 20, and I wanted to be sure to keep most of the data. But I see there may be an option to reduce the CPU time there. These are high-diversity samples. Back to my earlier question: is it important that the individual barcodes are mapped against all of the concatenated clusters? If I split the dataset, the concatenated file would get smaller and minimap2 faster, would it not? Best regards, Jiri

MathiasEskildsen commented 1 month ago

Hi,

I would recommend raising the q-score to a range of 23-25, maybe even 28 if you still have a lot of reads in each barcode (>20k).

You can decide to subset your dataset, and this will dramatically speed up minimap2. However, you would not necessarily be able to compare samples across subsets, and the relative abundances provided by Ampvis2 would only be meaningful for the reads within each subset (the toy example below illustrates why).
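To illustrate that comparability point with made-up numbers (nothing from your data): relative abundances are fractions of a per-subset total, so splitting the dataset changes both the OTU set and the denominator.

```python
# Toy illustration (made-up counts): relative abundances computed per subset
# are not directly comparable to those computed over one shared OTU table,
# because each subset only "sees" its own OTUs and read totals.
counts_full = {"OTU_1": 800, "OTU_2": 150, "OTU_3": 50}   # one shared OTU table
subset_a    = {"OTU_1": 800, "OTU_2": 150}                # OTU_3 ended up in another subset

def rel_abund(counts):
    total = sum(counts.values())
    return {otu: round(n / total, 3) for otu, n in counts.items()}

print(rel_abund(counts_full))  # OTU_1: 0.8
print(rel_abund(subset_a))     # OTU_1: ~0.842 -- same reads, different denominator/OTU set
```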

It is difficult for me to say what the correct course of action is in your case, as I do not know your research question, deadlines, resources, etc. But I would start by raising the q-score significantly if you are interested in using relative abundances to compare all of your samples.

Best regards, Mathias

JiriHosekAUDK commented 1 month ago

Hi, OK, thank you for now. I will restart with a higher q-score limit and see. Best, Jiri