Russel88 / MAGinator

MAGinator - Accurate SNV calling and profiling of MAGs
MIT License

Parallelization #10

Open pabloati opened 10 months ago

pabloati commented 10 months ago

Hi, I would like to run MAGinator on a fairly large dataset. I have around 420 samples, with 60 bins per sample on average, and the preprocessed reads are around 6 GB per sample.

I have been running a subset of 5 samples as a trial on a cluster (40 ppn and 180 GB), and it has been running for more than 24 hours already.

Is there any possibility to run MAGinator in parallel to speed up the process? I am running the following command:

maginator -v trial/maginator_clusters.tsv \
    -r trial/maginator_reads.csv \
    -c trial/maginator_contigs.fasta \
    -o trial/maginator \
    -g /home/people/pablop/workdir/databases/gtdb_release207_v2

Thank you, Pablo

Russel88 commented 9 months ago

Hi Pablo

If you're on a compute cluster, the best way to speed it up is to use multiple nodes. If you use the qsub system, you can add this to the maginator command:

--cluster qsub --cluster_info "-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}"
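For illustration, here is a minimal sketch of how the original command might look with those two options appended. The paths are the ones from the first post, and the resource string is copied as-is from this reply. Resource names such as thinnode depend on how your scheduler is configured, so treat them as an assumption. The script only prints the assembled command, so it can be reviewed before submitting anything.

```shell
#!/bin/sh
# Original arguments from the first post, unchanged.
set -- maginator \
  -v trial/maginator_clusters.tsv \
  -r trial/maginator_reads.csv \
  -c trial/maginator_contigs.fasta \
  -o trial/maginator \
  -g /home/people/pablop/workdir/databases/gtdb_release207_v2

# Append the two cluster options from this reply.
# {cores}, {memory} and {runtime} are kept as literal placeholders.
set -- "$@" \
  --cluster qsub \
  --cluster_info "-l nodes=1:ppn={cores}:thinnode,mem={memory}gb,walltime={runtime}"

# Dry run: print one argument per line for review.
printf '%s\n' "$@"
```

To launch the run instead of printing it, replace the final printf line with "$@".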

Can you see in the logs how far in the process maginator is? With 5 samples it shouldn't take that long.
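If the log file was written with terminal colour codes, it can be hard to read in a plain viewer. A rough sketch for stripping them and showing the last steps reached is below. The log path is hypothetical, so point it at wherever your run writes its main log.

```shell
#!/bin/sh
# strip_ansi: remove ANSI colour sequences (for example ESC[36m and ESC[0m) from stdin.
strip_ansi() {
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*m//g"
}

# Hypothetical log location (assumption); adjust to your output directory.
log_file="trial/maginator/logs/maginator.log"

# Print the last three logged steps, if the file exists.
if [ -f "$log_file" ]; then
  strip_ansi < "$log_file" | tail -n 3
fi
```

The last line printed shows which step the workflow most recently reported.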

pabloati commented 9 months ago

Hi Russel,

It was my bad that I didn't include those options at the beginning. However, I have done it now, and it seems to have got stuck after the refinement step. This is the output from MAGinator's log.

[2023-11-14 11:56:38] INFO: Running MAGinator version 0.1.18
[2023-11-14 11:56:40] INFO: Filtering bins
[2023-11-14 11:58:38] INFO: 297 bins in 76 VAMB clusters left after filtering
[2023-11-14 11:58:38] INFO: Classifying genomes with GTDB-tk
[2023-11-14 12:42:58] INFO: 76 clusters could be classified
[2023-11-14 12:42:58] INFO: Clustering genes and parsing GTDB-tk results
[2023-11-14 12:47:59] INFO: 76 VAMB clusters merged into 76 metagenomic species
[2023-11-14 12:47:59] INFO: Filtering of the gene clusters and readmapping
[2023-11-14 13:23:24] INFO: Identifying signature genes
[2023-11-14 14:26:57] INFO: A total of 76 clusters are included in the analysis.

Russel88 commented 9 months ago

Can you post the log for the signature_gene workflow?

pabloati commented 9 months ago

I have not been able to find that log. Should it be in the logs directory created by maginator?

I have been looking at your code, and the process stops after the refinement rule. I get the output file from that step, and its logs are there and show no error, but the next rule (gene_counts) is never executed.