PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data

Running ipdSummary faster? #79

Closed chenzixi07 closed 3 years ago

chenzixi07 commented 3 years ago

Hi all, we recently assembled a 2.2 Gb draft genome with ~8,000 scaffolds, and I am now trying to detect modifications with ipdSummary. The BAM file is ~1 TB.

However, after running for 1 day, only 40 scaffolds were done. Is it possible to make ipdSummary faster?

Here is the script I used. Because I need the IPD ratio for each site for downstream analysis, I set --minCoverage 0.

    ipdSummary mapped.bam.xml \
        --log-level INFO \
        --numWorkers 12 \
        --reference genome.xml \
        --pvalue 0.001 \
        --minCoverage 0 \
        --alignmentSetRefWindows \
        --identify m4C,m6A \
        --methylFraction \
        --gff basemods.gff \
        --csv basemods.csv

Thanks, Zixi

rhallPB commented 3 years ago

It is possible to parallelize the task by splitting the BAM and running multiple instances of ipdSummary. Take care when interpreting the results from ipdSummary on such a large genome, particularly with a highly fragmented assembly: the modification calls will be extremely noisy.
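One way to plan such a split is to group the scaffolds into a few batches of similar total length, so that each ipdSummary instance gets a comparable amount of work. The sketch below is only an illustration: it assumes a samtools-style `.fai` index of the reference (name in column 1, length in column 2), and `batch_contigs` is a hypothetical helper name, not part of kineticsTools.

```shell
#!/usr/bin/env bash
# Sketch: spread contigs over N batches of roughly equal total length.
# Assumption: the input is a samtools .fai index (col 1 = name, col 2 = length).
# batch_contigs is a hypothetical helper, not a kineticsTools command.

# Usage: batch_contigs FAI N
# Prints one line per batch: "<batch> <total_bp> <contig,contig,...>"
batch_contigs() {
    local fai=$1 n=$2
    # Largest contigs first, then always add to the currently lightest batch.
    sort -k2,2nr "$fai" | awk -v n="$n" '
        {
            best = 1
            for (i = 2; i <= n; i++)
                if (load[i] + 0 < load[best] + 0) best = i
            load[best] += $2
            names[best] = (names[best] == "" ? $1 : names[best] "," $1)
        }
        END {
            for (i = 1; i <= n; i++) print i, load[i] + 0, names[i]
        }
    '
}
```

Each printed contig list could then drive one split of the BAM (for example, regions passed to `samtools view -b` on an indexed BAM) and one ipdSummary instance. How a split is handed to ipdSummary depends on the installed version, so check `ipdSummary --help` before relying on any particular option.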

chenzixi07 commented 3 years ago

Thanks. Is there any alternative to ipdSummary? Something like MultiMotifMaker for MotifMaker, or any other unofficial software?

Our cluster has two CPUs with 48 threads in total, but only 128 GB of memory. Running ipdSummary with --numWorkers 12 uses ~8 GB of memory per thread, which takes up most of the memory. It seems that we need more memory to run this job faster, rather than splitting the BAM.
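The memory ceiling described here can be written out as a small calculation. This is a rough sketch using the figures quoted above (128 GB total, ~8 GB per worker); the 16 GB reserve for the operating system and other processes is an assumed value, and `max_workers` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: how many workers fit in memory at a given per-worker footprint.
# Assumption: memory use scales linearly with the number of workers.

# Usage: max_workers TOTAL_GB PER_WORKER_GB RESERVE_GB
max_workers() {
    local total=$1 per=$2 reserve=$3
    # Integer division: workers that fit after setting the reserve aside.
    echo $(( (total - reserve) / per ))
}

# 128 GB node, ~8 GB per worker, 16 GB held back (assumed reserve).
max_workers 128 8 16   # prints 14
```

With these numbers the node tops out well below its 48 threads, so the per-worker footprint, not the core count, is the limit. Re-running the calculation with the per-worker figure measured on a smaller input shows how many more workers would fit.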

rhallPB commented 3 years ago

Unfortunately, there is no alternative to ipdSummary; it is currently the only way to calculate the deviation of kinetics from the trained model (the IPD ratio). The memory requirement will go down if you split the BAM, allowing you to use more cores per job. Alternatively, consider limiting the analysis to only the large contigs; results for the highly fragmented small contigs will not be optimal.
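To restrict the analysis to the large contigs, the contig names can be pulled from a reference index by length. A minimal sketch, again assuming a samtools-style `.fai` file; `large_contigs` is a hypothetical helper name and the length cutoff is for you to choose, since no threshold is suggested in this thread.

```shell
#!/usr/bin/env bash
# Sketch: list contigs at or above a minimum length.
# Assumption: the input is a samtools .fai index (col 1 = name, col 2 = length).

# Usage: large_contigs FAI MIN_LEN
# Prints the names of contigs whose length is >= MIN_LEN, one per line.
large_contigs() {
    local fai=$1 min_len=$2
    # "+ 0" forces a numeric comparison in awk.
    awk -v min="$min_len" '$2 + 0 >= min + 0 { print $1 }' "$fai"
}
```

For example, `large_contigs genome.fasta.fai 1000000` would list contigs of at least 1 Mb (the file name and cutoff are illustrative). The resulting names can then be used to subset the BAM or the reference windows before running ipdSummary.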

chenzixi07 commented 3 years ago

Thanks for your reply. I will try to split the bam.