ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Minimal configuration for 100 genomes mammal size #1512

Open fgualdr opened 2 weeks ago

fgualdr commented 2 weeks ago

Hi, I am struggling to work out the minimal resources needed to run cactus on 100 mammal-sized genomes. I can use CPUs and GPUs, but I need to estimate the amount of resources for a given amount of time (at my institute, work is organized into projects with fixed resource allocations). Could I have a rule of thumb, given that I have access to NVIDIA V100 GPUs? Thanks in advance, F

glennhickey commented 2 weeks ago

That's a good question. I'd recommend using cactus-prepare to break the alignment into individual commands. Only the cactus-blast commands can use GPUs. You can run with and without GPU for the first couple of jobs to see what the difference is on your system.

The advantage of running this way (as opposed to a single run of cactus) is that it's much easier to explore different options for different subtasks, and resume/rerun things as needed.
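The with/without-GPU comparison suggested above could be scripted along these lines. This is only a minimal sketch: `time_command` and `compare` are hypothetical helpers, and the two commands are placeholders standing in for the CPU and GPU variants of one cactus-blast command produced by cactus-prepare. It measures wall-clock time only.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def compare(cpu_cmd, gpu_cmd):
    """Time both variants of one job; return (cpu_seconds, gpu_seconds, speedup)."""
    cpu_s = time_command(cpu_cmd)
    gpu_s = time_command(gpu_cmd)
    return cpu_s, gpu_s, cpu_s / gpu_s


if __name__ == "__main__":
    # Placeholder commands so the sketch runs anywhere (assumption): replace
    # them with the CPU and GPU variants of the same cactus-blast command.
    cpu_cmd = [sys.executable, "-c", "import time; time.sleep(0.2)"]
    gpu_cmd = [sys.executable, "-c", "import time; time.sleep(0.1)"]
    cpu_s, gpu_s, speedup = compare(cpu_cmd, gpu_cmd)
    print(f"CPU: {cpu_s:.1f}s  GPU: {gpu_s:.1f}s  speedup: {speedup:.2f}x")
```

Timing a couple of jobs this way gives a system-specific speedup rather than a general one, which matches the point that runtime depends heavily on the data and the machine.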

I'm going to start on a large alignment soon for the Vertebrate Genomes Project and have the same questions as you, and plan on beginning with something like the above approach. I will post details as I go, which will hopefully help serve as a guide in the future.

But in the meantime, the runtime is very data (divergence, assembly quality) and system dependent, so it's hard to give you any precise numbers...

zwh82 commented 1 week ago

@glennhickey Hi, I’m facing a similar issue and hope to get your help. I’m planning to construct a pangenome for hundreds or even thousands of bacterial genomes within a species. These genomes are fairly similar, with complete genome assemblies and an average nucleotide similarity of over 95%.

I previously tried using PGGB for pangenome construction, but it requires pairwise alignment due to being reference-free, which my server cannot handle. Additionally, the pangenome constructed by PGGB seems to have certain snarls, making the indexing process for vg giraffe extremely slow. As a result, I had to give up on PGGB.

Currently, as a novice, I’m exploring MC (Minigraph-Cactus). I tested it on five complete bacterial genomes to build a pangenome. However, with 64 cores specified, it took 20 minutes with only 6% CPU utilization, whereas PGGB took only 7 minutes. By the way, the Cactus Docker image has already been downloaded locally.

So, my questions are:

1) Can MC build pangenomes for hundreds or thousands of bacterial genomes?
2) Is there an issue with my MC command? Are the runtime and CPU utilization normal?

    Command being timed: "cactus-pangenome ./js ../genomes/genome2id.tsv --outDir ./ --outName mc --reference GCF_022343785 --maxCores 64"
    User time (seconds): 64.55
    System time (seconds): 18.18
    Percent of CPU this job got: 6%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 20:58.31
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 114284
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 8
    Minor (reclaiming a frame) page faults: 2968471
    Voluntary context switches: 455161
    Involuntary context switches: 24874
    Swaps: 0
    File system inputs: 752216
    File system outputs: 2380504
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Best, wenhai

glennhickey commented 1 week ago

MC runtime is dominated by minigraph construction. I recommend trying, say, about 100 genomes. Check your log to see how long each minigraph -xggs call takes (it runs in batches of 50). You can estimate the per-genome runtime by dividing the time required for minigraph -xggs by the number of input genomes.
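The division described above can be sketched as follows. The function names and the batch timings are hypothetical, and scaling the per-genome figure linearly to a larger cohort is an assumption of this sketch rather than something stated in the thread, so treat the result as a rough guide only.

```python
def per_genome_seconds(batch_seconds, genomes_in_test):
    """Total minigraph -xggs time divided by the number of input genomes."""
    if genomes_in_test <= 0:
        raise ValueError("genomes_in_test must be positive")
    return sum(batch_seconds) / genomes_in_test


def extrapolate_seconds(batch_seconds, genomes_in_test, target_genomes):
    """Scale the per-genome time to a larger cohort (linear scaling is assumed)."""
    return per_genome_seconds(batch_seconds, genomes_in_test) * target_genomes


if __name__ == "__main__":
    # Hypothetical timings read from a log: two batches of 50 genomes each.
    batches = [600.0, 900.0]
    rate = per_genome_seconds(batches, 100)
    total = extrapolate_seconds(batches, 100, 1000)
    print(f"{rate:.1f} s per genome, about {total / 3600:.1f} h for 1000 genomes")
```

The real batch times should come from the log of the 100-genome test run.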

I strongly recommend reading this entire section of the documentation when determining the parameters to use.

About CPU usage, cactus-pangenome does very little work apart from launching other processes. As such, you cannot use time to estimate CPU utilization. The logs contain how much time and memory each process used, and you can also get stats as described here.
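For reference, the "Percent of CPU" line printed by GNU time is (user time + system time) divided by elapsed time. A small sketch using the figures from the output posted above shows where the 6% comes from; `parse_elapsed` and `cpu_percent` are hypothetical helper names. The number only reflects CPU time that time attributed to the command it wrapped.

```python
def parse_elapsed(text):
    """Convert GNU time's 'h:mm:ss' or 'm:ss' elapsed string to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_percent(user_s, system_s, elapsed):
    """CPU percentage as (user + system) / elapsed wall-clock time."""
    return 100.0 * (user_s + system_s) / parse_elapsed(elapsed)


if __name__ == "__main__":
    # Values copied from the posted output: 64.55 s user, 18.18 s system,
    # 20:58.31 elapsed.
    pct = cpu_percent(64.55, 18.18, "20:58.31")
    print(f"{pct:.2f}% of one core")
```

About 83 CPU-seconds over roughly 21 minutes of wall-clock time is a small fraction of a single core, which is consistent with the top-level command mostly waiting on the processes it launches.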

zwh82 commented 1 week ago

Thanks.