fgualdr opened 2 weeks ago
That's a good question. I'd recommend using cactus-prepare to break the alignment into individual commands. Only the cactus-blast commands can use GPUs. You can try running with and without GPU for the first couple of jobs to see what the difference is on your system.
The advantage of running this way (as opposed to a single run of cactus) is that it's much easier to explore different options for different subtasks, and to resume or rerun things as needed.
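As a dry-run sketch of that workflow (flag names follow the Cactus README at the time of writing; the seqFile and output paths are placeholders for your own data, and the command is echoed rather than executed):

```shell
# Hypothetical cactus-prepare invocation, echoed as a dry run.
# Replace seqfile/outdir with your own input list and output directory.
seqfile=evolverMammals.txt
outdir=steps
cmd="cactus-prepare $seqfile --outDir $outdir --outSeqFile $outdir/$seqfile --outHal $outdir/aln.hal --jobStore jobstore"
echo "$cmd"
# cactus-prepare prints the individual per-step commands (preprocessing,
# cactus-blast, cactus-align, ...); cactus-blast is the GPU-capable step.
```

Running the printed commands one at a time is what makes it easy to compare GPU vs. non-GPU runs and to resume after a failure.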
I'm going to start on a large alignment soon for the Vertebrate Genomes Project and have the same questions as you, and plan on beginning with something like the above approach. I will post details as I go, which will hopefully serve as a guide in the future.
But in the meantime, the runtime is very data-dependent (divergence, assembly quality) and system-dependent, so it's hard to give you any precise numbers.
@glennhickey Hi, I’m facing a similar issue and hope to get your help. I’m planning to construct a pangenome for hundreds or even thousands of bacterial genomes within a species. These genomes are fairly similar, with complete genome assemblies and an average nucleotide similarity of over 95%.
I previously tried using PGGB for pangenome construction, but because it is reference-free it requires all-pairs alignment, which my server cannot handle. Additionally, the pangenome constructed by PGGB seems to contain certain snarls that make the indexing process for vg giraffe extremely slow. As a result, I had to give up on PGGB.
Currently, as a novice, I'm exploring the Minigraph-Cactus (MC) pipeline. I tested it on five complete bacterial genomes to build a pangenome. However, with 64 cores specified, it took 20 minutes with only 6% CPU utilization, whereas PGGB took only 7 minutes. By the way, the cactus Docker image has already been downloaded locally.
So, my questions are:
1) Can MC build pangenomes for hundreds or thousands of bacterial genomes?
2) Is there an issue with my MC command? Are the runtime and CPU utilization normal?
Command being timed: "cactus-pangenome ./js ../genomes/genome2id.tsv --outDir ./ --outName mc --reference GCF_022343785 --maxCores 64"
User time (seconds): 64.55
System time (seconds): 18.18
Percent of CPU this job got: 6%
Elapsed (wall clock) time (h:mm:ss or m:ss): 20:58.31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 114284
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 8
Minor (reclaiming a frame) page faults: 2968471
Voluntary context switches: 455161
Involuntary context switches: 24874
Swaps: 0
File system inputs: 752216
File system outputs: 2380504
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Best, wenhai
MC runtime is dominated by minigraph construction. I recommend trying, say, about 100 genomes first. Check your log to see how long each minigraph -xggs call takes (it runs in batches of 50). You can estimate the per-genome runtime by dividing the time required for one minigraph -xggs call by the number of input genomes in its batch.
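That arithmetic, as a worked example (the batch time below is purely illustrative, not a measurement):

```shell
# Suppose (hypothetically) one minigraph -xggs batch of 50 genomes took 4000 s.
batch_seconds=4000
batch_size=50
per_genome=$((batch_seconds / batch_size))   # 4000 / 50 = 80 s per genome
echo "per-genome: ${per_genome}s"
# Extrapolating the construction step to a 1000-genome run:
total_hours=$((per_genome * 1000 / 3600))    # 80000 s -> ~22 h
echo "1000 genomes: ~${total_hours}h of minigraph construction time"
```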
I strongly recommend reading this entire section of the documentation when determining the parameters to use.
Regarding CPU usage: cactus-pangenome itself does very little work apart from launching other processes, so you cannot use time to estimate CPU utilization. The logs record how much time and memory each process used, and you can also get stats as described here.
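For what it's worth, GNU time's "Percent of CPU" is simply (user + sys) / elapsed, so it only reflects CPU time accounted to the launcher itself. Reproducing the reported 6% from the output pasted above:

```shell
# Figures taken from the /usr/bin/time -v output in this thread.
user=64.55
sys=18.18
wall=$(awk 'BEGIN { printf "%.2f", 20*60 + 58.31 }')   # 20:58.31 -> 1258.31 s
# GNU time truncates, so take the integer part of 100*(user+sys)/wall:
pct=$(awk -v u="$user" -v s="$sys" -v w="$wall" 'BEGIN { printf "%d", int(100*(u+s)/w) }')
echo "CPU%: $pct"
# The real alignment work happened in child processes whose CPU time was
# not accounted to cactus-pangenome, hence the low percentage.
```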
Thanks.
Hi, I am struggling to work out and estimate the minimal resources needed to run Cactus on 100 mammal-sized genomes. I can use CPUs and GPUs, but I need to estimate an amount of resources for an amount of time (work in my institute is organized as projects with resource allocations). Could I have a rule of thumb, given that I can access NVIDIA V100 GPUs? Thanks in advance, F