`vg msga` quite slow with 20 1Mb genomes

bredelings commented 3 years ago

Hi,

I have been using FSA to align about 20 de-novo-assembled genomes that are about 1MB long. They are from different individuals in the same species. FSA runs, but it takes tons of memory, and the resulting alignments have a lot of problems, especially near repeat regions.

An algorithm that used POA (or even better, a tree that was allowed to change across the sequence) would probably give a lot better results.

I just tried vg msga. My goal would be to dump the *.vg file back to FASTA after the graph is generated.

vg msga -f 2x-LT635612.fasta -b PVP01-LT635612 -B 128    -D -t 6 > test.vg

However it seems to slow down exponentially (or perhaps only polynomially) in the number of sequences. The first few sequences took a few hours, but it has been stuck on the 11th sequence for about a day.

Is the speed of vg msga something that is being worked on? Are there recommendations for how I should go about this?

Any guidance is appreciated!

thanks,

-BenRI

bredelings commented 3 years ago

This is the output for the last successful step:

ERR2679005-2x-LT635612: adding to graph 10/21
ERR2679005-2x-LT635612: aligning 1010808bp -> g:1696377bp n:160132 e:205746
ERR2679005-2x-LT635612: editing graph
ERR2679005-2x-LT635612: sorting and compacting ids
building xg index
building GCSA2 index
InputGraph::InputGraph(): 5922208 kmers in 1 file(s)
InputGraph::read(): Read 5922208 16-mers from /tmp/vg-dO4r33/vg-kmers-tmp-fkfDJ0
InputGraph::readKeys(): 3756478 unique keys
InputGraph::read(): Read 5922208 16-mers from /tmp/vg-dO4r33/vg-kmers-tmp-fkfDJ0
InputGraph::readFrom(): 3392933 unique start nodes
InputGraph::read(): Read 5922208 16-mers from /tmp/vg-dO4r33/vg-kmers-tmp-fkfDJ0
PathGraph::PathGraph(): 5922208 paths with 11844416 ranks
PathGraph::PathGraph(): 0.176496 GB in 1 file(s)
GCSA::GCSA(): Preprocessing: 1.1579 seconds, 4.94569 GB
GCSA::GCSA(): Prefix-doubling from path length 16
GCSA::GCSA(): Step 1 (path length 16 -> 32)
PathGraph::prune(): 5922208 -> 3927014 paths (3394850 ranges)
PathGraph::prune(): 3243730 unique, 0 redundant, 359884 unsorted, 323400 nondeterministic paths
PathGraph::prune(): 0.117034 GB in 1 file(s)
PathGraph::read(): File 0: Read 3927014 order-16 paths
PathGraph::extend(): File 0: Created 4060742 order-32 paths
PathGraph::read(): File 0: Read 4060742 order-32 paths
PathGraphBuilder::sort(): File 0: Sorted 4060742 paths
PathGraph::extend(): 3927014 -> 4060742 paths (8615096 ranks)
PathGraph::extend(): 0.122858 GB in 1 file(s)
GCSA::GCSA(): Step 2 (path length 32 -> 64)
PathGraph::prune(): 4060742 -> 4004742 paths (3584748 ranges)
PathGraph::prune(): 3491525 unique, 0 redundant, 130696 unsorted, 382521 nondeterministic paths
PathGraph::prune(): 0.120981 GB in 1 file(s)
PathGraph::read(): File 0: Read 4004742 order-32 paths
PathGraph::extend(): File 0: Created 4151065 order-64 paths
PathGraph::read(): File 0: Read 4151065 order-64 paths
PathGraphBuilder::sort(): File 0: Sorted 4151065 paths
PathGraph::extend(): 4004742 -> 4151065 paths (9367389 ranks)
PathGraph::extend(): 0.12768 GB in 1 file(s)
GCSA::GCSA(): Step 3 (path length 64 -> 128)
PathGraph::prune(): 4151065 -> 4138989 paths (3666349 ranges)
PathGraph::prune(): 3556087 unique, 0 redundant, 135245 unsorted, 447657 nondeterministic paths
PathGraph::prune(): 0.1272 GB in 1 file(s)
PathGraph::read(): File 0: Read 4138989 order-64 paths
PathGraph::extend(): File 0: Created 9714161 order-128 paths
PathGraph::read(): File 0: Read 9714161 order-128 paths
PathGraphBuilder::sort(): File 0: Sorted 9714161 paths
PathGraph::extend(): 4138989 -> 9714161 paths (59219924 ranks)
PathGraph::extend(): 0.43774 GB in 1 file(s)
GCSA::GCSA(): Prefix-doubling: 16.9 seconds, 4.94569 GB
GCSA::GCSA(): Merging the paths
MergedGraph::MergedGraph(): 3664810 paths with 8107893 ranks and 378171 additional start nodes
MergedGraph::MergedGraph(): 0.121167 GB
GCSA::GCSA(): Merging: 2.22692 seconds, 4.94569 GB
GCSA::GCSA(): Building the index
GCSA::GCSA(): Construction: 2.95495 seconds, 4.94569 GB
GCSA::GCSA(): 3664810 paths, 3753295 edges
GCSA::GCSA(): 4042981 pointers (650048 redundant)
GCSA::GCSA(): 633681 samples at 303585 positions
LCPArray::LCPArray(): Construction: 0.0936513 seconds, 4.94569 GB
LCPArray::LCPArray(): 3722983 values at 5 levels (branching factor 64)
[vg msga] : min_mem_length = 16, mem_reseed_length = 24, min_cluster_length = 0
ERR2679008-2x-LT635612: adding to graph 11/21
ERR2679008-2x-LT635612: aligning 1010546bp -> g:1696458bp n:160273 e:210530

ekg commented 3 years ago

vg msga is a kind of exploration of progressive graph construction. It's not good for anything more than a single gene and tens of input sequences. The main use has been testing.

I'd you want to build a lossless/vg-type pangenome graph efficiently, I'd suggest working with the pggb pipeline. github.com/pangenome/pggb. It's still in very active development but approaching a stable version.

ekg commented 3 years ago

With pggb you can force full-length alignment of those sequences (-N --no-split), resulting in a graph that is overall linear (at least as linear as the sequences suggest). There might still be collapse at VNTRs but this will be a local thing.

(Note that this option will be in pggb in the next day. The segment length option -s can do a similar thing.)

bredelings commented 3 years ago

Thanks! I will try it later this week.

ekg / seqwish

`vg msga` quite slow with 20 1Mb genomes #68