ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
499 stars 109 forks source link

The variation graph generated by the Minigraph-Cactus Pangenome Pipeline cannot be indexed as GCSA if there are too many genomes #740

Open Sunhh opened 2 years ago

Sunhh commented 2 years ago

I am working on a graph-based pan-genome construction for ten genomes from the same crop whose genome size is less than 500 Mb. When following the Minigraph-Cactus Pangenome Pipeline to construct the variation graph, I am able to generate the .vg format variation graph after finishing "cactus-graphmap-join" command, but I cannot finish the GCSA index from this graph. Initially, I met an error message saying the temporary space is over 2T limitation, which was solved by adding a pruning step with "vg prune". But now, the "vg index" continuously fails after throwing the log information "PathGraph::extend(): File 0: Created 8747360512 order-256 paths" "without coredump", dying with signal 9. May I ask what the problem could be or what I can do to fix this? Thank you and look forward to your help!

The genome I am working on is not very complicated, with 60% of it annotated as repeats which have been changed to lower cases in the input fasta files. The number seems high because it is assembled with HiFi reads. Besides, I can get the GCSA index if only six genomes are included, but I always fail when using seven genomes.

The commands I used to construct the pan-genome are listed below: cactus-minigraph ./jobstore w-pg/w.pg.txt w-pg/w.gfa.gz --realTimeLogging --reference W97 --mapCores 80 cactus-graphmap ./jobstore w-pg/w.pg.txt w-pg/w.gfa.gz w-pg/w.paf --realTimeLogging --reference W97 --outputFasta w-pg/w.gfa.fa.gz --mapCores 80 cactus-align ./jobstore w-pg/w.pg.txt w-pg/w.paf w-pg/w.hal --pangenome --pafInput --outVG --reference W97 --realTimeLogging cactus-graphmap-join ./jobstore --vg w-pg/w.vg --outDir ./w-pg --outName w-pg --reference W97 --vcf --giraffe --gfaffix --wlineSep "." --realTimeLogging vg convert -v w-pg/w-pg.xg > w-pg/w-pg.vg mkdir w-pg/temp/ vg index -x w-pg/new-w-pg.xg w-pg/w-pg.vg -t 60 -p --temp-dir w-pg/temp/ vg gbwt -x w-pg/new-w-pg.xg -o w-pg/w-pg.gbwt -v w-pg/w-pg.vcf.gz -p --temp-dir w-pg/temp/ vg prune -u -g w-pg/w-pg.gbwt -m w-pg/node_mapping w-pg/w-pg.vg -p -M 32 -t 100 > w-pg/w-pg.pruned.vg vg index -g w-pg/w-pg.gcsa -f w-pg/node_mapping w-pg/w-pg.pruned.vg -p --temp-dir w-pg/temp/ -t 60

The logging message are below: GCSA::GCSA(): Step 3 (path length 64 -> 128) PathGraph::prune(): 988860002 -> 977822179 paths (794195377 ranges) PathGraph::prune(): 751730745 unique, 0 redundant, 210455666 unsorted, 15635768 nondeterministic paths PathGraph::prune(): 34.2304 GB in 1 file(s) PathGraph::read(): File 0: Read 977822179 order-64 paths PathGraph::extend(): File 0: Created 1214382892 order-128 paths PathGraph::read(): File 0: Read 1214382892 order-128 paths PathGraphBuilder::sort(): File 0: Sorted 1214382892 paths PathGraph::extend(): 977822179 -> 1214382892 paths (6180215938 ranks) PathGraph::extend(): 50.1667 GB in 1 file(s) GCSA::GCSA(): Step 4 (path length 128 -> 256) PathGraph::prune(): 1214382892 -> 1174714728 paths (935955192 ranges) PathGraph::prune(): 869797981 unique, 0 redundant, 266991520 unsorted, 37925227 nondeterministic paths PathGraph::prune(): 47.9335 GB in 1 file(s) PathGraph::read(): File 0: Read 1174714728 order-128 paths PathGraph::extend(): File 0: Created 8747360512 order-256 paths Child died with signal 9, without coredump.

glennhickey commented 2 years ago

A few things

Some relevant vg issues: https://github.com/vgteam/vg/issues/3339 https://github.com/vgteam/vg/issues/3411 https://github.com/vgteam/vg/issues/3547 https://github.com/vgteam/vg/issues/3485

Sunhh commented 2 years ago

A few things

  • You will note that GCSA indexing is not part of the Minigraph-Pangenome pipeline, nor is it mentioned as being supported anywhere in the documentation. So I don't think the title of this issue is fair (claiming the pipeline doesn't work).
  • The steps you used to make a GBWT are wrong (you should use the one output by cactus-graphmap-join), but I don't think that'd make a difference for this crash.
  • Your only chance to make a GCSA is probably to run an allele frequency filter as described here: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#hprc-graph-filtering-by-allele-frequency

Some relevant vg issues: vgteam/vg#3339 vgteam/vg#3411 vgteam/vg#3547 vgteam/vg#3485

Thank you @glennhickey for your informative reply! Yes, you are right and I've modified the title to make my problem clear. I'll check the discussions you posted here and provide feedback later, but may I ask if there is another way to finish the variant calling through reads mapping with this cactus pipeline generated variation graph as the reference? "vg giraffe"?

In fact, I also tried to simply generate a VCF file for SVs and then use "vg construct" to build the pan-genome, which worked for GCSA indexing. I believe the "Minigraph-Cactus Pangenome Pipeline" is more sophisticated and accurate, and that's why I want to rely on this pipeline.

Thanks!

glennhickey commented 2 years ago

"vg giraffe"?

Exactly. cactus-graphmap-join produces all indexes needed for vg giraffe. You can find more about giraffe here: https://www.science.org/doi/10.1126/science.abg8871

I recommend filtering out private variants to improve mapping accuracy as described here (but use -d 2 instead of -d 9 given your small number of samples).

Sunhh commented 2 years ago

@glennhickey Hi Glenn, I've successfully finished the GCSA indexing after filtering the "--vgClipOpts" option according to your suggestion and those discussions. Thank you for your help! However, I've met a related problem. Noticing the allele frequency filtering comes from the HPRC Graph construction, I tried to follow the "Filtering Complex Regions and Indexing for Giraffe" step of HPRC Graph" construction too. In this step, the "cactus-graphmap-join" program required "the vg files created by the previous call of graphmap-join", but I couldn't find a .vg output file from the previous cactus-graphmap-join run. After checking the log files of cactus-align run, I determined to use hal2vg to generate one for use. May I as if my modification in the construction is correct? Thank you!

After having w-pg/w.vg and w-pg/w.hal from running "cactus-align".

cactus-graphmap-join ./jobstore --vg w-pg/w.vg --hal w-pg/w.hal --outDir ./w-pg --outName w-full --reference W97 --giraffe --gfaffix --wlineSep "." --realTimeLogging hal2vg w-pg/w-full.hal --progress --hdf5InMemory --ignoreGenomes Anc0 > w-pg/w-full.vg cactus-graphmap-join ./jobstore --vg w-pg/w-full.vg --hal w-pg/w-full.hal --outDir ./w-pg/ --outName w-clip --reference W97 --wlineSep "." --clipLength 10000 --clipNonMinigraph --vcf --giraffe --preserveIDs --realTimeLogging hal2vg w-pg/w-clip.hal --progress --hdf5InMemory --ignoreGenomes Anc0 > w-pg/w-clip.vg cactus-graphmap-join ./jobstore --vg w-pg/w-clip.vg --hal w-pg/w-clip.hal --outDir ./w-pg/ --outName w-minaf.0.1 --reference W97 --vgClipOptions "-d 2 -m 10000" --preserveIDs --giraffe --nodeStorage 1000 --realTimeLogging

glennhickey commented 2 years ago

cactus-graphmap-join always produces vg output. It's in <outDir>/clip-<outName>/.

Sunhh commented 2 years ago

Yes, now I find it! Thank you very much!