hal2vg segfault

leleory commented 1 year ago

Hi Glenn and all,

I am not sure whether I should report this issue here or in the cactus repo (the latest version here is 1.1.3, while the hal2vg from cactus 2.6.4 reports version 2.2):

hal2vg --help
hal2vg v2.2: Convert HAL alignment to handle graph

In any case, I am trying to convert a 40G hal alignment into a graph with the following command:

hal2vg Pigs_27-way_20230220.hal --progress --hdf5InMemory --noAncestors --refGenomes sus_scrofa_gca000003025v6 --inMemory --progress > Pigs_27-way_20230220_v264.pg

After the "pinching" of the nodes and "merging trivial segments" the command fails in the "converting" step and the only message I get is that hal2vg failed with "Segmentation fault".

I tested whether it is a memory issue. I ran the command with memory allocations of 3 TB and 4 TB, but the maximum memory usage of hal2vg in both cases was 1.8 TB.

How can I find out what the issue is here?

Thank you, Lel

SGE captured the attached output from the running script:

hal2vg.o31939394.txt

glennhickey commented 1 year ago

Hi. hal2vg has mostly been maintained for the pangenome pipeline. It uses an enormous amount of memory on progressive alignments, and I honestly don't think it could handle a 40G hal file in 4 TB of RAM. I understand this conflicts a bit with your 1.8 TB max memory number, though perhaps there's a 32-bit size getting overrun somewhere.
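To illustrate the kind of failure speculated about above: a value that no longer fits in a signed 32-bit integer wraps around to a negative number, and using that as a size or index can crash a program well before physical memory runs out. The sketch below only demonstrates the general mechanism; the count is a hypothetical number chosen for illustration and nothing here is taken from the hal2vg source.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Return value as it would read back from a signed 32-bit integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2**32 if value > INT32_MAX else value


# Hypothetical element count (an assumption, not a number from hal2vg).
hypothetical_count = 3_000_000_000

wrapped = wrap_int32(hypothetical_count)
print(f"stored as int32: {wrapped}")
```

A negative size or offset like this would be consistent with a crash at 1.8 TB on a 4 TB allocation, but confirming it would require a backtrace from the actual binary.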

In the minigraph-cactus paper I did run hal2vg on a progressive fly alignment as a point of comparison. The result was a bit of a mess though, and I don't think that vg offered any benefit over the hal for that data set.

Anyway, in summary:

- hal2vg is known to fail on large progressive alignments. I do want to eventually fix this, but it won't happen any time soon.
- Even if it ran through, I think the output would probably be of little value.
- If you want to make a pangenome of 27 pigs with Cactus, you'll have to use Minigraph-Cactus.

leleory commented 1 year ago

Besides the progressive cactus (PC) based alignment, we have the minigraph-cactus (MC) outputs as well. We generated both sets primarily because, besides pigs, we also included outgroup species in the analysis. The aim was to look for differences in called structural variants between the PC-based and MC-based alignments of this "hybrid pangenome". We wanted to use hal2vg and vg deconstruct to obtain the variants from the PC alignment.

Assuming hal2vg would run through with enough memory allocated, could this method introduce errors in the variant calls? Would the seqwish-based approach (mentioned in the pangenome paper and on the Cactus GitHub page) work if hal2vg fails to run, keeping in mind its shortcomings with SNP calls? Is it recommended to use the MC results for a pangenome that includes outgroup species?

glennhickey commented 1 year ago

That all sounds reasonable:

- If hal2vg somehow worked, I'd still be a bit skeptical of hal2vg -> vg deconstruct for progressive alignments, but mostly because it hasn't been tested. I don't think there's a fundamental limitation. We've improved the chaining in progressive cactus quite a bit since the paper, and vg deconstruct can now handle reference loops in order to support PGGB graphs.
- seqwish scales much better than hal2vg, so I think it would work. If you used hal2paf to get an alignment for each branch, you could feed the results into seqwish. You would indeed lose SNPs, since seqwish only considers exact matches. So if you have an A->C->A mutation along two branches, the two As would be left uncollapsed (you could probably recover many of these alignments with gfaffix).
- I've never used it, but there's a chain->paf converter: https://github.com/AndreaGuarracino/chain2paf. So you could run cactus-hal2chains to make a chain file of each genome against a reference, convert each of those to paf, then run seqwish, then gfaffix, then deconstruct on that same reference.
- You might also consider PGGB, which does an all-to-all mapping before seqwish.
- Adding outgroups to MC pangenomes is very high on my list of things to do. I can't recommend it one way or the other yet, but it's definitely something we'll be working on in the coming weeks.
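The A->C->A case in the seqwish point above can be sketched with a toy model. This is not seqwish's implementation: it is a minimal union-find over (genome, position) pairs that merges only exactly matching bases from per-branch alignments, assuming gapless, equal-length sequences. The genome names `anc`, `mid` and `leaf` and the function names are hypothetical.

```python
def induce_nodes(seqs, branch_alignments):
    """Group (genome, position) pairs, merging only exact base matches."""
    parent = {(g, i): (g, i) for g, s in seqs.items() for i in range(len(s))}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in branch_alignments:
        # Toy assumption: position i of genome a aligns to position i of b.
        for i, (x, y) in enumerate(zip(seqs[a], seqs[b])):
            if x == y:  # mismatches contribute no merge
                parent[find((a, i))] = find((b, i))

    groups = {}
    for key in parent:
        groups.setdefault(find(key), set()).add(key)
    return list(groups.values())


def same_node(groups, p, q):
    """True if the two (genome, position) pairs ended up in one group."""
    return any(p in g and q in g for g in groups)


# Middle base mutates A -> C on the first branch and C -> A on the second.
seqs = {"anc": "GAT", "mid": "GCT", "leaf": "GAT"}
branches = [("anc", "mid"), ("mid", "leaf")]  # one alignment per branch

groups = induce_nodes(seqs, branches)
print(same_node(groups, ("anc", 0), ("leaf", 0)))  # flanking G
print(same_node(groups, ("anc", 1), ("leaf", 1)))  # the two As
```

The flanking bases are merged transitively through `mid`, while the two As stay separate because neither is ever aligned to the other directly, only to the mismatching C. An all-to-all mapping, or a post-processing step such as gfaffix, is what would give them a chance to be collapsed.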

leleory commented 1 year ago

Thank you for the suggestions Glenn. I will try the various approaches.

ComparativeGenomicsToolkit / hal2vg

hal2vg segfault #62