ComparativeGenomicsToolkit / hal2vg

Convert HAL to VG
MIT License

hal2vg segfault #62

Closed leleory closed 1 year ago

leleory commented 1 year ago

Hi Glenn and all,

I am not sure whether I should report this issue here or in the cactus repo (the latest version here is 1.1.3, while the hal2vg from cactus 2.6.4 reports version 2.2):

hal2vg --help
hal2vg v2.2: Convert HAL alignment to handle graph

In any case, I am trying to convert a 40G HAL alignment into a graph with the following command:

hal2vg Pigs_27-way_20230220.hal --progress --hdf5InMemory --noAncestors --refGenomes sus_scrofa_gca000003025v6 --inMemory --progress > Pigs_27-way_20230220_v264.pg

After the "pinching" of the nodes and "merging trivial segments", the command fails in the "converting" step, and the only message I get is that hal2vg failed with "Segmentation fault".

I tested whether it is a memory issue. I ran the command with memory allocations of 3 TB and 4 TB, but the maximum memory usage of hal2vg was 1.8 TB in both cases.

How can I find out what the issue is here?

Thank you, Lel

SGE captured the attached output from the running script:

hal2vg.o31939394.txt
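One first check for a failure like this is whether the process really died from a segmentation fault (SIGSEGV) or was killed from outside, for example by a scheduler memory limit (typically SIGKILL). The following is a minimal Python sketch of that check, not part of hal2vg or cactus; the helper names `describe_exit` and `run_and_describe` are hypothetical. It relies only on the documented behaviour that `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable cause."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess encodes "killed by signal N" as the return code -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_describe(cmd, stdout=None):
    """Run cmd (a list of arguments) and describe how it ended.

    stdout may be an open file object to capture the command's output;
    by default the child inherits this process's stdout.
    """
    proc = subprocess.run(cmd, stdout=stdout)
    return describe_exit(proc.returncode)
```

Called with the hal2vg argument list and an open output file, `run_and_describe` would return "killed by SIGSEGV" for a genuine segmentation fault. In a plain shell script the same information appears as exit status 139 (128 + 11) for SIGSEGV and 137 (128 + 9) for SIGKILL.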

glennhickey commented 1 year ago

Hi. hal2vg has mostly been maintained for the pangenome pipeline. It uses an enormous amount of memory on progressive alignments, and I honestly don't think it could handle a 40G HAL file in 4 TB of RAM. I understand this conflicts a bit with your 1.8 TB max memory number, though perhaps there's a 32-bit size getting overrun somewhere.
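To illustrate the kind of failure a 32-bit overrun would cause, here is a small Python sketch using `ctypes` to mimic fixed-width integers. It is purely illustrative: nothing here is taken from the hal2vg source, the helper names are hypothetical, and the example value is an arbitrary large count, not a measured quantity from this alignment.

```python
import ctypes

INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold


def as_int32(n):
    """Return n as it would read back from a signed 32-bit variable."""
    # ctypes integer types do no overflow checking; the value wraps.
    return ctypes.c_int32(n).value


def as_uint32(n):
    """Return n as it would read back from an unsigned 32-bit variable."""
    return ctypes.c_uint32(n).value


def fits_int32(n):
    """True if n can be stored in a signed 32-bit integer without wrapping."""
    return -2**31 <= n <= INT32_MAX


# One past the signed maximum wraps around to the most negative value.
wrapped = as_int32(INT32_MAX + 1)

# A hypothetical count of 40 billion items silently becomes a much
# smaller number when truncated to 32 bits.
truncated = as_uint32(40 * 10**9)
```

A wrapped or truncated value used as an array index or allocation size can point far outside the intended memory, which is one way a program can segfault while still well below the machine's available RAM.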

In the minigraph-cactus paper I did run hal2vg on a progressive fly alignment as a point of comparison. The result was a bit of a mess though, and I don't think that vg offered any benefit over the hal for that data set.

Anyway, in summary:

leleory commented 1 year ago

Besides the progressive cactus (PC) based alignment, we also have the minigraph-cactus (MC) outputs. We generated both sets primarily because, besides pigs, we also included outgroup species in the analysis. The aim was to look for differences in called structural variants between the PC-based and MC-based alignments of this "hybrid pangenome". We wanted to use hal2vg and vg deconstruct to obtain the variants from the PC alignment.

Assuming hal2vg would run to completion given enough memory, could this method introduce errors in the variant calls? Would the Seqwish-based approach (mentioned in the pangenome paper and on the Cactus GitHub page) work if hal2vg fails to run, keeping in mind its shortcomings with SNP calls? Is it recommended to use the MC results with a pangenome that includes outgroup species?

glennhickey commented 1 year ago

That all sounds reasonable:

leleory commented 1 year ago

Thank you for the suggestions, Glenn. I will try the various approaches.