Closed: leleory closed this issue 1 year ago
Hi. hal2vg has mostly been maintained for the pangenome pipeline. It uses an enormous amount of memory on progressive alignments, and I honestly don't think it could handle a 40G hal file in 4TB RAM. I understand this conflicts a bit with your 1.8TB max memory number, though perhaps there's a 32-bit size getting overrun somewhere.
In the minigraph-cactus paper I did run hal2vg on a progressive fly alignment as a point of comparison. The result was a bit of a mess though, and I don't think that vg offered any benefit over the hal for that data set.
Anyway, in summary: hal2vg is known to fail on large progressive alignments. I do want to eventually fix this, but it won't happen any time soon.
Besides the progressive cactus (PC) based alignment, we also have the minigraph-cactus (MC) outputs. We generated both sets primarily because, besides pigs, we also included outgroup species in the analysis. The aim was to look for differences in called structural variants between the PC-based and MC-based alignments of this "hybrid pangenome". We wanted to use hal2vg and vg deconstruct to obtain the variants from the PC alignment. Assuming hal2vg ran to completion given enough memory, could this method introduce errors in the variant calls? Would the seqwish-based approach (mentioned in the pangenome paper and on the Cactus GitHub page) work if hal2vg fails to run (keeping in mind its shortcomings with SNP calls)? Is it recommended to use the MC results with a pangenome that includes outgroup species?
That all sounds reasonable:
If hal2vg somehow worked, I'd still be a bit skeptical of hal2vg -> vg deconstruct for progressive alignments, but mostly because it hasn't been tested. I don't think there's a fundamental limitation. We've improved the chaining in progressive cactus quite a bit since the paper, and vg deconstruct can now handle reference loops in order to support PGGB graphs.
seqwish scales much better than hal2vg, so I think it would work. If you used hal2paf to get an alignment for each branch, you could feed the results into seqwish. You would indeed lose SNPs, since seqwish only considers exact matches. So if you have an A->C->A mutation along two branches, the two As would be left uncollapsed (you could probably recover many of these alignments with gfaffix).
Another option would be to use cactus-hal2chains to make a chain file of each genome against a reference. Convert each of those to PAF, then run seqwish, then gfaffix, then deconstruct on that same reference.
Thank you for the suggestions, Glenn. I will try the various approaches.
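The exact-match point above (an A->C->A change along two branches leaving the two As uncollapsed) can be illustrated with a toy Python sketch. This is not seqwish and not part of any Cactus tooling. It only mimics the general idea of merging aligned bases by transitive closure over exact matches. The genome names, the three-base sequences, and the gapless per-branch alignments are all made up for illustration.

```python
# Toy model of exact-match graph induction (transitive closure over
# identical aligned bases). Hypothetical data; not seqwish's implementation.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def collapse_exact_matches(sequences, branch_alignments):
    """Merge aligned bases only where the two characters are identical."""
    uf = UnionFind()
    for name, seq in sequences.items():
        for i in range(len(seq)):
            uf.find((name, i))  # register every base as its own node
    for a, b in branch_alignments:
        # Assumption: each branch alignment is gapless and full-length.
        for i, (x, y) in enumerate(zip(sequences[a], sequences[b])):
            if x == y:
                uf.union((a, i), (b, i))
    return uf


# A -> C -> A at the middle position, along two consecutive branches.
sequences = {
    "grandparent": "GAT",
    "parent": "GCT",
    "child": "GAT",
}
branches = [("grandparent", "parent"), ("parent", "child")]
uf = collapse_exact_matches(sequences, branches)

same_flank = uf.find(("grandparent", 0)) == uf.find(("child", 0))
same_middle = uf.find(("grandparent", 1)) == uf.find(("child", 1))
print(same_flank, same_middle)  # True False
```

The flanking G is merged across all three genomes through the intermediate genome, while the two middle As stay as separate nodes because neither branch alignment contains an exact A-to-A match. That is the kind of redundancy a later normalization step such as gfaffix might be able to clean up.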
Hi Glenn and all,
I am not sure whether I should report this issue here or in the cactus repo (here the latest version is 1.1.3, while the code from cactus 2.6.4 shows hal2vg version 2.2):
hal2vg --help
hal2vg v2.2: Convert HAL alignment to handle graph
In any case, I am trying to convert a 40G hal alignment into a graph with the following command:
hal2vg Pigs_27-way_20230220.hal --progress --hdf5InMemory --noAncestors --refGenomes sus_scrofa_gca000003025v6 --inMemory --progress > Pigs_27-way_20230220_v264.pg
After the "pinching" of the nodes and "merging trivial segments", the command fails in the "converting" step, and the only message I get is that hal2vg failed with "Segmentation fault".
I tested whether it is a memory issue: I ran the command with allocations of 3TB and 4TB of memory, but the maximum memory usage of hal2vg was 1.8TB in both cases.
How can I find out what the issue is?
Thank you, Lel
SGE captured the attached output from the running script:
hal2vg.o31939394.txt