Each bottom segment has a pointer slot for each child in the tree, so the size of each segment grows with the number of children. HAL/Cactus were originally designed for phylogenetic trees which, to serve as guide trees at least, have very small degree.
But the Minigraph-Cactus pangenome pipeline repurposes HAL a bit to store 1-level trees with large degree (equal to the number of input genomes). And it turns out that if there are enough genomes (around 500), the BottomSegment HDF5 data type, as originally implemented, exceeds a hard limit in HDF5's header structure.
And we're finally at the point now where the HPRC is releasing more than 500 genomes, meaning this issue completely prevents Minigraph-Cactus....
Luckily, going over the notes from the old issues, it looks like there's a fairly simple fix: reorganize the bottom segment datatype to store the child indexes in an array. And the really nice surprise is that this array can be laid out equivalently to the old, explicit representation, so there are no breaks in compatibility, at least on small test datasets.
So in summary, this PR revises the bottom segment data type to use an array for the child indexes, rather than explicitly setting a subfield for each, while the layout on disk remains identical. This should in theory extend support to arbitrary numbers of children without affecting compatibility with old HAL files.
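To make the layout argument concrete, here is a minimal sketch using plain C++ structs rather than HDF5 compound types. The struct names, the `start`/`length` fields, and the fixed child count of 3 are all hypothetical stand-ins, not HAL's actual definitions. It only illustrates the general idea: an array of N values occupies the same bytes as N explicit fields of the same type, while the number of named members stays constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for illustration only; not HAL's real types or field names.
constexpr std::size_t kNumChildren = 3;

// Old style: one explicitly named field per child, so the number of
// named members grows with the number of children.
struct ExplicitSegment {
    std::int64_t start;
    std::int64_t length;
    std::int64_t child0;
    std::int64_t child1;
    std::int64_t child2;
};

// New style: a single array member holds all child indexes, so the
// number of named members is fixed.
struct ArraySegment {
    std::int64_t start;
    std::int64_t length;
    std::int64_t children[kNumChildren];
};

// Both structs occupy the same bytes in the same order.
static_assert(sizeof(ExplicitSegment) == sizeof(ArraySegment),
              "layouts must have the same size");
static_assert(offsetof(ExplicitSegment, child0) == offsetof(ArraySegment, children),
              "first child must sit at the same offset");
static_assert(offsetof(ExplicitSegment, child2) ==
                  offsetof(ArraySegment, children) + 2 * sizeof(std::int64_t),
              "last child must sit at the same offset");

// Reinterpret bytes written with the old layout using the new layout.
inline ArraySegment readAsArray(const ExplicitSegment& old) {
    ArraySegment out;
    std::memcpy(&out, &old, sizeof(out));
    return out;
}

// Bounds-checked access to a child index in the array layout.
inline std::int64_t childAt(const ArraySegment& seg, std::size_t i) {
    assert(i < kNumChildren);
    return seg.children[i];
}

// Named members needed by each style for a given number of children
// (two fixed fields plus the child fields in this toy layout).
constexpr std::size_t explicitMemberCount(std::size_t numChildren) {
    return 2 + numChildren;
}
constexpr std::size_t arrayMemberCount(std::size_t /*numChildren*/) {
    return 3;
}
```

In this toy layout, 500 children would need 502 named members in the explicit style but still only 3 in the array style, which is the property the fix relies on. Whether real HAL files round-trip identically is something only the HDF5-level tests can confirm.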
Resolves #212.