The current .skf format has the following issues:

- the files are larger than they need to be (the data could be compressed much better);
- the whole structure has to be read into memory at once, so the build & align phases cannot stream just the parts they need from disk.

This would be a large and breaking change. I think if we do this, it would make sense to have a second file format (.ska/.sklarge/.h5?) which can be selected by the user for these cases, and continue to allow the previous .skf format to work as the default.

Talking to Tommi, he suggests that better compression, and working on the compressed structure directly, may be preferable. See the const-refactor branch for a prototype. But for this to work properly the structure needs to either be bit packed (rather than many vectors) or a sparse array; a rough sketch of the bit-packing idea is below.
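To make the bit-packing idea concrete, here's a minimal sketch (not the const-refactor code, and all names are made up): middle bases packed 2 bits each into u64 words, with anything that doesn't fit into 2 bits (e.g. ambiguity codes) assumed to live in a separate sparse structure.

```rust
/// Illustrative only -- not the const-refactor implementation.
/// Pack A/C/G/T into 2 bits each, 32 bases per u64 word.
fn encode_base(b: u8) -> u64 {
    match b {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        _ => panic!("ambiguous bases assumed to go in a separate (sparse) structure"),
    }
}

/// Pack a column of middle bases into u64 words.
fn pack_bases(bases: &[u8]) -> Vec<u64> {
    let mut packed = vec![0u64; (bases.len() + 31) / 32];
    for (i, &b) in bases.iter().enumerate() {
        packed[i / 32] |= encode_base(b) << (2 * (i % 32));
    }
    packed
}

/// Recover the base at position `i` from the packed words.
fn get_base(packed: &[u64], i: usize) -> u8 {
    const LUT: [u8; 4] = [b'A', b'C', b'G', b'T'];
    LUT[((packed[i / 32] >> (2 * (i % 32))) & 0b11) as usize]
}

fn main() {
    let packed = pack_bases(b"ACGTACGT");
    assert_eq!(get_base(&packed, 3), b'T');
}
```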
File format alternatives
For the file format, HDF5 comes to mind (which uses B-trees too), but the Rust bindings require the HDF5 library to be installed, which is the main disadvantage (although there is good support for building it). Range queries/slices of datasets can be returned, and it's easy to add attributes to the same file, so it definitely fits the bill here.
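Roughly what that could look like with the hdf5 crate (0.8-style builder API); the exact builder/attribute calls here are from memory, untested, and the dataset layout is just illustrative:

```rust
// Sketch only: assumes the `hdf5` crate (0.8-style API) plus `ndarray`.
use hdf5::File;
use ndarray::s;

fn main() -> hdf5::Result<()> {
    // Write: one dataset for the encoded split k-mers, with k as an attribute.
    let kmers = ndarray::Array1::from_iter(0..10_000u64); // stand-in for real k-mers
    let file = File::create("split_kmers.h5")?;
    let ds = file.new_dataset_builder().with_data(&kmers).create("kmers")?;
    ds.new_attr::<u32>().create("k")?.write_scalar(&31u32)?;

    // Read back only a range of the dataset (a hyperslab), not the whole array.
    let file = File::open("split_kmers.h5")?;
    let ds = file.dataset("kmers")?;
    let block = ds.read_slice_1d::<u64, _>(s![1000..2000])?;
    assert_eq!(block.len(), 1000);
    Ok(())
}
```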
An alternative would be Apache Parquet, which has a native Rust implementation and snap compression. This would be suitable for the kmers and variants arrays, but it would make more sense to store the other fields (version, k size, etc.) with serde as before. To keep these together as a single file, could we just use tar? Starting to feel too complex imo.
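A sketch of the Parquet route using the arrow/parquet crates; the column names and types are illustrative rather than the real .skf schema:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative columns: encoded split k-mers and their middle-base variants.
    let kmers: ArrayRef = Arc::new(UInt64Array::from(vec![1u64, 2, 3, 4]));
    let vars: ArrayRef = Arc::new(StringArray::from(vec!["ACGT", "AC-T", "AAGT", "ACGA"]));
    let batch = RecordBatch::try_from_iter(vec![("kmers", kmers), ("variants", vars)])?;

    // Write with snappy compression (pure Rust, via the snap crate).
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build();
    let mut writer =
        ArrowWriter::try_new(File::create("kmers.parquet")?, batch.schema(), Some(props))?;
    writer.write(&batch)?;
    writer.close()?;

    // Read back in blocks of rows rather than all at once.
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open("kmers.parquet")?)?
        .with_batch_size(1024)
        .build()?;
    for batch in reader {
        let batch = batch?;
        println!("block of {} rows", batch.num_rows());
    }
    Ok(())
}
```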
Streaming from disk
For the second issue, storing blocks of B-trees by key range, and merging them carefully, could allow streaming of just the relevant parts during the build & align phases. For example, see https://github.com/wspeirs/btree (there is a rough sketch of the merge step below).
Both file formats above would work here. Arrow can read and write blocks of rows. HDF5 can take a slice/range query.
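Independent of the format, the careful merging could be a k-way merge over sorted blocks, so only one block per input needs to be in memory at a time. A rough sketch with plain iterators standing in for whatever actually pulls blocks from disk (k-mers assumed to be encoded as u64 here):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Lazily merge several streams of sorted items, yielding them in global
/// sorted order. The inputs here are plain iterators; in practice they would
/// be readers pulling one block/range from disk at a time.
fn merge_sorted<I>(inputs: Vec<I>) -> impl Iterator<Item = u64>
where
    I: Iterator<Item = u64>,
{
    let mut inputs = inputs;
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first item of each stream.
    for (idx, it) in inputs.iter_mut().enumerate() {
        if let Some(v) = it.next() {
            heap.push(Reverse((v, idx)));
        }
    }
    std::iter::from_fn(move || {
        let Reverse((v, idx)) = heap.pop()?;
        // Refill from the stream we just took the smallest item from.
        if let Some(next) = inputs[idx].next() {
            heap.push(Reverse((next, idx)));
        }
        Some(v)
    })
}

fn main() {
    let a = vec![1u64, 4, 9].into_iter();
    let b = vec![2u64, 3, 10].into_iter();
    let merged: Vec<u64> = merge_sorted(vec![a, b]).collect();
    assert_eq!(merged, vec![1, 2, 3, 4, 9, 10]);
}
```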
Other notes
I feel like serde should be capable of at least some of this; see e.g. https://serde.rs/stream-array.html and https://serde.rs/ignored-any.html. But initial attempts with the current format weren't working well and I'm not sure why, so if the format needs to be changed anyway, introducing one more designed for streaming operations might be sensible.
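For reference, the stream-array pattern is a custom Visitor whose visit_seq handles elements one at a time instead of collecting them into a Vec. A sketch, with serde_json standing in for whichever serialisation format is actually used, and counting as a placeholder for real per-k-mer work:

```rust
use std::fmt;

use serde::de::{Deserialize, Deserializer, SeqAccess, Visitor};

/// Counts elements of a serialised sequence without buffering them into a Vec
/// (the pattern from serde's stream-array example).
struct CountKmers(usize);

impl<'de> Deserialize<'de> for CountKmers {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct SeqVisitor;

        impl<'de> Visitor<'de> for SeqVisitor {
            type Value = CountKmers;

            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                write!(f, "a sequence of encoded k-mers")
            }

            fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
            where
                A: SeqAccess<'de>,
            {
                let mut count = 0;
                // Elements arrive one at a time; real per-k-mer work would go here.
                while let Some(_kmer) = seq.next_element::<u64>()? {
                    count += 1;
                }
                Ok(CountKmers(count))
            }
        }

        deserializer.deserialize_seq(SeqVisitor)
    }
}

fn main() {
    // serde_json is just a stand-in format for the example.
    let counted: CountKmers = serde_json::from_str("[1, 2, 3, 4]").unwrap();
    assert_eq!(counted.0, 4);
}
```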