fjall-rs / fjall

LSM-based embeddable key-value storage engine written in safe Rust
https://fjall-rs.github.io/
Apache License 2.0
320 stars 13 forks

[Tracking] Breaking changes in V2 #54

Closed marvin-j97 closed 3 weeks ago

marvin-j97 commented 3 months ago

API

Data format

i18nsite commented 3 months ago

I hope that fixed-length keys and values can be considered when designing the format. Often, keys and values have a fixed length (such as a u64 id mapping to a file hash). I believe fixed-length fields can be optimized a lot.

I think you can refer to DuckDB's approach and consider periodically writing data to a log and compacting it into Parquet format. https://duckdb.org/docs/data/parquet/overview.html https://parquet.apache.org

I believe this format applies a lot of optimizations to the data.

You can use this library to read and write it: https://docs.rs/parquet/latest/parquet/

marvin-j97 commented 3 months ago

I hope that fixed-length keys and values can be considered when designing the format. Often, keys and values have a fixed length (such as a u64 id mapping to a file hash). I believe fixed-length fields can be optimized a lot.

I'm not sure fixed lengths can really be optimized in block-based tables. You would save at most 3 bytes per K-V pair, for a lot of added complexity. That could add up to decent savings for huge data sets, but not in block-based tables, and right now I don't plan on adding other table types.
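To make the "3 bytes per K-V pair" concrete, here is a toy sketch of the two layouts: one that stores a length prefix per entry and one that stores the fixed lengths once in a header. The encodings and names are illustrative assumptions, not fjall's actual on-disk format.

```rust
// Hypothetical block encodings, to illustrate per-entry length overhead.
// Not fjall's real format.

/// Length-prefixed layout: [key_len: u16][key][val_len: u8][val] per entry.
fn encode_prefixed(entries: &[(&[u8], &[u8])]) -> Vec<u8> {
    let mut out = Vec::new();
    for (k, v) in entries {
        out.extend_from_slice(&(k.len() as u16).to_le_bytes());
        out.extend_from_slice(k);
        out.push(v.len() as u8);
        out.extend_from_slice(v);
    }
    out
}

/// Fixed-length layout: lengths stored once in a 2-byte header, then raw entries.
fn encode_fixed(key_len: usize, val_len: usize, entries: &[(&[u8], &[u8])]) -> Vec<u8> {
    let mut out = vec![key_len as u8, val_len as u8];
    for (k, v) in entries {
        assert_eq!(k.len(), key_len);
        assert_eq!(v.len(), val_len);
        out.extend_from_slice(k);
        out.extend_from_slice(v);
    }
    out
}

fn main() {
    // u64 id -> 32-byte file hash, as in the example above.
    let entries: Vec<(Vec<u8>, Vec<u8>)> = (0u64..1000)
        .map(|i| (i.to_le_bytes().to_vec(), vec![0xAB; 32]))
        .collect();
    let refs: Vec<(&[u8], &[u8])> = entries
        .iter()
        .map(|(k, v)| (k.as_slice(), v.as_slice()))
        .collect();

    let prefixed = encode_prefixed(&refs).len();
    let fixed = encode_fixed(8, 32, &refs).len();
    // The difference is the 3 bytes of length metadata per entry,
    // minus the one-time 2-byte header.
    println!("prefixed={prefixed} bytes, fixed={fixed} bytes");
}
```

On 40-byte entries the fixed layout saves under 10% of block space, which is the trade-off weighed above against the added complexity.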

compressing it into parquet format.

Parquet is a column-based format with row groups. There is no notion of columns or rows here, so I'm not sure there's an advantage over packed K-V blocks. I do have some interest in implementing an alternative, row-group-based block format: the current blocks are laid out KVKVKVKV, but a Parquet-esque alternative could be KKKKVVVV, which would allow for better compression, depending on the values.
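The two layouts above can be sketched as follows, with a toy run-length encoder standing in for a real compressor to show why grouping similar values together can compress better. This is purely illustrative; the function names and layouts are assumptions, not fjall's block format.

```rust
// Interleaved layout: KVKVKVKV.
fn interleaved(entries: &[(Vec<u8>, Vec<u8>)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (k, v) in entries {
        out.extend_from_slice(k);
        out.extend_from_slice(v);
    }
    out
}

// Row-group layout: KKKKVVVV.
fn row_group(entries: &[(Vec<u8>, Vec<u8>)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (k, _) in entries { out.extend_from_slice(k); } // all keys first…
    for (_, v) in entries { out.extend_from_slice(v); } // …then all values
    out
}

/// Toy run-length encoding: (count, byte) pairs; fewer runs = more compressible.
fn rle(data: &[u8]) -> Vec<(u8, u8)> {
    let mut runs: Vec<(u8, u8)> = Vec::new();
    for &b in data {
        match runs.last_mut() {
            Some((n, last)) if *last == b && *n < u8::MAX => *n += 1,
            _ => runs.push((1, b)),
        }
    }
    runs
}

fn main() {
    // Keys vary, values are highly repetitive (e.g. a constant flag blob).
    let entries: Vec<(Vec<u8>, Vec<u8>)> = (0u8..100)
        .map(|i| (vec![i], vec![0u8; 16]))
        .collect();

    let kv = rle(&interleaved(&entries)).len();
    let grouped = rle(&row_group(&entries)).len();
    // Interleaving breaks up the value runs with keys; grouping keeps
    // the repetitive values contiguous, so far fewer runs are needed.
    println!("runs: interleaved={kv}, grouped={grouped}");
}
```

Real compressors (LZ4, zstd) exploit locality the same way RLE does here, which is why the benefit depends on how self-similar the values are.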