Explore another format for manifest

apache / horaedb

Apache HoraeDB (incubating) is a high-performance, distributed, cloud native time-series database.

https://horaedb.apache.org

Apache License 2.0

Explore another format for manifest #1600

Open jiacai2050 opened 3 days ago

jiacai2050 commented 3 days ago

Describe This Problem

Currently our manifest is defined using protobuf: https://github.com/apache/horaedb/blob/9e81c4ed5df1998cbd210dc48fc67b6b7405a553/horaedb/pb_types/protos/sst.proto#L32

Protobuf is useful for schema evolution, but not very efficient in our case:

For a Vec<struct> field, protobuf serializes metadata for every struct, which is a waste of space.
Serializing the manifest body as a whole with protobuf makes it hard to update incrementally.

Proposal

We can serialize the manifest ourselves. A proposed format:

| version(u8) |  Record(N) ... |

# Record is a self-descriptive message
| id(u64) | time_range(i64*2)| size(u32) |

When updating incrementally, we can just append new records at the end.
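A minimal sketch of how the proposed layout could be encoded and decoded, not the actual HoraeDB implementation. The `Record` struct, its field names, the function names, and the little-endian byte order are all assumptions made for illustration; the proposal only fixes the field widths.

```rust
use std::convert::TryInto;

/// Encoded size of one record: id (8) + time_range (2 * 8) + size (4).
pub const RECORD_LEN: usize = 28;

/// Hypothetical in-memory form of one record from the proposed format.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Record {
    pub id: u64,
    pub start: i64, // first half of time_range
    pub end: i64,   // second half of time_range
    pub size: u32,
}

impl Record {
    /// Append the fixed-width encoding of this record to `out`.
    /// Little-endian is an assumption; the proposal does not specify it.
    pub fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.id.to_le_bytes());
        out.extend_from_slice(&self.start.to_le_bytes());
        out.extend_from_slice(&self.end.to_le_bytes());
        out.extend_from_slice(&self.size.to_le_bytes());
    }

    /// Decode one record from exactly `RECORD_LEN` bytes.
    pub fn decode(buf: &[u8]) -> Option<Record> {
        if buf.len() != RECORD_LEN {
            return None;
        }
        Some(Record {
            id: u64::from_le_bytes(buf[0..8].try_into().ok()?),
            start: i64::from_le_bytes(buf[8..16].try_into().ok()?),
            end: i64::from_le_bytes(buf[16..24].try_into().ok()?),
            size: u32::from_le_bytes(buf[24..28].try_into().ok()?),
        })
    }
}

/// Encode `| version(u8) | Record(N) ... |`.
pub fn encode_manifest(version: u8, records: &[Record]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + records.len() * RECORD_LEN);
    out.push(version);
    for r in records {
        r.encode(&mut out);
    }
    out
}

/// Decode a manifest; returns `None` if the buffer is empty or truncated.
pub fn decode_manifest(buf: &[u8]) -> Option<(u8, Vec<Record>)> {
    let (&version, body) = buf.split_first()?;
    if body.len() % RECORD_LEN != 0 {
        return None;
    }
    let records = body
        .chunks_exact(RECORD_LEN)
        .map(Record::decode)
        .collect::<Option<Vec<_>>>()?;
    Some((version, records))
}
```

Because every record has the same width, an incremental update on the in-memory buffer is just `record.encode(&mut bytes)`; no per-record tags or length prefixes are written.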

Additional Context

This is where manifests get merged:

https://github.com/apache/horaedb/blob/e2970b1171523b182b36dc67e642641c47db078f/horaedb/metric_engine/src/manifest.rs#L269

zealchen commented 1 day ago

Interesting, I'll get into it.

zealchen commented 13 hours ago

Here are some key design considerations to clarify in advance:

Since the sstmeta schema may evolve over time, we need to ensure backward compatibility for each self-descriptive record. If so, I'm wondering how and when these manifest files will be utilized.
The object storage crate (e.g., LocalFileSystem) does not appear to have an append interface. This implies that during the do_merge operation, we would need to load the entire file at snapshot_path into memory, append the new data, and then write the entire file back.
What is the expected order of magnitude for the number of sstmeta files? For example, are we dealing with millions?

jiacai2050 commented 3 hours ago

Those are all great questions.

We use the version field to deal with schema evolution. If we want to add fields to the manifest, we add a new version, and during merge we convert the old manifest to the new one.
Object store doesn't have an append interface, so append here means downloading the old snapshot into memory, merging it with the new delta manifests, then uploading the result to overwrite the old one.
There shouldn't be many manifest files. We have a hard limit on how many delta files a manifest can have; if there are more than that, writes will fail, and new manifest deltas can only be created after the merge process has finished.
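The merge described in the answers above can be sketched roughly as follows, working purely on in-memory bytes (the download and upload steps are left out). This is an illustration under stated assumptions, not HoraeDB's `do_merge`: version 1 is taken to be the 28-byte layout from the proposal, while version 2 with an extra trailing `u32` field is purely hypothetical, as are all function and type names.

```rust
const V1: u8 = 1; // layout from the proposal: 28-byte records
const V2: u8 = 2; // HYPOTHETICAL newer layout: v1 record + one trailing u32
const V1_RECORD_LEN: usize = 28;
const V2_RECORD_LEN: usize = 32;

#[derive(Debug, PartialEq)]
pub enum MergeError {
    Empty,
    UnknownVersion(u8),
    Corrupt,
}

/// Convert a snapshot body of any known version to the newest layout.
fn upgrade_body(version: u8, body: &[u8]) -> Result<Vec<u8>, MergeError> {
    match version {
        V2 => {
            if body.len() % V2_RECORD_LEN != 0 {
                return Err(MergeError::Corrupt);
            }
            Ok(body.to_vec())
        }
        V1 => {
            if body.len() % V1_RECORD_LEN != 0 {
                return Err(MergeError::Corrupt);
            }
            let mut out = Vec::with_capacity(body.len() / V1_RECORD_LEN * V2_RECORD_LEN);
            for rec in body.chunks_exact(V1_RECORD_LEN) {
                out.extend_from_slice(rec);
                // Assumed default for the field that v1 records lack.
                out.extend_from_slice(&0u32.to_le_bytes());
            }
            Ok(out)
        }
        other => Err(MergeError::UnknownVersion(other)),
    }
}

/// "Append" without an append API: take the downloaded snapshot bytes,
/// upgrade them if needed, add the already-encoded delta records, and
/// return the full buffer that would overwrite the old snapshot.
pub fn merge_snapshot(
    old: &[u8],
    deltas: &[[u8; V2_RECORD_LEN]],
) -> Result<Vec<u8>, MergeError> {
    let (&version, body) = old.split_first().ok_or(MergeError::Empty)?;
    let mut out = vec![V2];
    out.extend(upgrade_body(version, body)?);
    for d in deltas {
        out.extend_from_slice(d);
    }
    Ok(out)
}
```

Since the whole snapshot is rewritten on every merge anyway, converting old records at that point costs no extra round trip to the object store.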

As for the third question, that is why we need to keep the metadata of each sst small, so that one manifest snapshot can hold millions of sst files while staying under 1GB.

1024*1024*1024 / 28 (size in bytes of each sst's metadata) ≈ 38347922
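The estimate can be double-checked with a few lines of Rust. The 28 bytes follow from the proposed record layout (`u64` id, two `i64` for the time range, `u32` size); the constant and function names are made up for this sketch, and the one-byte version header is subtracted for completeness.

```rust
/// 1 GiB, the snapshot size budget mentioned above.
pub const GIB: u64 = 1024 * 1024 * 1024;

/// id(u64) + time_range(i64 * 2) + size(u32), in bytes.
pub const RECORD_LEN: u64 = 8 + 2 * 8 + 4;

/// Number of whole records that fit in `limit_bytes`,
/// after reserving one byte for the version header.
pub fn max_records(limit_bytes: u64) -> u64 {
    limit_bytes.saturating_sub(1) / RECORD_LEN
}
```

Calling `max_records(GIB)` gives 38347922, so roughly 38 million ssts fit in a 1GB snapshot.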