ChainSafe / forest

🌲 Rust Filecoin Node Implementation
https://forest.chainsafe.io
Apache License 2.0

Database overlay backed by CAR file #3074

Closed lemmih closed 1 year ago

lemmih commented 1 year ago

Issue summary

A CAR file is an unordered stream of (CID, Ipld) pairs representing a (possibly incomplete) DAG. We often have to query data in a CAR file, and we currently do it by loading each key-value pair into a database. This is relatively slow, though, and it temporarily doubles the required storage space.

It should be possible to memory map a CAR file, scan through each key-value pair, build a mapping from key to position in the mapped file, and query it directly without using a database. Keeping the entire index in memory is feasible since each key is only 4 bytes, and there are roughly 55 million pairs in a mainnet snapshot file.
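The scan-and-index step described above could look roughly like the sketch below. It is not Forest's implementation. It assumes CARv1 framing (a varint-prefixed header followed by varint-prefixed frames, each holding a CID and then the block data), takes the file as a plain `&[u8]` (a memory map, for instance from the `memmap2` crate, dereferences to one), and keeps the full CID bytes as keys for simplicity. The names `Span`, `build_index` and `lookup` are hypothetical.

```rust
use std::collections::HashMap;

/// Byte range of one block's payload inside the CAR bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    offset: usize,
    len: usize,
}

/// Reads an unsigned LEB128 varint; returns (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Length in bytes of the CID at the start of `buf`.
fn cid_len(buf: &[u8]) -> Option<usize> {
    // CIDv0 is a bare sha2-256 multihash: 0x12 0x20 plus a 32-byte digest.
    if buf.starts_with(&[0x12, 0x20]) {
        return Some(34);
    }
    // CIDv1: version, codec, multihash code, digest length (all varints),
    // followed by the digest itself.
    let mut pos = 0;
    let mut last = 0;
    for _ in 0..4 {
        let (value, n) = read_varint(buf.get(pos..)?)?;
        last = value;
        pos += n;
    }
    pos.checked_add(usize::try_from(last).ok()?)
}

/// Scans every frame once and records where each block's data lives.
/// Returns `None` if the input is truncated or malformed.
fn build_index(car: &[u8]) -> Option<HashMap<Vec<u8>, Span>> {
    // Skip the header; this sketch does not decode it.
    let (header_len, n) = read_varint(car)?;
    let mut pos = n.checked_add(usize::try_from(header_len).ok()?)?;
    if pos > car.len() {
        return None;
    }
    let mut index = HashMap::new();
    while pos < car.len() {
        let (frame_len, n) = read_varint(&car[pos..])?;
        let start = pos + n;
        let end = start.checked_add(usize::try_from(frame_len).ok()?)?;
        let frame = car.get(start..end)?;
        let key_len = cid_len(frame)?;
        let key = frame.get(..key_len)?;
        let span = Span { offset: start + key_len, len: frame.len() - key_len };
        index.insert(key.to_vec(), span);
        pos = end;
    }
    Some(index)
}

/// Looks up a block by its raw CID bytes without copying the payload.
fn lookup<'a>(car: &'a [u8], index: &HashMap<Vec<u8>, Span>, cid: &[u8]) -> Option<&'a [u8]> {
    let span = index.get(cid)?;
    car.get(span.offset..span.offset + span.len)
}
```

Storing only an offset and a length per entry keeps the index small, and a lookup is then a hash probe plus a slice into the mapped bytes.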

Doing the same with a zstd compressed CAR file is more work but still doable.


lemmih commented 1 year ago

See tokio_util::codec for frame reading infrastructure. It should be possible to decode a stream of bytes (compressed CAR file) into a stream of key-value pairs annotated with the offset of the zstd frame and the byte offset of the value inside the frame. This would allow building an index for directly querying a compressed CAR file.
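To make the annotation idea concrete, here is a small sketch of the bookkeeping only. It assumes we already know, for each zstd frame, its offset in the compressed file and its decompressed size; a real decoder built on `tokio_util::codec` would track these while streaming. `FrameInfo`, `Location` and `locate` are hypothetical names, and no decompression happens here.

```rust
/// One zstd frame: where it starts in the compressed file and how many
/// bytes it yields once decompressed.
struct FrameInfo {
    compressed_offset: u64,
    decompressed_len: u64,
}

/// Index entry for a value inside a compressed CAR file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Location {
    /// Offset of the containing zstd frame in the compressed file.
    frame_offset: u64,
    /// Offset of the value inside that frame's decompressed bytes.
    offset_in_frame: u64,
}

/// Maps a position in the decompressed stream to the frame that holds it.
/// Returns `None` if the position lies past the last frame.
fn locate(frames: &[FrameInfo], mut pos: u64) -> Option<Location> {
    for frame in frames {
        if pos < frame.decompressed_len {
            return Some(Location {
                frame_offset: frame.compressed_offset,
                offset_in_frame: pos,
            });
        }
        pos -= frame.decompressed_len;
    }
    None
}
```

With such an entry, a query seeks to `frame_offset`, decompresses that one frame, and reads at `offset_in_frame`, instead of decompressing from the start of the file.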

aatifsyed commented 1 year ago

Zstd frames tend to be large (archives are often one frame), and must be decompressed linearly, so I don't think we should support compressed archives for this feature, at least not for MVP.

That is, seeking to offset 1024 involves seeking to 0, decompressing until 1024, then continuing. We'll absolutely demolish the disk.

> It should be possible to memory map a CAR file

Not for very large files; we'd at least get bitten on random access.

I can do some initial investigation into it this week.

lemmih commented 1 year ago

> Zstd frames tend to be large (archives are often one frame), and must be decompressed linearly, so I don't think we should support compressed archives for this feature, at least not for MVP.
>
> That is, seeking to offset 1024 involves seeking to 0, decompressing until 1024, then continuing. We'll absolutely demolish the disk.

Forest generates the compressed CAR files and we can set the frame count to suit our needs. That said, starting without support for compressed archives would still be a great first step.
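As a rough illustration of why the frame size matters, the sketch below estimates how many decompressed bytes must be produced to serve one read. It assumes every frame decompresses to the same fixed size, which is a simplification, and `bytes_decoded_for_read` is a hypothetical helper.

```rust
/// Decompressed bytes that must be produced to read `read_len` bytes at
/// decompressed position `pos`, assuming uniform frames of `frame_len`
/// decompressed bytes. Decoding has to start at the frame boundary
/// before `pos`.
fn bytes_decoded_for_read(frame_len: u64, pos: u64, read_len: u64) -> u64 {
    assert!(frame_len > 0, "frame_len must be non-zero");
    (pos % frame_len) + read_len
}
```

With a single frame covering the whole archive, the cost grows with `pos`; with small frames it is bounded by roughly one frame plus the read itself.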

> I can do some initial investigation into it this week.

Sounds great!