diba-io / carbonado

An apocalypse-resistant data storage format for the truly paranoid.
MIT License

Catalog Indexes #3

Open cryptoquick opened 1 year ago

cryptoquick commented 1 year ago

Catalogs are a flat-file snapshot of a database state. They are loaded into memory as a BTreeMap and flushed to disk only after a flush method is explicitly called, allowing a form of batching and giving the implementation flexibility over managed behavior, which in some ways is preferable to an embedded database.

Catalogs will facilitate chunking of large files into, say, 16MB chunks; chunks also help parallelize codec tasks across num_cpus.
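As a sketch of the parallelization angle only, assuming rayon for the thread pool and a placeholder codec function (neither is settled here):

```rust
use rayon::prelude::*;

/// 16 MB segment size, as suggested above.
const SEGMENT_SIZE: usize = 16 * 1024 * 1024;

/// Placeholder for the real per-segment codec pass (compress, encrypt, hash).
fn encode_segment(segment: &[u8]) -> Vec<u8> {
    segment.to_vec()
}

/// Splits a file's bytes into fixed-size chunks and runs the codec on each
/// chunk in parallel; rayon sizes its thread pool to num_cpus by default.
fn encode_chunked(data: &[u8]) -> Vec<Vec<u8>> {
    data.par_chunks(SEGMENT_SIZE).map(encode_segment).collect()
}
```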

A catalog should be a struct that contains a BTreeMap, exposes its methods, and adds new and flush utility methods.
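A minimal sketch of that shape, assuming String keys and byte-vector values (the actual types and on-disk encoding aren't settled in this issue):

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::ops::Deref;
use std::path::PathBuf;

/// A catalog: an in-memory BTreeMap snapshot, persisted in one batch.
pub struct Catalog {
    path: PathBuf,
    map: BTreeMap<String, Vec<u8>>,
    dirty: bool,
}

impl Catalog {
    pub fn new(path: PathBuf) -> Self {
        Self { path, map: BTreeMap::new(), dirty: false }
    }

    /// Mutations go through methods like this so `dirty` stays accurate.
    pub fn insert(&mut self, key: String, value: Vec<u8>) -> Option<Vec<u8>> {
        self.dirty = true;
        self.map.insert(key, value)
    }

    /// Flushes the whole in-memory state to disk in one batch.
    pub fn flush(&mut self) -> io::Result<()> {
        // Placeholder serialization; the real file would be encoded like
        // any other Carbonado file (compressed, encrypted, hashed).
        let mut buf = Vec::new();
        for (key, value) in &self.map {
            buf.extend_from_slice(key.as_bytes());
            buf.push(0);
            buf.extend_from_slice(value);
            buf.push(0);
        }
        fs::write(&self.path, buf)?;
        self.dirty = false;
        Ok(())
    }
}

/// Read-only access exposes the BTreeMap's methods (get, iter, len, ...).
impl Deref for Catalog {
    type Target = BTreeMap<String, Vec<u8>>;
    fn deref(&self) -> &Self::Target {
        &self.map
    }
}
```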

Maybe in a later issue, a flush can be performed on drop (but only after checking an explicitly-flushed state), though this is a lot of work and introduces some magical behavior.
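For illustration only, that deferred idea might look like this on the Catalog sketch above; note that Drop can't propagate errors, which is part of the magic:

```rust
impl Drop for Catalog {
    fn drop(&mut self) {
        // Only flush if there are unflushed changes.
        if self.dirty {
            // Drop can't return a Result, so a failed flush is silent here;
            // that's one source of the "magical behavior" concern.
            let _ = self.flush();
        }
    }
}
```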

dr-orlovsky commented 1 year ago

They are loaded into memory as a BTreeMap and flushed to disk only after a flush method is explicitly called

This implies that a process persists in memory.

What about the situation where it's used as a library, like sqlite or git, where the data is persisted on disk and not in memory?

Catalogs will facilitate chunking of large files into, say, 16MB chunks,

I think it's more important to allow preserving a semantic file structure instead of fixed-size chunks, like macOS packages and tarballs vs., say, torrents.

Maybe in a later issue, a flush can be performed on drop

Nope, since the process may not be given time to complete the flush.

cryptoquick commented 1 year ago

Good points. The complication is that the database file is encoded and hashed each time, so it can also be replicated. Maybe this could use a format like c3 that doesn't use hashes but still uses compression and encryption. Every write would block until all new data had been flushed to disk, perhaps even appended to the file. This would be similar to a write-ahead log. The BTreeMap could then use keys that point to offsets within the file.
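A minimal sketch of that write-ahead-log shape, assuming a single append-only file and an in-memory index of byte ranges (the record framing and key type are placeholders):

```rust
use std::collections::BTreeMap;
use std::fs::{File, OpenOptions};
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

/// Append-only log with a BTreeMap index mapping keys to byte ranges.
pub struct Log {
    file: File,
    index: BTreeMap<String, (u64, u64)>, // key -> (offset, length)
}

impl Log {
    pub fn open(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file, index: BTreeMap::new() })
    }

    /// Appends a value, blocks until the new data has reached disk,
    /// then records where it landed in the file.
    pub fn append(&mut self, key: &str, value: &[u8]) -> io::Result<()> {
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(value)?;
        self.file.sync_data()?; // block until flushed, write-ahead-log style
        self.index
            .insert(key.to_string(), (offset, value.len() as u64));
        Ok(())
    }
}
```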

I think it's more important to allow preserving a semantic file structure instead of fixed-size chunks, like macOS packages and tarballs vs., say, torrents.

This happens at a higher level than this library. Carbonado files aren't really the same as tarballs; they're for individual files, and if a file is very large, it should be chunked into separate segment files so it can be processed in parallel. Multiple files are tracked using a catalog that holds metadata such as filenames, which segments are needed, and in what order. This is similar to the function of IPLD. A torrent file can also include multiple files, so that comparison doesn't apply.
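A sketch of that per-file catalog metadata; the field names and 32-byte hash type are assumptions, not a settled format:

```rust
/// Metadata the catalog keeps per stored file, IPLD-style: which segment
/// files are needed, and in what order, to reconstruct the original.
pub struct CatalogEntry {
    /// Original filename of the stored file.
    pub filename: String,
    /// Hashes identifying each Carbonado segment file, in reassembly order.
    pub segments: Vec<[u8; 32]>,
}
```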