ARK-Builders / arklib

Core of the programs in ARK family
MIT License

Chunked mapping for storages #26

Open kirillt opened 2 years ago

kirillt commented 2 years ago

A storage is a subfolder of `.ark`, e.g. `.ark/index` or `.ark/tags`. It represents a mapping from `ResourceId` to some `T`.

For `.ark/index`, `T` is `Path`; for `.ark/tags`, `T` is `Set<String>`. Each entry can be represented by a file `.ark/<storage>/<resource_id>` with single-line content. This kind of storage should give us the fewest read/write conflicts, but it is not very efficient for syncing and reading. Old chunks could be batched into bigger multi-line files.

So, chunked storage would be a set of files like this:

```
.ark/<storage_name>/<batch_id1>
|-- <resource_id1> -> <value1>
|-- <resource_id2> -> <value2>

.ark/<storage_name>/<resource_id3>
|-- <value3>

.ark/<storage_name>/<batch_id2>
|-- <resource_id4> -> <value4>
|-- <resource_id5> -> <value5>
|-- <resource_id6> -> <value6>

.ark/<storage_name>/<resource_id7>
|-- <value7>
```
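A minimal sketch of how such a chunk file could be parsed and written back, assuming a hypothetical plain-text format of one `<resource_id> -> <value>` entry per line (the names `parse_chunk` and `serialize_chunk` are illustrative, not part of arklib):

```rust
use std::collections::BTreeMap;

/// Parse a multi-entry chunk: one `<resource_id> -> <value>` per line.
/// A single-entry chunk named after its resource id would hold just the value.
fn parse_chunk(contents: &str) -> BTreeMap<String, String> {
    contents
        .lines()
        .filter_map(|line| {
            let (id, value) = line.split_once(" -> ")?;
            Some((id.trim().to_string(), value.trim().to_string()))
        })
        .collect()
}

/// Serialize entries back into the same line-oriented format.
fn serialize_chunk(entries: &BTreeMap<String, String>) -> String {
    entries
        .iter()
        .map(|(id, value)| format!("{} -> {}", id, value))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let chunk = "id1 -> value1\nid2 -> value2";
    let parsed = parse_chunk(chunk);
    assert_eq!(parsed.get("id1"), Some(&"value1".to_string()));
    // BTreeMap keeps entries ordered, so the round-trip is stable.
    assert_eq!(serialize_chunk(&parsed), chunk);
}
```

Using an ordered map keeps chunk files byte-identical across rewrites, which should also help diff-based sync tools.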
kirillt commented 1 year ago

It should be possible to finely tune each storage according to the expected size of its values. Keys are always expected to be `ResourceId`. Values could range from `i8` for scores, to `Set<String>` for tags, to `Map<String, String>` for metadata.

It seems reasonable to keep scores and tags in a single file, one line per map entry. The only motivation to use chunked storage there is to reduce the number of conflicts. The single-file, line-per-entry case can still be implemented as chunked storage with a large setting like `chunk_size = 10000`.

For metadata, each entry should generate a separate file, which should be achievable with `chunk_size = 1`.

We could also use something in between for both. E.g., a tags storage with `chunk_size = 100` would be split into multiple files which are less likely to conflict under writes from different devices. Likewise, a metadata storage with `chunk_size = 10` would give us 10 times fewer files, with perhaps almost the same conflict frequency but easier synchronization across devices (this needs to be verified, of course).
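One way the `chunk_size` knob could be realized is hash-based bucketing: given an expected entry count and a target chunk size, derive a bucket count and route each `ResourceId` to a stable bucket. This is a sketch under those assumptions; `bucket_count` and `bucket_for` are hypothetical helpers, not arklib API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive the number of chunk files from expected entries and chunk size.
/// chunk_size = 1 degenerates to one file per resource.
fn bucket_count(expected_entries: usize, chunk_size: usize) -> usize {
    std::cmp::max(1, (expected_entries + chunk_size - 1) / chunk_size)
}

/// Route a resource id to a stable bucket, so the same id always lands
/// in the same chunk file regardless of which device writes it.
fn bucket_for(resource_id: &str, buckets: usize) -> usize {
    let mut h = DefaultHasher::new();
    resource_id.hash(&mut h);
    (h.finish() as usize) % buckets
}

fn main() {
    // 1000 expected tags entries, chunk_size = 100 -> 10 chunk files.
    assert_eq!(bucket_count(1000, 100), 10);
    // 5 metadata entries, chunk_size = 1 -> one file per entry.
    assert_eq!(bucket_count(5, 1), 5);
    // Routing is deterministic and stays within range.
    assert!(bucket_for("some-resource-id", 10) < 10);
}
```

A caveat: `DefaultHasher` is not guaranteed stable across Rust releases, so a real implementation would want a fixed hash function shared by all devices.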

kirillt commented 1 year ago

The modification timestamps of all chunks should be taken into account. Probably, each value should be tagged with the chunk it came from, so that it can be invalidated when that chunk is updated from outside.
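The invalidation idea above could be sketched as an in-memory cache that remembers, for each loaded chunk, the mtime observed at load time, and drops every value originating from a chunk whose on-disk mtime has since changed. All names here (`ChunkedCache`, `record`, `invalidate_if_stale`) are hypothetical:

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// In-memory view over chunked storage with provenance tracking.
struct ChunkedCache {
    /// chunk name -> mtime observed when the chunk was loaded
    loaded_at: HashMap<String, SystemTime>,
    /// resource id -> (originating chunk, value)
    values: HashMap<String, (String, String)>,
}

impl ChunkedCache {
    fn new() -> Self {
        ChunkedCache { loaded_at: HashMap::new(), values: HashMap::new() }
    }

    /// Remember a value together with the chunk it came from.
    fn record(&mut self, chunk: &str, mtime: SystemTime, id: &str, value: &str) {
        self.loaded_at.insert(chunk.to_string(), mtime);
        self.values
            .insert(id.to_string(), (chunk.to_string(), value.to_string()));
    }

    /// If the chunk's current on-disk mtime differs from the one we loaded,
    /// drop every value that came from it. Returns true if invalidated.
    fn invalidate_if_stale(&mut self, chunk: &str, current_mtime: SystemTime) -> bool {
        match self.loaded_at.get(chunk) {
            Some(t) if *t == current_mtime => false,
            _ => {
                self.values.retain(|_, (c, _)| c != chunk);
                self.loaded_at.remove(chunk);
                true
            }
        }
    }
}

fn main() {
    let t0 = SystemTime::UNIX_EPOCH;
    let t1 = t0 + Duration::from_secs(1);
    let mut cache = ChunkedCache::new();
    cache.record("batch1", t0, "res1", "hobby,nature");
    // Same mtime: nothing to do, value survives.
    assert!(!cache.invalidate_if_stale("batch1", t0));
    assert!(cache.values.contains_key("res1"));
    // Chunk was modified externally: value must be reloaded.
    assert!(cache.invalidate_if_stale("batch1", t1));
    assert!(!cache.values.contains_key("res1"));
}
```

Mtime comparison is only a cheap first filter; a content hash per chunk would be more robust against clock skew between syncing devices.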