ARK-Builders / arklib

Core of the programs in ARK family
MIT License

#15: Persisted/Replicated index #29

Closed kirillt closed 1 year ago

kirillt commented 1 year ago

Solves #15 by writing the known ResourceIndex into the .ark/index file. Subsequent scans reuse this file to load the index faster.

At the moment, the .ark/index file is plain text with the following format:

<modified> <filesize>-<crc32> <path>

Only relative paths are stored in the file, so the index can be read on any machine regardless of where the folder is checked out. In memory, however, paths are kept only as absolute paths, using the CanonicalPathBuf type.
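To make the line format and the relative-path rule concrete, here is a minimal sketch, not the code from this PR. The names `IndexLine`, `parse_line`, `format_line`, `to_relative` and `to_absolute` are hypothetical, plain `PathBuf` stands in for CanonicalPathBuf, and the sketch assumes that the timestamp, file size and CRC-32 are all written as decimal integers, which the description does not state.

```rust
use std::path::{Path, PathBuf};

/// One entry of the plain-text index (hypothetical type, not arklib's own).
#[derive(Debug, PartialEq, Eq)]
struct IndexLine {
    modified: u128, // assumption: integer timestamp written in decimal
    file_size: u64,
    crc32: u32,     // assumption: written in decimal, not hex
    path: PathBuf,  // relative to the indexed folder
}

/// Parses `<modified> <filesize>-<crc32> <path>`; returns None on malformed input.
fn parse_line(line: &str) -> Option<IndexLine> {
    // Split into at most three parts so that spaces inside the path survive.
    let mut parts = line.splitn(3, ' ');
    let modified = parts.next()?.parse().ok()?;
    let (size, crc) = parts.next()?.split_once('-')?;
    let path = parts.next()?;
    if path.is_empty() {
        return None;
    }
    Some(IndexLine {
        modified,
        file_size: size.parse().ok()?,
        crc32: crc.parse().ok()?,
        path: PathBuf::from(path),
    })
}

/// Writes an entry back in the same format.
fn format_line(entry: &IndexLine) -> String {
    format!(
        "{} {}-{} {}",
        entry.modified,
        entry.file_size,
        entry.crc32,
        entry.path.display()
    )
}

/// Absolute in-memory path -> relative path to store in the file.
fn to_relative(root: &Path, absolute: &Path) -> Option<PathBuf> {
    absolute.strip_prefix(root).ok().map(Path::to_path_buf)
}

/// Relative path read from the file -> absolute path under the local root.
fn to_absolute(root: &Path, relative: &Path) -> PathBuf {
    root.join(relative)
}
```

With these helpers, a path stored under one root on one machine can be resolved against a different root on another, which is the point of keeping the stored paths relative.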

There was a bug in the update function that made this optimization ineffective. For some reason, the stored and retrieved timestamps differ even when the resource has not been modified. The timestamps are in nanoseconds. In my experiments the difference was about 700 ns, not the 1 ms that would suggest the timestamps are simply truncated. This needs further investigation, but in effect any call to update removed everything from the index and added it back. A threshold of 1 ms is now introduced, so smaller differences are not treated as modifications. This should also help with resources that are modified by some process at high frequency.
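The threshold check could look roughly like the sketch below. It is an illustration only: `TIMESTAMP_TOLERANCE` and `is_modified` are hypothetical names, and treating a difference of exactly 1 ms as a modification is an assumption, since the description does not say whether the comparison is strict.

```rust
use std::time::{Duration, SystemTime};

/// Differences smaller than this are treated as noise (hypothetical constant).
const TIMESTAMP_TOLERANCE: Duration = Duration::from_millis(1);

/// Returns true when the two timestamps differ by at least the tolerance,
/// regardless of which one is later.
fn is_modified(stored: SystemTime, current: SystemTime) -> bool {
    let diff = match current.duration_since(stored) {
        Ok(elapsed) => elapsed,
        // `current` is earlier than `stored`; the error carries the absolute gap.
        Err(err) => err.duration(),
    };
    diff >= TIMESTAMP_TOLERANCE
}
```

Under this check, the roughly 700 ns drift described above no longer marks a resource as modified, while a real change made a few milliseconds later still does.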

ResourceKind mocks are removed. Kinds should not be stored in the .ark/index file, which keeps indexing quick and simple. Most likely, kinds should have their own table or be stored together with extra metadata (.ark/meta). The latter approach would mean writing computed values into user data, though. We should probably create a dedicated storage for generated metadata and call it "properties storage" (.ark/prop). That would make the separation between cache and user data explicit: values like kinds and page counts of documents would go to .ark/prop, while values like titles and descriptions of web bookmarks would go to .ark/meta.

kirillt commented 1 year ago

The index is sorted:

  1. By modified timestamp. The oldest resources come first and the newest go at the end. In the future we may implement chunked storage, where older resources go to a separate file that has a low probability of being updated, so fewer conflicts would be expected in the older "generation".
  2. By file size. Smaller first.
  3. By CRC-32 checksum.
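The three-level ordering above maps naturally onto a tuple sort key. The sketch below shows the idea under assumptions: `Entry` and `sort_index` are hypothetical names, and ascending CRC-32 order is assumed because the list does not give a direction for the checksum.

```rust
/// Minimal stand-in for an index entry (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Entry {
    modified: u128, // timestamp as an integer
    file_size: u64,
    crc32: u32,
}

/// Sorts oldest first, then smaller files first, then by checksum.
fn sort_index(entries: &mut [Entry]) {
    // Tuples compare lexicographically, which gives the three-level ordering.
    entries.sort_by_key(|e| (e.modified, e.file_size, e.crc32));
}
```

A deterministic order like this means that the same set of resources always serializes to the same file, which keeps diffs small when the index is replicated.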