ARK-Builders / arklib

Core of the programs in ARK family
MIT License
1 stars 10 forks source link

Better handling of _id_ collisions #58

Open kirillt opened 8 months ago

kirillt commented 8 months ago

At this moment, our index is structured like this:

pub struct ResourceIndex {
    pub id2path: HashMap<ResourceId, CanonicalPathBuf>,
    pub path2id: HashMap<CanonicalPathBuf, IndexEntry>,

    pub collisions: HashMap<ResourceId, usize>,

    root: PathBuf,
}

Ideally, we need id2path to have values of type HashSet<CanonicalPathBuf> because an id can have multiple path attached due to id collisions. We track a number of these collisions, but removing an id in a generic way still requires iteration through all paths to find matching entries (see the PR #57). Otherwise, if just take the path from id2path and remove it, we'll have the id left in path2id which is unreachable from id2path. And collisions[id] would be positive, too.


An idea I had some time before is composite ids:

The way I see it functioning is:

  1. When we index a resource, we compute it's id in a fast way (CRC-32 + file size).
  2. If there is no entry in the collisions mapping for the id, we just add it into id2path.
  3. If there is an entry in the collisions mapping for the id, we add value of another quick hashing function to it.
  4. If later another resource collides with this composite id again, we use SHA-256 for it.

We can use SHA-256 instead of using several hash functions, but it's very slow so we want to avoid it because we work with local user files. We might have background indexing process which upgrades fast ids to secure ids though.

Either way, if we do tricks with ids we have a problem:

  1. Library consumer's storages need to be updated with new ids, e.g. we emit events like Upgraded(old_id, new _id).
  2. If we lose the index file between app runs, we cannot catch up and upgrade consumer's storages. We can only present users with list of colliding paths involved into any of user's mappings and ask to manually select correct values...