ARK-Builders / arklib

Core of the programs in ARK family
MIT License
1 stars 10 forks source link

Index generic over hash function #75

Closed kirillt closed 6 months ago

kirillt commented 8 months ago

We can abstract ResourceIndex over the id type and, consequently, over hashing function.

#[derive(Eq, Ord, PartialEq, PartialOrd, Hash, Clone, Debug)]
pub struct IndexEntry<Id> {
    pub modified: SystemTime,
    pub id: Id,
}

#[derive(PartialEq, Clone, Debug)]
pub struct ResourceIndex<Id> {
    pub id2path: HashMap<Id, CanonicalPathBuf>,
    pub path2id: HashMap<CanonicalPathBuf, IndexEntry<Id>>,

    pub collisions: HashMap<Id, usize>,
    root: PathBuf,
}

This way, we can use cryptographic hash functions to test index in simpler cases, when no collisions are present. This can simplify development and debugging. We could also use fake hash functions for testing only collisions.

Cryptographic hash function can also be used for apps where safety is more important than performance. The fast hash functions will be used in experimental "fast mode", it's the most useful for file browser apps.

kirillt commented 8 months ago

We should also support two modes of operation for the library:

tareknaser commented 8 months ago

We could also support two modes of operation for the library:

* `safe` or `strict` using a cryptographic hash function

* `fast` using a fast hash function (experimental mode)

This is a very interesting idea To achieve this, we can use features with conditional compilation

tareknaser commented 7 months ago

Maybe we can implement ResourceIdTrait as an alternative to abstracting ResourceIndex over the ID type directly. It forces consistency across different ID implementations and simplifies handling different hashing algorithms. Each ResourceId struct will implement ResourceIdTrait and has its own logic for calculating the hash (which can be of different types).

Check out genericResourceId branch for a draft implementation, including how ResourceIdCrc32 works with the trait.

kirillt commented 7 months ago

@tareknaser When we finish #90 and move the code to ark-rust repo, is it a good idea to split "secure" and "fast" indexes into separate crates? This way we could create optimized index structure without counting collisions when we use secure hash functions.

I also imagine creating something like test-fs-index crate which would verify that fs-index-fast and fs-index-secure are semantically equivalent.

tareknaser commented 7 months ago

is it a good idea to split "secure" and "fast" indexes into separate crates?

I thought about this as well but couldn't find a straightforward solution. Ideally, I would prefer to decouple the index entirely from the resource ID computations, but this isn't feasible because we need to track collisions with non-cryptographic hash functions.

Once this PR is merged alongside https://github.com/ARK-Builders/arklib/pull/87, I plan to dedicate some time to exploring options for conditionally compiling the collision resolution functionality.

for example, we could do

#[derive(PartialEq, Clone, Debug)]
pub struct ResourceIndex {
    pub id2path: HashMap<ResourceId, CanonicalPathBuf>,
    pub path2id: HashMap<CanonicalPathBuf, IndexEntry>,

    #[cfg(feature = "non-cryptographic")]
    pub collisions: HashMap<ResourceId, usize>,
    root: PathBuf,
}

but I would dive deeper into it when we settle on the API for ResourceIndex first because it would require modifications all over the code.

kirillt commented 7 months ago

@tareknaser I like how conditional compilation looks in your example.

    #[cfg(feature = "non-cryptographic")]
    pub collisions: HashMap<ResourceId, usize>,

If it'll have same ergonomics all over the crate, then we could stick to it. Single index implementation should be better. Although it might be a bit more complicated to maintain.

I also imagine creating something like test-fs-index crate which would verify that fs-index-fast and fs-index-secure are semantically equivalent.

This test-fs-index crate would depend on 2 instances of same crate then: 1 with non-cryptographic flag ON, and one with the flag OFF. Is it technically possible?

Ideally, I would prefer to decouple the index entirely from the resource ID computations, but this isn't feasible because we need to track collisions with non-cryptographic hash functions.

Not sure about this part. In fact, ResourceId should be separate from index, because we might have non-fs index later. Or even apps without index, especially when the resources are not persisted. Collisions tracker is updated during index construction and updates, are you sure ResourceId is coupled to it?

kirillt commented 7 months ago

Actually, we already have such a crate: https://github.com/ARK-Builders/ark-rust/blob/main/data-resource

tareknaser commented 6 months ago

If it'll have same ergonomics all over the crate, then we could stick to it. Single index implementation should be better. Although it might be a bit more complicated to maintain.

Once the API in issue #91 is finalized, we'll likely simplify it. Additionally, some tests in index.rs might be better suited for integration testing within ark-rust using ark-cli crate. This way we validate the project's integrity between crates as well.

Simplifying the API and including only essential unit tests related to ResourceIndex would make maintenance easier if we add compilation features.

This test-fs-index crate would depend on 2 instances of same crate then: 1 with non-cryptographic flag ON, and one with the flag OFF. Is it technically possible?

It may not be feasible at the moment, but there's an ongoing discussion about it. See https://github.com/rust-lang/cargo/issues/2980#issuecomment-1552153211

Maybe that could work as an integration test. We could create a simple shell script that uses ark-cli to interact with fs-index, running it twice in the CI environment: once with the feature enabled and once disabled.

are you sure ResourceId is coupled to it?

It's not coupled per se but if we're using a non-cryptographic ResourceId, collisions still need to be tracked.