Open tareknaser opened 4 months ago
Related discussion: https://github.com/ARK-Builders/arklib/pull/87#discussion_r1494621339
Just to port our discussion from the call earlier today:
update
: Method to update the index by rescanning files and returning changes (additions/deletions/updates).
get_resource_by_id
: Query a resource from the index by its ID.get_resource_by_path
: Query a resource from the index by its path.
track_addition
: Track a newly added file (checks if the file exists in the file system).track_deletion
: Track the deletion of a file (checks if it was actually deleted).track_update
: Track an update on a single file
Additionally, ResourceIndex
should maintain the double hashmaps for storage
Thanks @tareknaser for putting this all together. Looks correct.
Auxiliary comments about these APIs:
Reactive API is the main API, it detects updates which can be pulled by the user/app.
Snapshot API is a simpler to use API for experimental apps or for apps without performance requirements.
Track API is an additional API for advanced developers. Developers must understand what they do when they use this API. Use-cases include saving a call to Reactive API when the user/app precisely knows what entry has been changed. Another use-case is implementing a foreground service in Android which would push updates using intents to subscribed apps. In this case, the subscriber app must update local index correspondingly and this code would be done by us as part of Android API.
File System Watcher can be seen as push-based version of Reactive API.
Additionally,
ResourceIndex
should maintain the double hashmaps for storage.
That's only for the sake of performance. But if benchmarks show that it isn't the best thing, or that simpler structure is enough, then we can use something else.
That's only for the sake of performance. But if benchmarks show that it isn't the best thing, or that simpler structure is enough, then we can use something else.
UPDATE:
I tried updating ResourceIndex
to use a vector of resources instead of double mappings. This change mainly impacted the Snapshot API and led to a significant hit in performance. It might be better to stick with the current structure.
Before
pub struct ResourceIndex {
/// The root path of the index (canonicalized)
pub(crate) root: PathBuf,
/// A map from resource IDs to resources
pub(crate) id_to_resource: HashMap<HashType, IndexedResource>,
/// A map from resource paths to resources
pub(crate) path_to_resource: HashMap<PathBuf, IndexedResource>,
#[cfg(feature = "non-cryptographic-hash")]
/// A map from resource IDs to the number of collisions (i.e., resources with the same ID)
pub(crate) collisions: HashMap<HashType, usize>,
}
After
#[derive(Clone, Debug)]
pub struct ResourceIndex {
/// The root path of the index (canonicalized)
pub(crate) root: PathBuf,
// Indexed resources
pub(crate) resources: Vec<IndexedResource>,
#[cfg(feature = "non-cryptographic-hash")]
/// A map from resource IDs to the number of collisions (i.e., resources with the same ID)
pub(crate) collisions: HashMap<HashType, usize>,
}
Here is the benchmarks changes from the double mappings to the vector:
Running benches/index_query_benchmarks.rs (target/release/deps/bench_get_resource-daa9ca3ecffcaba9)
Gnuplot not found, using plotters backend
index_query/get_resource_by_id
time: [1.6808 µs 1.6840 µs 1.6884 µs]
change: [+2612.5% +2626.4% +2643.6%] (p = 0.00 < 0.05)
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high severe
index_query/get_resource_by_path
time: [37.287 µs 37.326 µs 37.365 µs]
change: [+128223% +128390% +128558%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) low mild
2 (2.00%) high mild
1 (1.00%) high severe
index_query/update_index
time: [16.421 ms 16.444 ms 16.473 ms]
change: [+3.9468% +4.8545% +5.7332%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
index_query/track_addition
time: [16.414 µs 16.430 µs 16.448 µs]
change: [-2.3982% -2.1804% -1.9777%] (p = 0.00 < 0.05)
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
6 (6.00%) high mild
3 (3.00%) high severe
index_query/track_removal
time: [47.164 µs 47.184 µs 47.211 µs]
change: [+2065.1% +2074.3% +2081.2%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
This issue is an open discussion to changes proposed to the API in
src/index.rs
.Current situation
Some of the issues with the current structure are:
ResoureIndex
which prevents us from simply serializing the index to a fileprovide
may be confusing when compared to other methodsProposal
ResourceIndex
[derive(PartialEq, Clone, Debug, Serialize, Deserialize)]
pub struct ResourceIndex { pub resources: Vec,
pub root: PathBuf,
}