gauteh opened 2 years ago
Indexing is now usually so fast (2-300 ms) that it is not necessary to keep a full local cache of it. To speed up discovery it is probably best not to pre-compute the DAS and DDS requests either. I think a good solution could be:
Then we are less dependent on the latency to the database server, which seems to be in the 100s of ms range on a kubernetes cluster, but we can still use a standard setup.
In https://github.com/gauteh/hidefix/pull/8 a couple of different DBs have been benchmarked. Deserializing the full index of a large (4 GB) file takes about 8 µs (on my laptop); the serialized index is about 8 MB and takes about 100-150 ns to open from memory-mapped local databases (sled, heed). Reading the same 8 MB binary blob from redis, sqlite or similar takes about 3-6 ms, which is maybe a bit too high. It would be interesting to also try postgres.
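For reference, the shape of such a measurement can be sketched with std only. This is a hypothetical micro-benchmark, not the code from the PR: it writes an 8 MB blob (mimicking the serialized index size mentioned above) to a temp file and times reading it back; the file name and helper are made up, and nothing here touches hidefix or any database crate.

```rust
use std::fs;
use std::io::Write;
use std::time::{Duration, Instant};

// Hypothetical helper: write `len` zero bytes to a temp file, then time
// reading the whole blob back in one go. Returns (bytes read, elapsed).
fn write_then_read(len: usize) -> std::io::Result<(usize, Duration)> {
    let blob = vec![0u8; len];
    let path = std::env::temp_dir().join("hidefix-index-bench.bin");
    fs::File::create(&path)?.write_all(&blob)?;

    let start = Instant::now();
    let read = fs::read(&path)?;
    let elapsed = start.elapsed();

    fs::remove_file(&path)?;
    Ok((read.len(), elapsed))
}

fn main() -> std::io::Result<()> {
    // 8 MB, roughly the serialized index size discussed above.
    let (n, t) = write_then_read(8 * 1024 * 1024)?;
    println!("read {} bytes in {:?}", n, t);
    Ok(())
}
```

A single warm read like this mostly measures the OS page cache, which is exactly why the mmap-backed stores (sled, heed) come out so far ahead of a network round-trip to redis or sqlite.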
1) We need to keep data-discovery and dataset removal/update in mind:
2) I think we have to assume internal network latency is OK; I don't see how we can do much about that, except keeping communication to a minimum.
A solution could be:
Unfortunately this complicates things significantly, but I don't see how to avoid it when scaling up. It would be nice to still support a stand-alone server that does not need a central DB, but just caches locally and discovers datasets itself in some way. That would make it significantly easier to test the server out.
Some reasons:
Since data is usually on network disks, caching data could possibly be done using a large file-system cache, or maybe something like https://docs.rs/freqfs/latest/freqfs/index.html.
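To make the idea concrete, here is a toy in-memory, frequency-based file cache. This is not freqfs's actual API, just an illustration of the eviction policy under a byte budget; the type name, fields, and budget numbers are all invented:

```rust
use std::collections::HashMap;

/// Toy least-frequently-used cache: keep file contents in memory and evict
/// the entry with the fewest hits once the byte budget is exceeded.
struct FreqCache {
    budget: usize,
    used: usize,
    entries: HashMap<String, (Vec<u8>, u64)>, // path -> (bytes, hit count)
}

impl FreqCache {
    fn new(budget: usize) -> Self {
        Self { budget, used: 0, entries: HashMap::new() }
    }

    /// Look up a cached file, bumping its hit count.
    fn get(&mut self, path: &str) -> Option<&[u8]> {
        let e = self.entries.get_mut(path)?;
        e.1 += 1;
        Some(&e.0)
    }

    /// Insert a file, then evict least-frequently-used entries until we
    /// are back under budget.
    fn put(&mut self, path: String, bytes: Vec<u8>) {
        self.used += bytes.len();
        if let Some((old, _)) = self.entries.insert(path, (bytes, 1)) {
            self.used -= old.len();
        }
        while self.used > self.budget {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, v)| v.1)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    let (old, _) = self.entries.remove(&k).unwrap();
                    self.used -= old.len();
                }
                None => break,
            }
        }
    }
}
```

A real implementation would also need to re-validate entries against the network disk (mtime or checksum) so that dataset updates and removals are noticed, which ties back to point 1 above.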
@magnusuMET