Data-discovery and index

In https://github.com/gauteh/hidefix/pull/8 a couple of different DBs have been benchmarked. The deserialization of the full index of a large file (4gb) takes about 8 us (on my laptop), its about 8 mb, and takes about 100-150 ns to read from memory-mapped type local databases (sled, heed). Reading it (8 mb binary) from redis, sqlite or similar takes about 3 to 6 ms which is maybe a bit too high. It would be interesting to also try postgres.

1) We need to keep data-discovery and dataset removal/update in mind:

I think datasets should be registered, not auto-discovered by the data-server: the registration could be run by another dedicated service that auto-detects/scrapes sources.
When a data-file turns out to be missing, or mtime has changed, we return an error, possibly notifying the scraper-service.

2) I think that we have to assume internal network-latency is OK, I don't see how we can do much about that, except keeping communication at a minimum.

A solution could be:

Keep a central db with the index, DAS, DDS and list of datasets. This could be an SQL server or whatever, it is only written to by the scraper.
Each worker has a local cache of datasets (index, DAS, DDS) (e.g. heed, or maybe even just in-memory): to avoid having to verify that a dataset still exists it checks the mtime of the source on request. If the mtime is changed: Update cache from server. In the case of NCML-aggregates this will not be discovered.
When the central DB is changed, cache clearing is triggered at the workers. Retrieving new data from the central server is pretty cheap. This will handle NCML-changes.
This will make it possible to extend to cloud data-sources since the central-db then would point to e.g. an s3 URL.

Unfortunately this complicates things significantly, but I don't see how to avoid it when scaling up. It would be nice to still support a stand-alone server that does not need a central db, but just caches locally and discovers datasets itself in some way. That would make it significantly easier to test the server out.

Some reasons:

Storing the full index of all datasets on every worker takes a lot of space and needs to be kept in sync
Network disk of index is slow? Embedded databases like SQLite still too slow, so then need a memory mapped DB anyway
Indexing on-demand too slow, especially for aggregated datasets.

Since data is usually on network disks, caching data could possibly be done using large file system cache or maybe something like https://docs.rs/freqfs/latest/freqfs/index.html.

@magnusuMET

Indexing is now so fast (2-300 ms) usually that it is not necessary to keep a full cache of this locally. To speed up discovery it is probably best to not pre-compute the DAS and DDS requests either. I think maybe a good solution could be:

A central database server with:
- List of datasets
- Computed DAS and DDS (optionally, otherwise inserted on first access)
- Stored hidefix index blob
- Aggregated coordinate dimension
Each server has a fairly large local memorymapped (or just in-memory if enough RAM) LRU cache of hidefix-indexes and DAS, DDS'es. Otherwise fetch or compute on database server.

Then we are less dependent on the latency to the database server which seems to be in the 100s of ms range for a kubernetes cluster, but we can still use a standard setup.

gauteh / dars

Data-discovery and index #19