MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License

Local data persistence & caching #740

Open chrisiacovella opened 11 months ago

chrisiacovella commented 11 months ago

This issue is to sketch out some ideas and start a discussion related to retrieving and saving datasets. This follows from some prior discussion during working group meetings.

Local caching of records:

When accessing records from the archive, it would be very helpful to be able to store this data locally in a cache. It seems like this could come in two distinct flavors:

Automatic caching.

I'm looking at the current source code and there appears to be some framework already in place (but maybe not yet implemented?) that relies upon DBM for the automatic caching. If implemented, this would allow QCPortal to check the local cache to see if a given record from a specified server has already been retrieved, and if so, use the local version. This would certainly be very beneficial since it would mean that for many users, rerunning a python script or restarting a notebook kernel would not require re-downloading data. However, the actual performance will depend upon the amount of memory allocated to the cache and the size of a given dataset a user is working with.
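As a rough sketch of the idea (not QCPortal's actual implementation), a DBM-backed cache keyed on server address and record id might look like the following; the wrapper function, cache file, and use of pickle are all assumptions:

    # Sketch only: check a local DBM cache before asking the server.
    # The cache file, key scheme, and use of pickle are assumptions,
    # not QCPortal's actual caching code.
    import dbm
    import pickle

    def get_record_cached(client, record_id, cache_path="qcportal_cache"):
        key = f"{client.address}:{record_id}".encode()
        with dbm.open(cache_path, "c") as cache:
            if key in cache:
                return pickle.loads(cache[key])      # cache hit: no network call
            record = client.get_records(record_id)   # assumes single id -> single record
            cache[key] = pickle.dumps(record)
            return record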

User-defined caching.

This would provide the same basic functionality as the automatic caching, but would allow a user to define the location of the cache database, which by default would have no maximum size limit. This would be beneficial to users working with, say, entire datasets. For example, when working with the QM9 dataset, I would only like to download the records once and be able to store them locally for ease of access later; I don't want to have to worry about the dataset records being purged (due to downloading other data from qcarchive) or the dataset simply being larger than the default memory allocation. In my own work, I've implemented a simple wrapper around the calls to QCPortal where each record is saved into an SqliteDict database, and this has been very helpful, especially in cases where I lose connection to the database.

Ability to download entire datasets:

Some of the datasets in the older version included HDF5 files (that could be downloaded either via the portal or from zenodo). This allowed an entire dataset to be downloaded very efficiently. As an example, it took about 5 minutes to download QM9 in HDF5 format (~160 MB gzipped) for ~133K records; fetching these records one at a time (using the new code) took more than 12 hours. Having a way to download an entire dataset in one file would be very helpful.

bennybp commented 11 months ago

There are a couple places to do caching, and we might want several:

  1. In the client itself. This is for various functions like get_records, get_singlepoints, etc.
  2. Caching inside the dataset (basically replacing the internal entries, specification, and record storage: https://github.com/MolSSI/QCFractal/blob/main/qcportal/qcportal/dataset_models.py#L88-L90).

Internally, datasets call the get_ functions on the client, so we might want some flags to avoid storing data in two caches: https://github.com/MolSSI/QCFractal/blob/e8d9cba50b1e59bf8ff85992cd9dc8f94158fe1b/qcportal/qcportal/dataset_models.py#L655

I think I also agree that this could be exploited for quickly downloading datasets. Some care always has to be taken with cache invalidation. There already is some checking that is done when fetching records, but would need some functionality for merging/purging data that is changed on the server.

The existing (very prototype) code for this ("views") stores info in an sqlite database, with entry/specification being keys, and the record data itself being stored as zstandard-compressed blobs. For a general cache, where the user is also doing the compression, we might want to turn the compression level down, but zstd is very fast in both directions.
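As a rough illustration of that storage pattern (the table layout and function names below are made up, not the actual view schema), zstandard-compressed blobs keyed on entry and specification could be stored along these lines:

    # Illustrative only: store zstandard-compressed record blobs in sqlite,
    # keyed by (entry, specification). This is not the actual "view" schema.
    import sqlite3
    import zstandard

    conn = sqlite3.connect("dataset_cache.sqlite")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(entry TEXT, spec TEXT, data BLOB, PRIMARY KEY (entry, spec))"
    )

    compressor = zstandard.ZstdCompressor(level=3)   # lower level = faster, larger output
    decompressor = zstandard.ZstdDecompressor()

    def store_record(entry, spec, record_json: bytes):
        conn.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
            (entry, spec, compressor.compress(record_json)),
        )
        conn.commit()

    def load_record(entry, spec):
        row = conn.execute(
            "SELECT data FROM records WHERE entry = ? AND spec = ?",
            (entry, spec),
        ).fetchone()
        return decompressor.decompress(row[0]) if row else None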

That being said, I was not aware of sqldict. Is this what you are using? https://github.com/RaRe-Technologies/sqlitedict. With that, it could be possible to basically remove the idea of a "view" and just make everything the cache instead.

chrisiacovella commented 11 months ago

Yes, that is the package I had used for my persistent caching of the records I downloaded. It worked well and really simplified access since I am not very experienced with sqlite. Storing the records ends up being really just as simple as:

from sqlitedict import SqliteDict

with SqliteDict("example.sqlite", autocommit=True) as db:
    for record_info in iterate_records(specification_name='spec_2'):
        db[record_info.dict()['id']] = record_info

I could imagine having something like

iterate_records(specification_name='spec_2', local_database='/path/to/example.sqlite')

Where iterate_records would check the specified database for a record before fetching from the server, and store anything it fetched into that database (probably not too different from what is already in there, just allowing the user to set where to store it, without capping the number of records).
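A sketch of roughly that behavior, written as an external helper rather than a change to iterate_records; fetch_from_server and the database path are placeholders:

    # Hypothetical helper illustrating the proposal above; fetch_from_server
    # stands in for whatever portal call actually retrieves a record.
    from sqlitedict import SqliteDict

    def get_record_local_first(record_id, fetch_from_server,
                               local_database='/path/to/example.sqlite'):
        with SqliteDict(local_database, autocommit=True) as db:
            key = str(record_id)
            if key in db:
                return db[key]            # already cached locally
            record = fetch_from_server(record_id)
            db[key] = record              # persist for next time
            return record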

bennybp commented 11 months ago

I have some very preliminary code with basic functionality if you would like to try it. It's in the qcportal_caching branch: https://github.com/MolSSI/QCFractal/tree/qcportal_caching

Basically:

Issues:

chrisiacovella commented 11 months ago

Excellent. I will check it out and report back.

chrisiacovella commented 10 months ago

I've been testing this out. So far so good.

A few small notes.

This line:

https://github.com/MolSSI/QCFractal/blob/8dc5342ebad4d56301cb807d51500323b3e1cd9c/qcportal/qcportal/client.py#L147

I think this line needs to be modified to be:

        self.cache = PortalCache(self.address, cache_dir, cache_max_size)

to make sure it uses the sanitized version of the address that sticks https:// on it if not provided. As it is now, if I were to pass, say, "ml.qcarchive.molssi.org", the server fingerprint ends up being "None_None".

In terms of speed compared to a local dictionary cache, I got the same performance on my machine. When I implemented the sqlitedict wrapping in my code, I found that converting the keys in the sqlite database to a set substantially sped things up (since sqlitedict just emulates a dictionary-like interface, without the lookup performance you get from an actual dictionary). However, this might not be a huge issue worth worrying about, as it will still be faster than fetching fresh, and this time is pretty minimal for the larger datasets (the 2 minutes rather than 30 seconds it takes to search through the sqlitedict keys is not a big deal when it takes over an hour to get the records anyway).
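For reference, the key-set trick is just loading the keys into a Python set once and testing membership against the set instead of the database; ids_to_fetch and fetch_from_server below are placeholders:

    # Sketch of the key-set trick: load the keys once into a Python set and
    # test membership there, instead of repeated lookups against sqlite.
    from sqlitedict import SqliteDict

    with SqliteDict("example.sqlite") as db:
        cached_ids = set(db.keys())                 # one pass over the database
        for record_id in ids_to_fetch:              # placeholder iterable of ids
            key = str(record_id)
            if key in cached_ids:
                record = db[key]                    # fast path: already cached
            else:
                record = fetch_from_server(record_id)   # placeholder fetch call
                db[key] = record
                cached_ids.add(key)
        db.commit()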

bennybp commented 7 months ago

The big PR is up for testing #802 . Give it a shot and let me know what you think/how it works