Closed bennybp closed 7 months ago
After testing this, the way records are cached (with all their child records) is not going to work. It works for singlepoints, but torsiondrives are way too big. So this is going to need a bit more work.

I think the solution is to store individual records (without children) in a separate table, and store foreign keys in the current `record_data` table. Probably need to rename some of these to something like `dataset_records`.
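A minimal sketch of how that split might look, using Python's stdlib `sqlite3`. This is purely illustrative: the table and column names below are assumptions, not the actual qcportal schema.

```python
import sqlite3

# Illustrative only: records stored once (without their children) in one
# table, and a mapping table holding foreign keys into it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record_data (
        id     INTEGER PRIMARY KEY,
        record BLOB NOT NULL              -- serialized record, no child records
    );
    CREATE TABLE dataset_records (        -- hypothetical renamed mapping table
        entry_name    TEXT NOT NULL,
        specification TEXT NOT NULL,
        record_id     INTEGER NOT NULL REFERENCES record_data(id),
        PRIMARY KEY (entry_name, specification)
    );
""")
conn.execute("INSERT INTO record_data VALUES (1, ?)", (b"...",))
conn.execute("INSERT INTO dataset_records VALUES ('mol1', 'default', 1)")
row = conn.execute(
    "SELECT r.record FROM dataset_records d "
    "JOIN record_data r ON r.id = d.record_id "
    "WHERE d.entry_name = 'mol1'"
).fetchone()
```

A record shared by several entries (or several datasets) would then be stored once and referenced many times, which is the point of the split.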
I've been playing around with this today, and it's great! Very intuitive. Two comments along with their importance out of 10 (I wouldn't consider either one blocking).
The code that I'm playing with is:
```python
import qcportal

client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# the next two lines didn't immediately do what I wanted, so I ran the loops below
# ds.fetch_entries()
# ds.fetch_records(include=["optimizations"], force_refetch=True)

for entry in ds.iterate_entries():
    entry

for record in ds.iterate_records():
    for angle, opt in record[2].minimum_optimizations.items():
        opt.final_molecule
```
The resulting cache file size is pretty reasonable:

```
(bespokefit) jw@mba$ ls -lrth cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
-rw-r--r-- 1 jeffreywagner staff 13M Feb 16 15:32 cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
```
Then in a separate interpreter (and with minor changes to qcsubmit):
```python
from qcportal import dataset_models

ds2 = dataset_models.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")

from openff.qcsubmit.results import TorsionDriveResultCollection

tdrc = TorsionDriveResultCollection.from_datasets([ds2])
tdrc
```

```
TorsionDriveResultCollection(entries={'local': [TorsionDriveResult(type='torsion', record_id=119138412, cmiles='[H:12][C:1](=[C:3]1[C:4](=[O:9])[N:8]([C@@:6]([C:5](=[O:10])[N:7]1[H:18])([H:17])[O:11][C:2]([H:14])([H:15])[H:16])[H:19])[H:13]', inchi_key='BFHIBVQMZBLHGM-OCSBBNMYNA-N'), TorsionDriveResult(type='torsion', ...
```
Success!!
Glad it's working so far!
> * (5/10) "cache" makes me think that this will start overwriting itself in some conditions. I'd like to have a mode that's like "I have infinite disk space, don't limit the cache size". Is this the way it works currently/could this mode be added? If not, could the behavior be documented?

At the moment, there is no limit to the cache (basically the same as if you set the max size to `None`). I haven't added the logic for finite cache sizes yet. But yes, docs always need to be written.
> * (2/10) It's a little incongruous that the API has me specify `cache_dir`, but then to load things using `dataset_from_cache` I need to provide a file path _inside_ that cache dir, with no API point that tells me the path (as in, I have to go to my file system and `ls` around to get the path to the cache file).

If you create a client with the same `cache_dir`, then it should automatically find the existing cache files when you use `get_dataset`. At least that is the intent. The `dataset_from_cache` function is more for if you want to pass around/download the cache file separately (kind of like the old 'views').
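In the meantime, the path can be reconstructed from the layout seen in the `ls` output above. A hypothetical helper (this is not part of qcportal, and the `<host>_<port>/dataset_<id>.sqlite` layout is inferred from that one example):

```python
from pathlib import Path
from urllib.parse import urlparse


def dataset_cache_path(cache_dir: str, server_url: str, dataset_id: int) -> Path:
    # Hypothetical helper mirroring the observed layout:
    # <cache_dir>/<host>_<port>/dataset_<id>.sqlite
    parts = urlparse(server_url)
    return (Path(cache_dir)
            / f"{parts.hostname}_{parts.port}"
            / f"dataset_{dataset_id}.sqlite")


print(dataset_cache_path("./cache2", "https://api.qcarchive.molssi.org:443", 378))
# cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
```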
I'm going to go ahead and merge this. There's still some tasks to be done before the next release, but it seems to be working well.
The main reason is I have another feature being built on top of this and leaving this open makes it a bit complicated.
Description
Previously, dataset information was not cached locally at all. So rerunning a script, or just calling `client.get_dataset` again, could require re-fetching all data, even if it had been fetched before.

This PR implements this caching. All storage of records is now in an SQLite database, either in a file or in memory. Some care has been taken in trying to keep the cache up-to-date as much as possible, but I am sure there are still loopholes. This includes records writing themselves back to the cache when they have been updated with additional data (for example, fetching molecules or trajectories).
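As a toy illustration of that write-back idea (this is not the actual qcportal implementation; the class and table names are made up):

```python
import json
import sqlite3


class CachedRecord:
    """Toy model of a record that writes itself back to the cache
    whenever it gains additional data."""

    def __init__(self, conn: sqlite3.Connection, record_id: int, data: dict):
        self._conn = conn
        self.id = record_id
        self.data = data

    def attach(self, key: str, value) -> None:
        self.data[key] = value
        # Write-back: persist the updated record to the SQLite cache immediately,
        # so a later session sees the extra data without re-fetching.
        self._conn.execute("UPDATE records SET body = ? WHERE id = ?",
                           (json.dumps(self.data), self.id))
        self._conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO records VALUES (1, '{}')")
rec = CachedRecord(conn, 1, {})
rec.attach("final_molecule", "...")  # e.g. after fetching a molecule
print(conn.execute("SELECT body FROM records WHERE id = 1").fetchone()[0])
# {"final_molecule": "..."}
```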
There are a few ways to use:

* Pass a `cache_dir` parameter when creating a client. This will then automatically create SQLite files for each dataset, and re-use them as long as the same `cache_dir` is used in subsequent client construction.
* A `dataset_from_cache` function where you can pass a file directly (ie, downloaded out-of-band).
* A `dataset_from_cache` function in `dataset_models.py` that works similarly, but will result in an offline dataset object completely disconnected from any server.

This is purely a client-side change, so this branch will work with the currently-deployed MolSSI QCArchive servers.
There is still some polishing to be done (and docs), but I am looking for feedback and any bugs before merging.
See #740
Todos and missing features:
* `refresh_cache` needs to be finished

Changelog description
Implement client-side caching of datasets
Status