Closed bennybp closed 7 months ago
After testing this, the way records are cached (with all their child records) is not going to work. It works for singlepoints, but torsiondrives are way too big. So this is going to need a bit more work.

I think the solution is to store individual records (without children) in a separate table, and store foreign keys in the current `record_data` table. Probably need to rename some of these to something like `dataset_records`.
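A minimal sketch of how that split might look, using Python's stdlib `sqlite3`. This is purely illustrative: the table and column names below are assumptions, not the actual qcportal schema.

```python
import sqlite3

# Illustrative only: records stored once (without their children) in one
# table, and a mapping table holding foreign keys into it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record_data (
        id     INTEGER PRIMARY KEY,
        record BLOB NOT NULL              -- serialized record, no child records
    );
    CREATE TABLE dataset_records (        -- hypothetical renamed mapping table
        entry_name    TEXT NOT NULL,
        specification TEXT NOT NULL,
        record_id     INTEGER NOT NULL REFERENCES record_data(id),
        PRIMARY KEY (entry_name, specification)
    );
""")
conn.execute("INSERT INTO record_data VALUES (1, ?)", (b"...",))
conn.execute("INSERT INTO dataset_records VALUES ('mol1', 'default', 1)")
row = conn.execute(
    "SELECT r.record FROM dataset_records d "
    "JOIN record_data r ON r.id = d.record_id "
    "WHERE d.entry_name = 'mol1'"
).fetchone()
```

A record shared by several entries (or several datasets) would then be stored once and referenced many times, which is the point of the split.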
I've been playing around with this today, and it's great! Very intuitive. Two comments along with their importance out of 10 (I wouldn't consider either one blocking).
The code that I'm playing with is:
```python
import qcportal

client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# the next two lines didn't immediately do what I wanted, so I ran the loops below
# ds.fetch_entries()
# ds.fetch_records(include=["optimizations"], force_refetch=True)

for entry in ds.iterate_entries():
    entry

for record in ds.iterate_records():
    for angle, opt in record[2].minimum_optimizations.items():
        opt.final_molecule
```
The resulting cache file size is pretty reasonable:

```
(bespokefit) jw@mba$ ls -lrth cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
-rw-r--r-- 1 jeffreywagner staff 13M Feb 16 15:32 cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
```
Then in a separate interpreter (and with minor changes to qcsubmit):
```python
from qcportal import dataset_models

ds2 = dataset_models.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")

from openff.qcsubmit.results import TorsionDriveResultCollection

tdrc = TorsionDriveResultCollection.from_datasets([ds2])
tdrc
```

```
TorsionDriveResultCollection(entries={'local': [TorsionDriveResult(type='torsion', record_id=119138412, cmiles='[H:12][C:1](=[C:3]1[C:4](=[O:9])[N:8]([C@@:6]([C:5](=[O:10])[N:7]1[H:18])([H:17])[O:11][C:2]([H:14])([H:15])[H:16])[H:19])[H:13]', inchi_key='BFHIBVQMZBLHGM-OCSBBNMYNA-N'), TorsionDriveResult(type='torsion', ...
```
Success!!
Glad it's working so far!
> * (5/10) "cache" makes me think that this will start overwriting itself in some conditions. I'd like to have a mode that's like "I have infinite disk space, don't limit the cache size". Is this the way it works currently/could this mode be added? If not, could the behavior be documented?

At the moment, there is no limit to the cache (basically the same as if you set the max size to `None`). I haven't added the logic for finite cache sizes yet. But yes, docs always need to be written.
> * (2/10) It's a little incongruous that the API has me specify `cache_dir`, but then to load things using `dataset_from_cache` I need to provide a file path _inside_ that cache dir, with no API point that tells me the path (as in, I have to go to my file system and `ls` around to get the path to the cache file).

If you create a client with the same `cache_dir`, then it should automatically find the existing cache files when you use `get_dataset`. At least that is the intent. The `dataset_from_cache` function is more for if you want to pass around/download the cache file separately (kind of like the old 'views').
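In the meantime, the path can be reconstructed from the layout seen in the `ls` output above. A hypothetical helper (this is not part of qcportal, and the `<host>_<port>/dataset_<id>.sqlite` layout is inferred from that one example):

```python
from pathlib import Path
from urllib.parse import urlparse


def dataset_cache_path(cache_dir: str, server_url: str, dataset_id: int) -> Path:
    # Hypothetical helper mirroring the observed layout:
    # <cache_dir>/<host>_<port>/dataset_<id>.sqlite
    parts = urlparse(server_url)
    return (Path(cache_dir)
            / f"{parts.hostname}_{parts.port}"
            / f"dataset_{dataset_id}.sqlite")


print(dataset_cache_path("./cache2", "https://api.qcarchive.molssi.org:443", 378))
# cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
```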
I'm going to go ahead and merge this. There's still some tasks to be done before the next release, but it seems to be working well.
The main reason is I have another feature being built on top of this and leaving this open makes it a bit complicated.
Description
Previously, dataset information was not cached locally at all. So rerunning a script, or just calling `client.get_dataset` again, could require re-fetching all data, even if it had been fetched before.

This PR implements this caching. All storage of records is now in an SQLite database, either in a file or in memory. Some care has been taken in trying to keep the cache up-to-date as much as possible, but I am sure there are still loopholes. This includes records writing themselves back to the cache when they have been updated with additional data (for example, fetching molecules or trajectories).
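As a toy illustration of that write-back idea (this is not the actual qcportal implementation; the class and table names are made up):

```python
import json
import sqlite3


class CachedRecord:
    """Toy model of a record that writes itself back to the cache
    whenever it gains additional data."""

    def __init__(self, conn: sqlite3.Connection, record_id: int, data: dict):
        self._conn = conn
        self.id = record_id
        self.data = data

    def attach(self, key: str, value) -> None:
        self.data[key] = value
        # Write-back: persist the updated record to the SQLite cache immediately,
        # so a later session sees the extra data without re-fetching.
        self._conn.execute("UPDATE records SET body = ? WHERE id = ?",
                           (json.dumps(self.data), self.id))
        self._conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO records VALUES (1, '{}')")
rec = CachedRecord(conn, 1, {})
rec.attach("final_molecule", "...")  # e.g. after fetching a molecule
print(conn.execute("SELECT body FROM records WHERE id = 1").fetchone()[0])
# {"final_molecule": "..."}
```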
There are a few ways to use:

* Pass a `cache_dir` parameter when creating a client. This will then automatically create SQLite files for each dataset, and re-use them as long as the same `cache_dir` is used in subsequent client construction.
* A `dataset_from_cache` function where you can pass a file directly (ie, downloaded out-of-band).
* A `dataset_from_cache` function in `dataset_models.py` that works similarly, but will result in an offline dataset object completely disconnected from any server.

This is purely a client-side change, so this branch will work with the currently-deployed MolSSI QCArchive servers.
There is still some polishing to be done (and docs), but I am looking for feedback and any bugs before merging.
See #740
Todos and missing features:
* `refresh_cache` needs to be finished

Changelog description
Implement client-side caching of datasets
Status