MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/

Error fetching number of entries if only iterating over part of a list #817

Closed chrisiacovella closed 7 months ago

chrisiacovella commented 7 months ago

Describe the bug

The CI in one of my repos started pulling v0.54 of qcportal rather than v0.53, which caused failures when retrieving the list of entries in a dataset.

Basically, if you iterate over only a subset of the entries, the next time you access entry_names on the dataset it contains only the entries that have already been fetched, unless you first call fetch_entry_names().

To Reproduce

from qcportal import PortalClient

dataset_type = "singlepoint"
qcarchive_server = "ml.qcarchive.molssi.org"
client = PortalClient(qcarchive_server)

dataset_name = "SPICE PubChem Set 1 Single Points Dataset v1.2"

# run this the first time
ds = client.get_dataset(dataset_type=dataset_type, dataset_name=dataset_name)

entry_names = ds.entry_names
print("first pass, number of entries: ", len(entry_names))
n_to_fetch = 3
to_fetch = entry_names[0:n_to_fetch]

for entry in ds.iterate_entries(to_fetch, force_refetch=True):
    pass

# run a second time
ds = client.get_dataset(dataset_type=dataset_type, dataset_name=dataset_name)

entry_names = ds.entry_names
print("second pass, number of entries: ", len(entry_names))
n_to_fetch = 3
to_fetch = entry_names[0:n_to_fetch]

for entry in ds.iterate_entries(to_fetch, force_refetch=True):
    pass

# run a third time, but this time call ds.fetch_entry_names() first

ds = client.get_dataset(dataset_type=dataset_type, dataset_name=dataset_name)
ds.fetch_entry_names()

entry_names = ds.entry_names
print("third pass, with fetch_entry_names, number of entries: ", len(entry_names))
n_to_fetch = 3
to_fetch = entry_names[0:n_to_fetch]

for entry in ds.iterate_entries(to_fetch, force_refetch=True):
    pass

The output from this:

first pass, number of entries:  118606
second pass, number of entries:  3
third pass, with fetch_entry_names, number of entries:  118606

Expected behavior

I would expect ds.entry_names to return all of the entry names on the server, as it did previously. Alternatively, two separate attributes or functions might make the distinction clearer, e.g. entry_names_local and entry_names_server instead of a single entry_names.
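
In the meantime, the third pass above suggests a workaround: refresh the names from the server before reading them. A minimal sketch, assuming only the two dataset members already used in the reproduction (fetch_entry_names() and entry_names); the helper name all_entry_names is hypothetical and not part of qcportal:

```python
def all_entry_names(ds):
    """Return the full list of entry names for a dataset object.

    Hypothetical helper, not part of qcportal. It relies only on
    ds.fetch_entry_names() and ds.entry_names, as used in the
    reproduction script.
    """
    # Refresh the names from the server first, so that a partially
    # filled local cache cannot shadow the full list.
    ds.fetch_entry_names()
    # Copy into a plain list so later cache changes do not affect the result.
    return list(ds.entry_names)
```

With this, the second pass would use to_fetch = all_entry_names(ds)[0:n_to_fetch] instead of slicing ds.entry_names directly. The extra round trip to the server is the cost of not trusting the cached names.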

bennybp commented 7 months ago

Yep, this is over-zealous use of the cache. I was expecting some of these issues to pop up.

I think we should always work under the assumption that we can fetch all the names of the entries and all the specifications without too much issue, even for large datasets.
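
For readers following along, the failure mode can be pictured with a toy model. This is only an illustration, not QCFractal's actual code: ToyDataset, entry_names_buggy, and entry_names_always_fetch are hypothetical names, and the shared dict stands in for a cache that persists between get_dataset calls.

```python
class ToyDataset:
    """Toy stand-in for a dataset backed by a persistent local cache."""

    def __init__(self, server_names, cache):
        self._server_names = list(server_names)  # names held by the "server"
        self._cache = cache                      # name -> entry, shared across instances

    def iterate_entries(self, names):
        # Fetching an entry also stores it in the local cache.
        for name in names:
            self._cache[name] = {"name": name}
            yield self._cache[name]

    @property
    def entry_names_buggy(self):
        # Over-zealous: trust the cache whenever it holds anything at all.
        if self._cache:
            return list(self._cache)
        return list(self._server_names)

    @property
    def entry_names_always_fetch(self):
        # Assume the full name list is always cheap enough to ask the server for.
        return list(self._server_names)


server = [f"mol-{i}" for i in range(10)]
shared_cache = {}

# First pass: the cache is empty, so all 10 names are returned.
ds = ToyDataset(server, shared_cache)
first = ds.entry_names_buggy
for _ in ds.iterate_entries(first[:3]):
    pass

# Second pass: the cache now holds 3 entries and shadows the server list.
ds2 = ToyDataset(server, shared_cache)
second = ds2.entry_names_buggy
fixed = ds2.entry_names_always_fetch
```

Here first has 10 names, second has only 3, and fixed has 10 again, which mirrors the 118606 / 3 / 118606 pattern in the report.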

I will add this to #816 for a 0.54.1 release tomorrow.