MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.
https://molssi.github.io/QCFractal/
BSD 3-Clause "New" or "Revised" License

Occasional failures in `get_records_with_cache` #844

Open ntBre opened 1 week ago

Describe the bug

Following up on our discussion from today's meeting, I tried using qcportal.cache.get_records_with_cache to cache record downloads outside of datasets. A very simple script like the following occasionally fails:

import shutil
from pathlib import Path

from qcportal import PortalClient
from qcportal.cache import RecordCache, get_records_with_cache
from qcportal.optimization import OptimizationRecord

addr = "https://api.qcarchive.molssi.org:443/"
cache_dir = Path("api.qcarchive.molssi.org_443")

if cache_dir.exists():
    shutil.rmtree(cache_dir)

client = PortalClient(addr, cache_dir=".")
record_cache = RecordCache(f"{client.cache.cache_dir}/records.sqlite", False)
r1 = get_records_with_cache(
    client, record_cache, OptimizationRecord, [137149103]
)
r2 = get_records_with_cache(
    None, record_cache, OptimizationRecord, [137149103]
)

This often leads to the error below:

Traceback (most recent call last):
  File "/home/brent/omsf/scratch/qcportal-cache/simple.py", line 19, in <module>
    get_records_with_cache(
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/cache.py", line 653, in get_records_with_cache
    raise RuntimeError("Need to fetch some records, but not connected to a client")
RuntimeError: Need to fetch some records, but not connected to a client

However, if I perturb the script slightly (for example, by deleting the r1 and r2 assignments but keeping the function calls), it runs successfully. This suggests some kind of timing or concurrency issue in how the records are committed to the database, but adding a time.sleep call between the two calls didn't help, so that guess might be totally off base.

I also ran this code in a loop to see how often it fails:

import shutil
from pathlib import Path

from qcportal import PortalClient
from qcportal.cache import RecordCache, get_records_with_cache
from qcportal.optimization import OptimizationRecord

addr = "https://api.qcarchive.molssi.org:443/"
cache_dir = Path("api.qcarchive.molssi.org_443")

if cache_dir.exists():
    shutil.rmtree(cache_dir)

failed = 0
for i in range(100):
    client = PortalClient(addr, cache_dir=".")
    record_cache = RecordCache(
        f"{client.cache.cache_dir}/records.sqlite", False
    )
    r1 = get_records_with_cache(
        client, record_cache, OptimizationRecord, [137149103]
    )
    try:
        r2 = get_records_with_cache(
            None, record_cache, OptimizationRecord, [137149103]
        )
    except RuntimeError as e:
        assert "Need to fetch some records" in str(e)
        print(f"failed on iter {i}")
        failed += 1
        continue
    assert r1 == r2, f"mismatch on iter {i}"

print(failed)

As written, it always fails on the first iteration and then successfully reads from the existing cache on the other 99 iterations. However, if I move the rmtree call inside the loop, I instead get 100 instances of this error:

Traceback (most recent call last):
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/record_models.py", line 437, in __del__
    self.sync_to_cache(True)  # Don't really *have* to detach, but why not
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/record_models.py", line 523, in sync_to_cache
    self._record_cache.writeback_record(self)
  File "/home/brent/mambaforge/envs/qcsubmit-test-basic/lib/python3.11/site-packages/qcportal/cache.py", line 169, in writeback_record
    self._conn.execute(stmt, row_data)
  File "src/cursor.c", line 169, in resetcursor
apsw.ReadOnlyError: ReadOnlyError: attempt to write a readonly database
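
The traceback above shows the cache write being triggered from a record's `__del__`, so object lifetime may be what differs between the failing and passing variants of the first script. The sketch below is a toy illustration of that idea only: `ToyRecord`, `cached_ids`, and `demo` are hypothetical names, the standard-library `sqlite3` module stands in for the real cache, and the assumption that the write happens only on finalization is a guess that has not been checked against the qcportal source.

```python
import gc
import sqlite3


class ToyRecord:
    """Hypothetical stand-in for a record that writes itself back when finalized."""

    def __init__(self, record_id, conn):
        self.record_id = record_id
        self._conn = conn

    def __del__(self):
        # Assumption: the cache row is written only here, mirroring the
        # __del__ -> sync_to_cache -> writeback_record frames in the traceback.
        self._conn.execute(
            "INSERT OR REPLACE INTO records (id) VALUES (?)", (self.record_id,)
        )


def cached_ids(conn):
    """Return the ids currently visible in the toy cache table."""
    return [row[0] for row in conn.execute("SELECT id FROM records ORDER BY id")]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")

    held = ToyRecord(1, conn)         # bound to a name, so it stays alive
    before = cached_ids(conn)         # nothing has been written yet

    ToyRecord(2, conn)                # result discarded, finalized right away in CPython
    gc.collect()                      # nudge for interpreters without refcounting
    after_discard = cached_ids(conn)

    del held                          # dropping the last reference triggers the write
    gc.collect()
    after_del = cached_ids(conn)

    conn.close()
    return before, after_discard, after_del
```

In this toy model, a record that is still bound to a name (like r1 in the first script) has not reached the cache when the next lookup runs, while a discarded return value has. That would be consistent with the script passing once the assignments are removed, but it is only a hypothesis.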

To Reproduce

Run either snippet above. I sent the first one to Jeff (@j-wags), who could also reproduce it, so it is at least not specific to my machine.

Expected behavior

I expect both calls to get_records_with_cache to return the same record without throwing an exception. The first call should populate the cache and the second should read from it.
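
To make the expected semantics concrete, here is a minimal read-through cache sketch using the standard-library `sqlite3` module. It is not how qcportal implements `get_records_with_cache`: `toy_get_with_cache`, `make_cache`, the `fetch` callable, and the table layout are all hypothetical, chosen only to show a first call that populates the cache and a second, client-less call that reads from it.

```python
import sqlite3


def make_cache():
    """Create an in-memory toy cache with one table of (id, payload) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    return conn


def toy_get_with_cache(fetch, conn, record_ids):
    """Return {id: payload}, reading cached rows first and fetching the rest.

    `fetch` is a hypothetical callable taking a record id, or None when offline.
    """
    found = {}
    for rid in record_ids:
        row = conn.execute(
            "SELECT payload FROM records WHERE id = ?", (rid,)
        ).fetchone()
        if row is not None:
            found[rid] = row[0]

    missing = [rid for rid in record_ids if rid not in found]
    if missing:
        if fetch is None:
            raise RuntimeError(
                "Need to fetch some records, but not connected to a client"
            )
        for rid in missing:
            payload = fetch(rid)
            # Write through immediately so a later offline call can succeed.
            conn.execute(
                "INSERT OR REPLACE INTO records (id, payload) VALUES (?, ?)",
                (rid, payload),
            )
            found[rid] = payload
        conn.commit()
    return found
```

With this toy version, calling `toy_get_with_cache(fetch, conn, [137149103])` and then `toy_get_with_cache(None, conn, [137149103])` returns the same mapping both times, which is the behavior I expected from the real function.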

Additional context

This is on the most recent version of qcportal, 0.55. I can upload my full conda environment if needed.