c3-time-domain / SeeChange

A time-domain data reduction pipeline (e.g., turning images into lightcurves) for surveys like DECam and LS4
BSD 3-Clause "New" or "Revised" License

Restoring database objects from the cache handles id incorrectly #339

Closed rknop closed 1 month ago

rknop commented 2 months ago

The cache (in util/cache.py) is used in our tests so that things that are expensive to compute don't have to be recomputed repeatedly when used in multiple tests. This also, ideally, allows us to reuse these in subsequent test runs.

The cache works by saving database fields to JSON files. However, this includes automatically-set fields like dates (no big deal) and id (which is bad). id is assigned by a counter in the database, so setting it manually is always bad behavior: it subverts the database's counter, which is what guarantees that ids remain unique.

Here's a failure mode:

You start with a fresh database and an empty cache. Run the tests (all of them, or just, e.g., pipeline/test_pipeline.py::test_datastore_delete_everything). The tests run, and a number of cached objects are created.

Next, reinitialize your database (create it from scratch). This resets all the automatic counters. Run the tests again. This time, some objects are restored from the cache, and their id values are set manually; this means that when they're inserted into the database, the database's id counter is not incremented. Eventually you insert another image that does use the database counter, it happens to hand out a number you already loaded manually, and you end up with a unique constraint violation on the id.

(Subsequent runs without reinitializing the database (just emptying it) then seem to work, not because the bad behavior is gone, but because the database counters are high enough that we're past all of the ids that have been cached, so the collision doesn't happen again. Still, when we use an auto-incrementing counter to assign ids, we should never set them manually.)
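The failure mode above can be simulated in a few lines of plain Python. This is a toy stand-in for the database, not real SQL: the class mimics a PostgreSQL-style sequence that advances only when it hands out an id, and the set stands in for the table's unique constraint.

```python
# Toy simulation of the failure mode: manually set ids bypass (and so fail
# to advance) an auto-increment counter, setting up a later collision.

class ToySequence:
    """Mimics a database sequence: advances only when asked for a value."""
    def __init__(self):
        self.next_id = 1
    def nextval(self) -> int:
        val = self.next_id
        self.next_id += 1
        return val

seq = ToySequence()
table_ids = set()  # stands in for the table's unique-id constraint

def insert(manual_id=None) -> int:
    new_id = manual_id if manual_id is not None else seq.nextval()
    if new_id in table_ids:
        raise ValueError(f"unique constraint violation on id {new_id}")
    table_ids.add(new_id)
    return new_id

insert(manual_id=1)  # id restored from cache: the sequence does NOT advance
try:
    insert()         # auto-assignment: the sequence hands out 1 again
    collision = None
except ValueError as e:
    collision = str(e)
```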

I tried a quick patch: have the cache-loading code always set the id fields to None. However, this led to a later DetachedInstanceError in SQLAlchemy that I didn't track down (and which probably isn't worth the effort).

THE WORKAROUND: whenever you reinitialize your database (e.g. with docker compose down -v), also completely wipe out the cache in data/cache. This is probably overkill, since there are some downloaded exposures and such that you probably don't need to wipe out, but it will avoid the database collisions. This should solve the problem most of the time. (If you re-run tests without reinitializing the database, you'll probably be OK, but the code is still behaving badly by manually setting ids; you're just lucky not to hit the collisions, as described above.)
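Concretely, assuming the cache lives at data/cache as described above, the reset might look like this (the docker line is guarded so the snippet is a no-op on machines without docker):

```shell
# Reinitialize the database, dropping its volume (and thus its id counters).
command -v docker >/dev/null 2>&1 && docker compose down -v || true

# Wipe the test cache so no stale cached ids get restored on the next run.
rm -rf data/cache
```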

rknop commented 2 months ago

The cache has a further problem: it also caches foreign-key ids. There's no guarantee, when rereading from the cache, that the object pointed to by a foreign key will have the same id it had when the cache was created. Many of the tests that use the cache explicitly reset these relationships.
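A typical reset might be sketched as follows. This is hypothetical, not the actual test code, and "provenance_id" is just an illustrative column name:

```python
# Hypothetical sketch: after restoring an object from the cache, discard the
# cached id entirely and re-point foreign keys at the objects that actually
# exist in *this* database instance.

def refresh_restored_fields(cached: dict, current_provenance_id) -> dict:
    fields = dict(cached)
    fields.pop("id", None)                           # let the DB assign a fresh id
    fields["provenance_id"] = current_provenance_id  # cached FK may be stale
    return fields
```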

rknop commented 2 months ago

One solution to this would be to move from integer ids to uuids. Then it would be just fine to save the cached uuid, because the whole premise of uuids is that they're unique anyway.

This would add 8 bytes to every row of every table: uuids require 16 bytes, whereas big integers require only 8. It might also have implications for indexing and search speed.
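Why this fixes the caching problem, in a stdlib-only sketch (not the actual schema change): uuids are generated client-side, so there is no shared counter to subvert, and in practice a cached uuid cannot collide with a freshly generated one.

```python
import uuid

# With uuid primary keys, the id is assigned at object-creation time rather
# than by a database counter, so caching and restoring it is harmless.

def new_object_fields() -> dict:
    # Hypothetical helper: a new object gets its id immediately, client-side.
    return {"id": str(uuid.uuid4())}

a = new_object_fields()
b = new_object_fields()

# The 16-byte storage cost mentioned above:
uuid_size = len(uuid.uuid4().bytes)
```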

rknop commented 1 month ago

No longer relevant; database refactor moved to using uuids, which solved the problem.