catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Warn user when a change to the PUDL DB schema is detected #2377

Closed: jdangerx closed this issue 1 year ago

jdangerx commented 1 year ago

This is because we no longer --clobber and drop tables; the tables are created once, by Package.from_resource_list().to_sql(), when the DB is created.

### Scope
- [ ] #2331

#2378 is not in scope. I tried to do this, but because the pudl_sqlite_io_manager now gets created once per process, we run into a bunch of race conditions where different processes try to drop/recreate tables (or, worse, delete/recreate files on disk).

The pysqlite transaction handling has "surprising behavior," so I'm trying to get the transactions to work following this: https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#pysqlite-serializable
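For reference, the workaround in those docs looks roughly like this (a sketch; the engine URL here is just illustrative):

```python
# Sketch of the "pysqlite serializable" recipe from the SQLAlchemy docs linked above.
import sqlalchemy as sa
from sqlalchemy import event

engine = sa.create_engine("sqlite:///pudl.sqlite")  # illustrative path

@event.listens_for(engine, "connect")
def do_connect(dbapi_connection, connection_record):
    # Stop pysqlite from emitting BEGIN itself (and from implicitly
    # committing before DDL statements).
    dbapi_connection.isolation_level = None

@event.listens_for(engine, "begin")
def do_begin(conn):
    # Emit our own BEGIN so SQLAlchemy controls the transaction scope.
    conn.exec_driver_sql("BEGIN")
```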

bendnorman commented 1 year ago

After adding a couple of output tables, I found recreating the database every time a schema changes pretty frustrating. I wonder if, instead of deleting the table data every time an asset is materialized, we could drop the table, recreate the schema, and write the data. I expect SQLAlchemy will throw an error when a table that is the parent in a foreign key relationship is dropped.
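A minimal sketch of that idea, with made-up names standing in for whatever the IO manager actually has in hand:

```python
# Illustrative only: drop the table, recreate it from the current schema, then write.
import pandas as pd
import sqlalchemy as sa

def drop_recreate_and_write(engine: sa.engine.Engine, sa_table: sa.Table, df: pd.DataFrame) -> None:
    with engine.begin() as con:
        # Dropping a table that other tables point at via foreign keys is the
        # step that's expected to blow up, per the concern above.
        sa_table.drop(con, checkfirst=True)
        sa_table.create(con)
        df.to_sql(sa_table.name, con, if_exists="append", index=False)
```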

I think using database migrations is the best option for not having to recreate the database every time. I suspect the overhead of managing migrations is worth not having to rerun the full ETL whenever schemas change.

jdangerx commented 1 year ago

I think that migrations are a good idea, though I'm not quite sure how we want to go about it with all the concurrency still going on. Maybe we have the IO managers each do this in their initialization:

  1. begin transaction (this might implicitly happen when we run any DDL stuff?)
  2. run a migration
  3. end transaction

I think that in theory, if a different process were to try to run a migration in the middle, they would be met with some sort of lockout error. And when they tried again, the migration would be a no-op.
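A rough sketch of what that init-time step could look like with alembic (assuming an alembic.ini at the project root; alembic opens its own connection and transaction internally):

```python
# Illustrative sketch: bring the DB to the latest revision when the IO manager starts.
# A concurrent process doing the same thing should either block on SQLite's database
# lock or find the migration already applied, in which case the upgrade is a no-op.
from alembic import command
from alembic.config import Config

def upgrade_db_on_init(alembic_ini: str = "alembic.ini") -> None:
    cfg = Config(alembic_ini)  # assumed location of the alembic config
    command.upgrade(cfg, "head")
```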

I'm sure there are edge cases that will break that, though - off to read a bit more about how alembic works under the hood!

jdangerx commented 1 year ago

Another thing we could do is:

Which is definitely less error-prone than "try to coordinate several concurrent processes that are all trying to run a db migration".

bendnorman commented 1 year ago

I was thinking we'd run migrations outside of dagster runs so we don't have to recreate the entire database whenever a schema changes. However, that doesn't solve the problem of notifying users when the metadata has changed and they need to recreate the database or apply migrations.

I have a few potential solutions for notifying folks when the metadata has changed:

jdangerx commented 1 year ago

After a little discussion we decided to try to initialize the DB schema (& keep it synced with our code) outside of dagster runs.

I started playing with alembic in the daz-alembic branch, and have gotten the migrations working locally.

But, as I was writing the documentation for how to use it, it felt a little bit too complicated for our use case. Here's what we'd need to do:

Since we clobber all our existing tables in the io_manager anyway (or, I think so: _handle_pandas_output has a con.execute(sa_table.delete())), we don't need all the machinery for preserving the existing data, etc.

So we could get away with a script called, like, reset_db that:

Then the nightly build flow would remain the same, but the local build flow would look like:

Pro-script:

Pro-alembic:

Despite having put some work into alembic-world, I sort of think the script situation is better suited for our needs. Thoughts @bendnorman @zaneselvans @zschira?

bendnorman commented 1 year ago

The alembic work in #2504 is great and works as expected!

I think the biggest benefit of using alembic right now is that we won't have to clobber the entire database every time we change the database metadata. If we think this is worth the overhead of introducing a new dependency and concept then I think we should use Alembic.

Given our fast ETL only takes 5-10 minutes to run, I think it's probably fine to create a script that clobbers the database and recreates all of the schemas. If we get through converting all of these output tables and find this behavior to be a pain in the ass, we can revisit database migrations.
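That clobber-and-recreate script could be very small; a sketch, assuming Package.from_resource_list().to_sql() hands back the SQLAlchemy MetaData as described at the top of the issue (the function name, import path, and path handling are made up):

```python
# Illustrative sketch of a clobber-and-recreate script.
from pathlib import Path

import sqlalchemy as sa
from pudl.metadata.classes import Package  # assumed import path

def reset_db(db_path: Path) -> None:
    """Delete the existing SQLite file and recreate an empty schema from PUDL metadata."""
    db_path.unlink(missing_ok=True)  # clobber the old database file
    engine = sa.create_engine(f"sqlite:///{db_path}")
    # Assuming this returns a sa.MetaData, per the snippet quoted at the top of the issue.
    md = Package.from_resource_list().to_sql()
    md.create_all(engine)  # create all (empty) tables with the current schema
```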

I vote we keep the alembic branch around but go forward with @jdangerx's simple script idea. It solves our concurrency issues, lets folks know when the metadata has changed, and is simpler to manage and understand.

bendnorman commented 1 year ago

Also, it probably makes sense to go with the simplest solution, given that we might transition to writing all assets to parquet files and then loading them into SQLite or DuckDB in a separate job.

jdangerx commented 1 year ago

tl;dr:

I did a bit more digging here after @zaneselvans ran into issues with this workflow:

  1. initialize DB
  2. run full DAG
  3. edit one asset's output schema
  4. try to materialize that asset, fail because of the DB schema mismatch
  5. re-initialize DB, clobbering everything
  6. re-materialize all of the asset's ancestors (unless the asset only depended on pickled assets that live in DAGSTER_HOME)

Where the desired workflow is probably more like:

  1. initialize DB
  2. run full DAG
  3. edit one asset's output schema
  4. run a DB migration
  5. re-materialize that one asset instead of waiting for a bunch of other stuff to run

To "run a DB migration," we can do the following:

$ alembic revision --autogenerate -m "cool migration by a smart person"
$ alembic upgrade head
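For the --autogenerate step to pick up changes, alembic's env.py has to know what the target schema is; a minimal sketch of that hookup, again assuming the metadata comes from Package.from_resource_list().to_sql():

```python
# Sketch of the relevant bit of migrations/env.py: autogenerate diffs the live DB
# against this target_metadata to produce the migration script.
from pudl.metadata.classes import Package  # assumed import path

target_metadata = Package.from_resource_list().to_sql()
```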

However, what if you realize you want to actually change the schema some more? There are two options:

We can just make a new migration:

$ alembic revision --autogenerate -m "cooler migration by a smarter person"
$ alembic upgrade head

Or, we can try to erase that first migration from memory and then re-generate:

$ alembic downgrade <hex_code_for_the_migration_before_you_started_all_this_mess>
$ rm migrations/versions/<hex_code>_cool_migration_by_a_smart_person.py
$ alembic revision --autogenerate -m "cool migration by a smart person, for real this time"
$ alembic upgrade head

If we want, we can try to wrap this all up in the pudl_reset_db script.

If we do that, I'd want to go the "new-migration-for-each-change" route instead of the "smush-everything-into-one-migration" route. It generates a few more migration files, but I think there's a lot less space for error & complexity to creep in.
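Wrapped up that way, the new-migration-for-each-change route might look roughly like this (a sketch only; the entry point name mirrors the proposed pudl_reset_db and the default message is made up):

```python
# Illustrative sketch of pudl_reset_db as a thin wrapper around alembic commands.
from alembic import command
from alembic.config import Config

def pudl_reset_db(message: str = "sync schema with PUDL metadata") -> None:
    cfg = Config("alembic.ini")  # assumed location of the alembic config
    # Generate a new migration from the diff between the DB and the metadata...
    command.revision(cfg, message=message, autogenerate=True)
    # ...and apply it immediately.
    command.upgrade(cfg, "head")
```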

bendnorman commented 1 year ago

The "desired workflow" sounds pretty good to me! I'm not sure what you mean by "wrapping this all up in the pudl_reset_db script". Would pudl_reset_db mostly just be an alias for basic alembic commands? I don't think it's unreasonable to asks users to get familiar with alembic commands.

jdangerx commented 1 year ago

I do mean pudl_reset_db as an alias for alembic commands, and I also think we can just get people to learn to use alembic. It's a pretty common tool, so it won't feel like "ugh, I'm learning this arcane PUDL-specific stuff."

I think, then, that we basically just need to update the documentation in #2523 and merge it in!