catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Warn user when a change to the PUDL DB schema is detected #2377

Closed: jdangerx closed this issue 1 year ago

jdangerx commented 1 year ago

This is because we no longer --clobber and drop tables; the tables are created once, by Package.from_resource_list().to_sql(), when the DB is created.

### Scope
- [ ] #2331

#2378 is not in scope. I tried to do this, but because the pudl_sqlite_io_manager now gets created once per process, we run into a bunch of race conditions where different processes try to drop/recreate tables (or, worse, delete/recreate files on disk).

The pysqlite transaction handling has "surprising behavior," so I'm trying to get the transactions to work following this: https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#pysqlite-serializable
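For reference, the workaround in those docs looks roughly like this (a sketch; the engine URL here is just illustrative):

```python
# Sketch of the "pysqlite serializable" recipe from the SQLAlchemy docs linked above.
import sqlalchemy as sa
from sqlalchemy import event

engine = sa.create_engine("sqlite:///pudl.sqlite")  # illustrative path

@event.listens_for(engine, "connect")
def do_connect(dbapi_connection, connection_record):
    # Stop pysqlite from emitting BEGIN itself (and from implicitly
    # committing before DDL statements).
    dbapi_connection.isolation_level = None

@event.listens_for(engine, "begin")
def do_begin(conn):
    # Emit our own BEGIN so SQLAlchemy controls the transaction scope.
    conn.exec_driver_sql("BEGIN")
```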

bendnorman commented 1 year ago

After adding a couple of output tables, I found recreating the database every time a schema changes pretty frustrating. I wonder if, instead of deleting the table data every time an asset is materialized, we could drop the table, recreate the schema, and write the data. I expect SQLAlchemy will throw an error when a table that is the parent in a foreign key relationship is dropped.
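A minimal sketch of that idea, with made-up names standing in for whatever the IO manager actually has in hand:

```python
# Illustrative only: drop the table, recreate it from the current schema, then write.
import pandas as pd
import sqlalchemy as sa

def drop_recreate_and_write(engine: sa.engine.Engine, sa_table: sa.Table, df: pd.DataFrame) -> None:
    with engine.begin() as con:
        # Dropping a table that other tables point at via foreign keys is the
        # step that's expected to blow up, per the concern above.
        sa_table.drop(con, checkfirst=True)
        sa_table.create(con)
        df.to_sql(sa_table.name, con, if_exists="append", index=False)
```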

I think using database migrations is the best option for not having to recreate the database every time. I suspect the overhead of managing migrations is worth not having to rerun the full ETL whenever schemas change.

jdangerx commented 1 year ago

I think that migrations are a good idea, though I'm not quite sure how we want to go about it with all the concurrency still going on. Maybe we have the IO managers each do this in their initialization:

  1. begin transaction (this might implicitly happen when we run any DDL stuff?)
  2. run a migration
  3. end transaction

I think that in theory, if a different process were to try to run a migration in the middle, they would be met with some sort of lockout error. And when they tried again, the migration would be a no-op.
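A rough sketch of what that init-time step could look like with alembic (assuming an alembic.ini at the project root; alembic opens its own connection and transaction internally):

```python
# Illustrative sketch: bring the DB to the latest revision when the IO manager starts.
# A concurrent process doing the same thing should either block on SQLite's database
# lock or find the migration already applied, in which case the upgrade is a no-op.
from alembic import command
from alembic.config import Config

def upgrade_db_on_init(alembic_ini: str = "alembic.ini") -> None:
    cfg = Config(alembic_ini)  # assumed location of the alembic config
    command.upgrade(cfg, "head")
```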

I'm sure there are edge cases that will break that, though - off to read a bit more about how alembic works under the hood!

jdangerx commented 1 year ago

Another thing we could do is:

Which is definitely less error-prone than "try to coordinate several concurrent processes that are all trying to run a db migration".

bendnorman commented 1 year ago

I was thinking we'd run migrations outside of dagster runs so we don't have to recreate the entire database whenever a schema changes. However, that doesn't solve the problem of notifying users when the metadata has changed and they need to recreate the database or apply migrations.

I have a few potential solutions for notifying folks when the metadata has changed:

jdangerx commented 1 year ago

After a little discussion we decided to try to initialize the DB schema (& keep it synced with our code) outside of dagster runs.

I started playing with alembic in the daz-alembic branch, and have gotten the migrations working locally.

But, as I was writing the documentation for how to use it, it felt a little bit too complicated for our use case. Here's what we'd need to do:

Since we clobber all our existing tables in the io_manager anyway (or, I think so: _handle_pandas_output has a con.execute(sa_table.delete())), we don't need all the machinery for preserving the existing data, etc.

So we could get away with a script called, like, reset_db that:

Then the nightly build flow would remain the same, but the local build flow would look like:

Pro-script:

Pro-alembic:

Despite having put some work into alembic-world, I sort of think the script situation is better suited for our needs. Thoughts @bendnorman @zaneselvans @zschira?

bendnorman commented 1 year ago

The alembic work in #2504 is great and works as expected!

I think the biggest benefit of using alembic right now is that we won't have to clobber the entire database every time we change the database metadata. If we think this is worth the overhead of introducing a new dependency and concept then I think we should use Alembic.

Given our fast ETL only takes 5-10 minutes to run, I think it's probably fine to create a script that clobbers the database and recreates all of the schemas. If we get through converting all of these output tables and find this behavior to be a pain in the ass, we can revisit database migrations.
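That clobber-and-recreate script could be very small; a sketch, assuming Package.from_resource_list().to_sql() hands back the SQLAlchemy MetaData as described at the top of the issue (the function name, import path, and path handling are made up):

```python
# Illustrative sketch of a clobber-and-recreate script.
from pathlib import Path

import sqlalchemy as sa
from pudl.metadata.classes import Package  # assumed import path

def reset_db(db_path: Path) -> None:
    """Delete the existing SQLite file and recreate an empty schema from PUDL metadata."""
    db_path.unlink(missing_ok=True)  # clobber the old database file
    engine = sa.create_engine(f"sqlite:///{db_path}")
    # Assuming this returns a sa.MetaData, per the snippet quoted at the top of the issue.
    md = Package.from_resource_list().to_sql()
    md.create_all(engine)  # create all (empty) tables with the current schema
```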

I vote we keep the alembic branch around but go forward with @jdangerx's simple script idea. It solves our concurrency issues, lets folks know when the metadata has changed, and is simpler to manage and understand.

bendnorman commented 1 year ago

Also, it probably makes sense to go with the simplest solution, given that we might transition to writing all assets to parquet files and then loading them into SQLite or DuckDB in a separate job.

jdangerx commented 1 year ago

tl;dr:

I did a bit more digging here after @zaneselvans ran into issues with this workflow:

  1. initialize DB
  2. run full DAG
  3. edit one asset's output schema
  4. try to materialize that asset, fail because of the DB schema mismatch
  5. re-initialize DB, clobbering everything
  6. re-materialize all of the asset's ancestors (unless the asset only depended on pickled assets that live in DAGSTER_HOME)

Where the desired workflow is probably more like:

  1. initialize DB
  2. run full DAG
  3. edit one asset's output schema
  4. run a DB migration
  5. re-materialize that one asset instead of waiting for a bunch of other stuff to run

To "run a DB migration," we can do the following:

$ alembic revision --autogenerate -m "cool migration by a smart person"
$ alembic upgrade head
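For the --autogenerate step to pick up changes, alembic's env.py has to know what the target schema is; a minimal sketch of that hookup, again assuming the metadata comes from Package.from_resource_list().to_sql():

```python
# Sketch of the relevant bit of migrations/env.py: autogenerate diffs the live DB
# against this target_metadata to produce the migration script.
from pudl.metadata.classes import Package  # assumed import path

target_metadata = Package.from_resource_list().to_sql()
```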

However, what if you realize you want to actually change the schema some more? There are two options:

We can just make a new migration:

$ alembic revision --autogenerate -m "cooler migration by a smarter person"
$ alembic upgrade head

Or, we can try to erase that first migration from memory and then re-generate:

$ alembic downgrade <hex_code_for_the_migration_before_you_started_all_this_mess>
$ rm migrations/versions/<hex_code>_cool_migration_by_a_smart_person.py
$ alembic revision --autogenerate -m "cool migration by a smart person, for real this time"
$ alembic upgrade head

If we want, we can try to wrap this all up in the pudl_reset_db script.

If we do that, I'd want to go the "new-migration-for-each-change" route instead of the "smush-everything-into-one-migration" route. It generates a few more migration files, but I think there's a lot less space for error & complexity to creep in.
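Wrapped up that way, the new-migration-for-each-change route might look roughly like this (a sketch only; the entry point name mirrors the proposed pudl_reset_db and the default message is made up):

```python
# Illustrative sketch of pudl_reset_db as a thin wrapper around alembic commands.
from alembic import command
from alembic.config import Config

def pudl_reset_db(message: str = "sync schema with PUDL metadata") -> None:
    cfg = Config("alembic.ini")  # assumed location of the alembic config
    # Generate a new migration from the diff between the DB and the metadata...
    command.revision(cfg, message=message, autogenerate=True)
    # ...and apply it immediately.
    command.upgrade(cfg, "head")
```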

bendnorman commented 1 year ago

The "desired workflow" sounds pretty good to me! I'm not sure what you mean by "wrapping this all up in the pudl_reset_db script". Would pudl_reset_db mostly just be an alias for basic alembic commands? I don't think it's unreasonable to asks users to get familiar with alembic commands.

jdangerx commented 1 year ago

I do mean pudl_reset_db as an alias for alembic commands, and I also think we can just get people to learn to use alembic. It's a pretty common tool, so it won't feel like "ugh, I'm learning this arcane PUDL-specific stuff."

I think, then, that we basically just need to update the documentation in #2523 and merge it in!