Closed jdangerx closed 1 year ago
After adding a couple of output tables I found recreating the database every time a schema changes pretty frustrating. I wonder if instead of deleting the table data every time an asset is materialized, we drop the table, recreate the schema and write the data. I expect sqlalchemy will throw an error when the table that is a parent in a foreign key relationship is dropped.
I think using database migrations is the best option for not having to recreate the database every time. I suspect managing the overhead of migrations is worth not having to rerun the full ETL whenever schemas change.
I think that migrations are a good idea, though I'm not quite sure how we want to go about it with all the concurrency still going on. Maybe we have the IO managers each do this in their initialization:
I think that in theory, if a different process were to try to run a migration in the middle, they would be met with some sort of lockout error. And when they tried again, the migration would be a no-op.
I'm sure there are edge cases that will break that, though - off to read a bit more about how alembic works under the hood!
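The init-time idea above could be sketched with stdlib `sqlite3` alone. This is a hypothetical sketch, not PUDL code: `schema_version` and `migrate_if_needed` are made-up stand-ins for what alembic would actually manage. The point is the locking behavior: a second process that hits the write lock mid-migration gets an `OperationalError`, and on retry the migration is a no-op.

```python
import sqlite3

SCHEMA_VERSION = 1  # illustrative; alembic tracks revisions for real


def migrate_if_needed(db_path: str) -> bool:
    """Bring the schema up to date. Returns True if any work was done."""
    # isolation_level=None puts pysqlite in autocommit mode so we control
    # transactions explicitly; timeout=0 fails fast instead of waiting on
    # another process's lock.
    conn = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        # Take SQLite's write lock up front. A concurrent caller raises
        # sqlite3.OperationalError: database is locked.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)"
        )
        row = conn.execute("SELECT version FROM schema_version").fetchone()
        if row is not None and row[0] >= SCHEMA_VERSION:
            conn.execute("ROLLBACK")
            return False  # already up to date: a no-op for latecomers
        conn.execute("DELETE FROM schema_version")
        conn.execute("INSERT INTO schema_version VALUES (?)", (SCHEMA_VERSION,))
        conn.execute("COMMIT")
        return True
    finally:
        conn.close()
```

The first process to call this does the work; every later caller either sees the lock error (and retries) or finds the version current and returns immediately.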
Another thing we could do is:
Which is definitely less error-prone than "try to coordinate several concurrent processes that are all trying to run a db migration"
I was thinking we'd run migrations outside of dagster runs so we don't have to recreate the entire database whenever a schema changes. However, that doesn't solve the problem of notifying users when the metadata has changed and they need to recreate the database or apply migrations.
I have a few potential solutions for notifying folks when the metadata has changed:
1. Catch `sqlite3.OperationalError` in `SQLiteIOManager.handle_output()` and add an additional message like "If you recently updated resource definitions in `pudl.metadata.resources`, delete the `pudl.sqlite` database and rerun the ETL". This is fine, but folks might end up clobbering their entire database when an asset dataframe does have a table schema in the database but the actual data in the df violates a constraint in the schema. That won't break anything but might create some unnecessary headaches.
2. Compare the metadata against the existing database schema in `SQLiteIOManager.handle_output()`. I don't think this will result in concurrency issues because the database file and metadata will have been created by the time `handle_output()` is called. The comparison could happen before calling `df.to_sql` or after a `sqlite3.OperationalError` is thrown.

After a little discussion we decided to try to initialize the DB schema (and keep it synced with our code) outside of dagster runs.
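Option 1 could be sketched roughly like this. It is a hypothetical illustration, not the actual `SQLiteIOManager` code: `write_with_hint` and the SQL here are made up to show the catch-and-annotate pattern.

```python
import sqlite3

HINT = (
    "If you recently updated resource definitions in "
    "pudl.metadata.resources, delete the pudl.sqlite database "
    "and rerun the ETL."
)


def write_with_hint(conn: sqlite3.Connection, sql: str, rows: list) -> None:
    """Run an insert, re-raising schema errors with an actionable hint."""
    try:
        conn.executemany(sql, rows)
    except sqlite3.OperationalError as err:
        # Keep the original error text but append the PUDL-specific hint.
        raise sqlite3.OperationalError(f"{err} -- {HINT}") from err
```

A write against a stale schema (say, a column that only exists in the new metadata) would then fail with both the original SQLite message and the hint, instead of a bare `OperationalError`.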
I started playing with alembic in the `daz-alembic` branch, and have gotten the migrations working locally.
But, as I was writing the documentation for how to use it, it felt a little bit too complicated for our use case. Here's what we'd need to do:
* run `alembic upgrade` in the pipeline. Easy.
* run `alembic revision --autogenerate -m "some message"`, look over the autogenerated migration, and then run `alembic upgrade`. Kind of a pain.

Since we clobber all our existing tables anyways in the io_manager (or, I think so: `_handle_pandas_output` has a `con.execute(sa_table.delete())`), we don't need all the stuff about trying to keep the existing data around, etc.
So we could get away with a script called, like, `reset_db` that:
Then the nightly build flow would remain the same, but the local build flow would look like:
run `reset_db` and things are hunky-dory again.

Pro-script:
Pro-alembic:
Despite having put some work into alembic-world, I sort of think the script situation is better suited for our needs. Thoughts @bendnorman @zaneselvans @zschira?
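Roughly, such a `reset_db` script might look like the following. This is a hypothetical sketch using stdlib `sqlite3` with illustrative DDL; in PUDL the real schema would come from the metadata classes (`Package.from_resource_list().to_sql()`), not a hardcoded list.

```python
import os
import sqlite3

# Illustrative stand-in for the schema derived from pudl.metadata.
ILLUSTRATIVE_DDL = [
    "CREATE TABLE plants (id INTEGER PRIMARY KEY, name TEXT)",
    "CREATE TABLE generators ("
    "  id INTEGER PRIMARY KEY,"
    "  plant_id INTEGER REFERENCES plants (id)"
    ")",
]


def reset_db(db_path: str) -> None:
    """Delete the database file and recreate an empty, current schema."""
    if os.path.exists(db_path):
        os.remove(db_path)  # clobber: no data survives, by design
    conn = sqlite3.connect(db_path)
    try:
        for ddl in ILLUSTRATIVE_DDL:
            conn.execute(ddl)
        conn.commit()
    finally:
        conn.close()
```

Because the whole file is removed and recreated, there is no migration state to coordinate: any schema change is handled by rerunning the script and then rerunning the ETL.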
The alembic work in #2504 is great and works as expected!
I think the biggest benefit of using alembic right now is that we won't have to clobber the entire database every time we change the database metadata. If we think this is worth the overhead of introducing a new dependency and concept then I think we should use Alembic.
Given our fast ETL only takes 5-10 min to run, I think it's probably fine to create a script that clobbers the database, and recreates all of the schemas. If we get through converting all of these output tables and find this behavior to be a pain in the ass we can revisit database migrations.
I vote we keep the alembic branch around but go forward with @jdangerx's simple script idea. It solves our concurrency issues, lets folks know when the metadata has changed and it's simpler to manage and understand.
Also it probably makes sense to go with the simplest solution given we might transition to writing all assets to parquet files and then loading them into sqlite or duckdb in a separate job.
tl;dr: we could use the stock `alembic` tooling or adapt our `pudl_reset_db` script to make it a little bit smoother. Which should we do? I'm leaning towards using the stock `alembic` tooling.

I did a bit more digging here after @zaneselvans ran into issues with this workflow:
`DAGSTER_HOME`)

Where the desired workflow is probably more like:
To "run a DB migration," we can do the following:
```
$ alembic revision --autogenerate -m "cool migration by a smart person"
$ alembic upgrade head
```
However, what if you realize you want to actually change the schema some more? There are two options:
We can just make a new migration:
```
$ alembic revision --autogenerate -m "cooler migration by a smarter person"
$ alembic upgrade head
```
Or, we can try to erase that first migration from memory and then re-generate:
```
$ alembic downgrade <hex_code_for_the_migration_before_you_started_all_this_mess>
$ rm migrations/versions/<hex_code>_cool_migration_by_a_smart_person.py
$ alembic revision --autogenerate -m "cool migration by a smart person, for real this time"
$ alembic upgrade head
```
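If you lose track of which revision a given database has applied, alembic records it in an `alembic_version` table (a single `version_num` column) inside the database itself. A quick way to check without the alembic CLI, sketched as a hypothetical stdlib-only helper:

```python
import sqlite3


def current_revision(db_path):
    """Return the alembic revision applied to this DB, or None if unmigrated."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT version_num FROM alembic_version"
        ).fetchone()
        return row[0] if row else None
    except sqlite3.OperationalError:
        # No alembic_version table: alembic has never touched this DB.
        return None
    finally:
        conn.close()
```

This is also handy for sanity-checking whether a stale `pudl.sqlite` matches the migrations currently in `migrations/versions/`.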
If we want, we can try to wrap this all up in the `pudl_reset_db` script.
If we do that, I'd want to go the "new-migration-for-each-change" route instead of the "smush-everything-into-one-migration" route. It generates a few more migration files, but I think there's a lot less space for error & complexity to creep in.
The "desired workflow" sounds pretty good to me! I'm not sure what you mean by "wrapping this all up in the pudl_reset_db script". Would pudl_reset_db
mostly just be an alias for basic alembic commands? I don't think it's unreasonable to asks users to get familiar with alembic commands.
I do mean `pudl_reset_db` as an alias for alembic commands, and I also think that we can just get people to learn to use `alembic`. It's a pretty common tool, so it won't feel like "ugh, I'm learning this arcane PUDL-specific stuff."
I think, then, that we basically just need to update the documentation in #2523 and merge it in!
This is because we no longer `--clobber` and drop tables; the tables are created from the `Package.from_resource_list().to_sql()` one time when the DB is created.

#2378 is not in scope. I tried to do this, but because the `pudl_sqlite_io_manager` now gets created once per process, we run into a bunch of race conditions where the different processes are trying to drop/recreate tables (or, worse, delete/recreate files on disk).

The pysqlite transaction handling has "surprising behavior," so I'm trying to get the transactions to work following this: https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#pysqlite-serializable
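The "surprising behavior" the linked doc describes: by default pysqlite issues `BEGIN` lazily, only before INSERT/UPDATE/DELETE, and never before SELECT or DDL, so you can't get truly serializable transactions. The documented workaround is to take over transaction control yourself: open the connection in autocommit mode (`isolation_level=None`) and emit `BEGIN` explicitly (SQLAlchemy does the equivalent with `connect`/`begin` event listeners). A minimal stdlib demonstration:

```python
import sqlite3

# Autocommit mode: pysqlite never issues implicit BEGINs behind our back.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (x INTEGER)")

conn.execute("BEGIN")  # explicit transaction start, on our terms
conn.execute("INSERT INTO t VALUES (1)")
assert conn.in_transaction  # we really are inside a transaction now
conn.execute("ROLLBACK")

# The rollback took effect because we controlled the transaction boundaries.
assert conn.execute("SELECT COUNT(*) FROM t").fetchone()[0] == 0
```

With the default `isolation_level`, the transaction would not have started where we expected, which is exactly the class of surprise that bites when several processes share one SQLite file.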