zaneselvans opened this issue 2 days ago
I may just be a superstitious pigeon on this, but it seems like sometimes I get this failure 100% reliably, and other times it's maybe a 50% chance: if it fails and I try to run the ETL again, it might work. But I don't know what influences the probability. Last night I tried to run it 10 times in a row, and they all failed within 2.5 minutes.
Some things I've explored in the past without success:

- Recreating the `pudl-dev` environment seems to fix things. But it's not clear if that's real or me imagining things.
- Using `lsof` I can see that Dagster often has thousands of "files" open at once, and my system as a whole might have 20,000 open when the ETL is running (a quick way to check that number against the per-process descriptor limit is sketched right after this list).
- Looking at the logging databases themselves: there's an `event_logs` table, and that is the only table that contains any records.
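Since "unable to open database file" is one of the errors SQLite reports when it can't get a file descriptor at all, a quick diagnostic (purely illustrative, assumes a POSIX system, and optionally `psutil` on macOS) is to compare the process's open descriptor count against its soft `RLIMIT_NOFILE`:

```python
import os
import resource

# Soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

try:
    # Linux: /proc/self/fd lists every descriptor this process has open.
    open_fds = len(os.listdir("/proc/self/fd"))
except FileNotFoundError:
    # macOS has no /proc; fall back to psutil if it's installed.
    import psutil
    open_fds = psutil.Process().num_fds()

print(f"{open_fds} descriptors open (soft limit {soft}, hard limit {hard})")
```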
As previously addressed in e.g. #2417, #2996, #3003, #3208, and #3211, SQLite can't handle multiple concurrent writes. Our SQLite IO Manager has worked around this for the PUDL DB, but we seem to be hitting a new limit of some kind with Dagster's event logging DB, which still uses SQLite locally by default.
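For anyone who hasn't hit it directly, here's a minimal sketch (throwaway code, not anything from PUDL or Dagster) of the single-writer limitation: while one connection holds the write lock, a second writer with no busy timeout fails immediately with "database is locked".

```python
import sqlite3

DB = "concurrency_demo.sqlite"  # hypothetical scratch path

setup = sqlite3.connect(DB)
setup.execute("CREATE TABLE IF NOT EXISTS event_logs (id INTEGER PRIMARY KEY, msg TEXT)")
setup.commit()
setup.close()

# isolation_level=None lets us manage the transaction by hand.
writer = sqlite3.connect(DB, isolation_level=None)
writer.execute("BEGIN IMMEDIATE")  # take the write lock and hold it
writer.execute("INSERT INTO event_logs (msg) VALUES ('holding the lock')")

# timeout=0 means "fail instead of waiting for the lock to clear".
other = sqlite3.connect(DB, timeout=0)
try:
    other.execute("INSERT INTO event_logs (msg) VALUES ('second writer')")
except sqlite3.OperationalError as err:
    print(err)  # "database is locked"
finally:
    other.close()
    writer.execute("ROLLBACK")
    writer.close()
```

A generous `timeout=` (SQLite's busy timeout) just makes the second writer retry for a while; it doesn't remove the underlying one-writer-at-a-time limit.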
@cmgosnell recently encountered this issue while attempting to debug some integration test failures through the Dagster UI, where it failed basically 100% of the time. I have had the problem on and off, with maybe a 50-75% failure rate on the full ETL.
For me, the failure always seems to happen right after the execution of the `raw_eia860m__all_dfs` asset. Full stack trace here:
This "unable to open database file" error seems somewhat different than the locked DB error that we were getting before due to attempted concurrent writes.
So, is there a new workaround to squeeze some more life out of SQLite? Or what is the easiest way to use Postgres locally for development?
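On the Postgres question: as far as I understand it, Dagster's run, event-log, and schedule storage can all be pointed at Postgres through `dagster.yaml` alone, using the `dagster-postgres` package and a locally running Postgres server. A hedged sketch that drops such a config into `$DAGSTER_HOME` (every connection detail below is a placeholder):

```python
import os
from pathlib import Path

# Unified storage config following the dagster-postgres docs as I understand them.
DAGSTER_YAML = """\
storage:
  postgres:
    postgres_db:
      username: dagster
      password: dagster_password
      hostname: localhost
      db_name: dagster
      port: 5432
"""

dagster_home = Path(os.environ["DAGSTER_HOME"])
(dagster_home / "dagster.yaml").write_text(DAGSTER_YAML)
print(f"Wrote Postgres storage config to {dagster_home / 'dagster.yaml'}")
```

That still leaves the question of whether we want every developer to have to run Postgres just to do a local ETL.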
Also: do we even need to keep `DEBUG` level logging? I don't think we really look at it!
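If most of those records come through standard Python logging (an assumption on my part, worth checking against what Dagster actually persists), one low-effort experiment is to raise the level so `DEBUG` messages are dropped at the logger instead of being written anywhere:

```python
import logging

# Drop DEBUG records globally; "pudl" is assumed to be the parent of our
# getLogger(__name__) module loggers.
logging.getLogger().setLevel(logging.INFO)
logging.getLogger("pudl").setLevel(logging.INFO)
```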