catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
470 stars 108 forks source link

Dagster SQLite EventLog failures #3868

Open zaneselvans opened 2 days ago

zaneselvans commented 2 days ago

As previously addressed in e.g. #2417, #2996, #3003, #3208, and #3211, SQLite can't handle multiple concurrent writes. Our SQLite IO Manager has worked around this for the PUDL DB, but we seem to be hitting a new limit of some kind with Dagster's event logging DB, which still uses SQLite locally by default.

@cmgosnell recently encountered the issue in attempting to debug some integration test failures through the Dagster UI, in which it failed basically 100% of the time. I have had the problem on and off with maybe a 50-75% failure rate on the full ETL.

For me, the failure always seems to happen right after the execution of the raw_eia860m__all_dfs asset.

2024-09-25 21:43:12 -0600 - dagster - DEBUG - etl_full - 3162fc18-e15b-4284-81dd-b46c063d8d35 - 8873 - raw_eia860m__all_dfs - STEP_SUCCESS - Finished execution of step "raw_eia860m__all_dfs" in 2m35s.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: unable to open database file

Full stack trace here:

``` 2024-09-25 21:43:12 -0600 - dagster - DEBUG - etl_full - 3162fc18-e15b-4284-81dd-b46c063d8d35 - 8873 - raw_eia860m__all_dfs - STEP_SUCCESS - Finished execution of step "raw_eia860m__all_dfs" in 2m35s. Process SpawnProcess-1: Traceback (most recent call last): File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context self.dialect.do_execute( File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute cursor.execute(statement, parameters) sqlite3.OperationalError: unable to open database file The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 266, in start_run_in_subprocess _run_in_subprocess( File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 250, in _run_in_subprocess instance.report_engine_event( File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2497, in report_engine_event self.report_dagster_event(dagster_event, run_id=run_id, log_level=log_level) File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2519, in report_dagster_event self.handle_new_event(event_record) File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2417, in handle_new_event self._event_storage.store_event(events[0]) File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/storage/event_log/sqlite/sqlite_event_log.py", line 242, in store_event conn.execute(insert_event_statement) File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute return meth( ^^^^^ File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection return connection._execute_clauseelement( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement ret = self._execute_context( ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context return self._exec_single_context( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context self._handle_dbapi_exception( File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception raise sqlalchemy_exception.with_traceback(exc_info[2]) from e File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context self.dialect.do_execute( File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute cursor.execute(statement, parameters) sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file [SQL: INSERT INTO event_logs (run_id, event, dagster_event_type, timestamp, step_key, asset_key, partition) VALUES (?, ?, ?, ?, ?, ?, ?)] [parameters: ('3162fc18-e15b-4284-81dd-b46c063d8d35', '{"__class__": "EventLogEntry", "dagster_event": {"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": null, ... (350 characters truncated) ... "pipeline_name": "etl_full", "run_id": "3162fc18-e15b-4284-81dd-b46c063d8d35", "step_key": null, "timestamp": 1727322130.5879638, "user_message": ""}', 'ENGINE_EVENT', '2024-09-26 03:42:10.587964', None, None, None)] (Background on this error at: https://sqlalche.me/e/20/e3q8) ```

This "unable to open database file" error seems somewhat different than the locked DB error that we were getting before due to attempted concurrent writes.

So, is there a new workaround to squeeze some more life out of SQLite? Or what is the easiest way to use Postgres locally for development?

zaneselvans commented 2 days ago

I may just be a superstitious pigeon on this but it seems like sometimes I get this failure 100% reliably, and other times it's maybe just a 50% chance, so if it fails, and I try to run the ETL again, it might work. But I don't know what influences the probability. Last night I tried to run it 10 times in a row, and they all failed within 2.5 minutes.

Some things I've explored in the past without success:

zaneselvans commented 2 days ago

Looking at the logging databases themselves: