catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Nightly Build Failure 2024-04-03 #3533

Closed jdangerx closed 2 months ago

jdangerx commented 2 months ago

Overview

What do you think happened, if anything?

The _out_ferc714__hourly_demand_matrix job ran into an OOM kill.

From the logs we can see that it started at 07:10:43 UTC and got the kill at 07:11:46 UTC.

Here are the VM instance RAM metrics over the 07:00-07:20 UTC window. When the job started we were already using ~35 GB of RAM, and we quickly jumped to 60+ GB before getting killed at 07:12.

[Screenshot: VM instance RAM usage, 07:00-07:20 UTC]

We also saw this same issue on 2024-04-02, but didn't make an issue for it.

Next steps

What next steps do we need to do to understand or remediate the issue?

- We could maybe force this asset to run non-concurrently (see the sketch below).
- We could maybe reduce the RAM usage of this asset.
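
A minimal sketch of what the first option could look like: tag the asset and cap that tag in the multiprocess executor config. The tag name and limits here are made up, and the real `etl_job` in PUDL is defined elsewhere; this just illustrates the mechanism.

```python
from dagster import asset, define_asset_job, multiprocess_executor


# Hypothetical tag; it only has to match the executor config below.
@asset(op_tags={"memory_use": "high"})
def hungry_asset() -> int:
    """Stand-in for _out_ferc714__hourly_demand_matrix."""
    return 1


etl_job = define_asset_job(
    name="etl_job",
    executor_def=multiprocess_executor.configured(
        {
            "max_concurrent": 8,
            # Never run more than one "memory_use: high" step at a time.
            # Steps without the tag are unaffected and keep running concurrently.
            "tag_concurrency_limits": [
                {"key": "memory_use", "value": "high", "limit": 1}
            ],
        }
    ),
)
```

This is essentially the same pattern mentioned below for `core_epacems__hourly_emissions`.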

Verify that everything is fixed!

Once you've applied any necessary fixes, make sure that the nightly build outputs are all in their right places.

- [ ] [S3 distribution bucket](https://s3.console.aws.amazon.com/s3/buckets/pudl.catalyst.coop?region=us-west-2&bucketType=general&prefix=nightly/&showversions=false) was updated at the expected time
- [ ] [GCP distribution bucket](https://console.cloud.google.com/storage/browser/pudl.catalyst.coop/nightly;tab=objects?project=catalyst-cooperative-pudl) was updated at the expected time
- [ ] [GCP internal bucket](https://console.cloud.google.com/storage/browser/builds.catalyst.coop) was updated at the expected time
- [ ] [Datasette PUDL version](https://data.catalyst.coop/pudl/core_pudl__codes_datasources) points at the same hash as [nightly](https://github.com/catalyst-cooperative/pudl/tree/nightly)
- [ ] [Zenodo sandbox record](https://sandbox.zenodo.org/doi/10.5072/zenodo.5563) was updated to the record number in the logs (search for `zenodo_data_release.py` and `Draft` in the logs, to see what the new record number should be!)

Relevant logs

link to build logs from internal distribution bucket

```
2024-04-03 07:10:03 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - _out_ferc714__hourly_demand_matrix - STEP_WORKER_STARTING - Launching subprocess for "_out_ferc714__hourly_demand_matrix".
2024-04-03 07:10:40 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - STEP_WORKER_STARTED - Executing step "_out_ferc714__hourly_demand_matrix" in subprocess.
2024-04-03 07:10:42 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - RESOURCE_INIT_STARTED - Starting initialization of resources [io_manager].
2024-04-03 07:10:42 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - RESOURCE_INIT_SUCCESS - Finished initialization of resources [io_manager].
2024-04-03 07:10:43 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 23579 - _out_ferc714__hourly_demand_matrix - STEP_START - Started execution of step "_out_ferc714__hourly_demand_matrix".
2024-04-03 07:11:46 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - _out_ferc714__hourly_demand_matrix - ENGINE_EVENT - Multiprocess executor: child process for step _out_ferc714__hourly_demand_matrix was terminated by signal 9 (SIGKILL). This usually indicates that the process was killed by the operating system due to running out of memory. Possible solutions include increasing the amount of memory available to the run, reducing the amount of memory used by the ops in the run, or configuring the executor to run fewer ops concurrently.
```

There were several other tasks running concurrently during those two minutes:

```
> ag --no-numbers '07:1[01].*STEP' 2024-04-03-0602-c28abb2be-main-pudl-etl.log | cut -d ' ' -f15 | sort | uniq
_core_eia860__boilers
_core_eia860__ownership
_core_eia860__plants
_core_eia860__utilities
_core_eia923__boiler_fuel
_core_eia923__coalmine
_core_eia923__cooling_system_information
_core_eia923__fgd_operation_maintenance
_out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs
_out_ferc714__hourly_demand_matrix
core_demand_side_management_eia861
core_eia861__yearly_balancing_authority
core_eia861__yearly_dynamic_pricing
extract_phmsagas
raw_eia930__balance
raw_eia930__interchange
raw_eia930__subregion
```

All the dagster steps:

```
> ag --no-numbers '07:1[01].*STEP' 2024-04-03-0602-c28abb2be-main-pudl-etl.log | cut -d ' ' -f1 -f2 -f15 -f17
2024-04-03 07:07:11 core_demand_side_management_eia861 STEP_WORKER_STARTED
2024-04-03 07:07:11 core_eia861__yearly_dynamic_pricing STEP_WORKER_STARTING
2024-04-03 07:10:00 extract_phmsagas STEP_WORKER_STARTING
2024-04-03 07:10:00 _core_eia860__ownership STEP_WORKER_STARTED
2024-04-03 07:10:02 _core_eia860__plants STEP_WORKER_STARTED
2024-04-03 07:10:03 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTING
2024-04-03 07:10:03 _core_eia860__ownership STEP_START
2024-04-03 07:10:03 _core_eia860__ownership STEP_INPUT
2024-04-03 07:10:04 _core_eia860__utilities STEP_WORKER_STARTED
2024-04-03 07:10:04 _core_eia860__plants STEP_START
2024-04-03 07:10:05 _core_eia860__plants STEP_INPUT
2024-04-03 07:10:06 _core_eia860__utilities STEP_START
2024-04-03 07:10:07 _core_eia860__utilities STEP_INPUT
2024-04-03 07:10:11 _core_eia860__utilities STEP_OUTPUT
2024-04-03 07:10:17 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTING
2024-04-03 07:10:28 _core_eia860__plants STEP_OUTPUT
2024-04-03 07:10:29 _core_eia860__ownership STEP_OUTPUT
2024-04-03 07:10:29 _core_eia860__ownership STEP_SUCCESS
2024-04-03 07:10:30 extract_phmsagas STEP_WORKER_STARTED
2024-04-03 07:10:31 _core_eia860__plants STEP_SUCCESS
2024-04-03 07:10:36 raw_eia930__balance STEP_WORKER_STARTING
2024-04-03 07:10:37 raw_eia930__interchange STEP_WORKER_STARTING
2024-04-03 07:10:37 extract_phmsagas STEP_START
2024-04-03 07:10:40 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTED
2024-04-03 07:10:41 extract_phmsagas STEP_INPUT
2024-04-03 07:10:41 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 _core_eia860__boilers STEP_OUTPUT
2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_START
2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_INPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:47 raw_eia930__subregion STEP_WORKER_STARTING
2024-04-03 07:10:50 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTED
2024-04-03 07:10:51 core_eia861__yearly_balancing_authority STEP_WORKER_STARTING
2024-04-03 07:10:57 raw_eia930__balance STEP_WORKER_STARTED
2024-04-03 07:10:57 raw_eia930__interchange STEP_WORKER_STARTED
2024-04-03 07:10:59 raw_eia930__balance STEP_START
2024-04-03 07:10:59 raw_eia930__interchange STEP_START
2024-04-03 07:11:00 _core_eia923__boiler_fuel STEP_WORKER_STARTING
2024-04-03 07:11:05 raw_eia930__subregion STEP_WORKER_STARTED
2024-04-03 07:11:08 raw_eia930__subregion STEP_START
2024-04-03 07:11:11 core_eia861__yearly_balancing_authority STEP_WORKER_STARTED
2024-04-03 07:11:18 core_eia861__yearly_balancing_authority STEP_START
2024-04-03 07:11:19 core_eia861__yearly_balancing_authority STEP_INPUT
2024-04-03 07:11:22 _core_eia923__boiler_fuel STEP_WORKER_STARTED
2024-04-03 07:11:23 _core_eia923__coalmine STEP_WORKER_STARTING
2024-04-03 07:11:24 _core_eia923__boiler_fuel STEP_START
2024-04-03 07:11:26 _core_eia923__boiler_fuel STEP_INPUT
2024-04-03 07:11:46 _out_ferc714__hourly_demand_matrix STEP_FAILURE
2024-04-03 07:11:46 _core_eia923__cooling_system_information STEP_WORKER_STARTING
2024-04-03 07:11:52 raw_eia930__interchange STEP_INPUT
2024-04-03 07:11:52 raw_eia930__interchange STEP_OUTPUT
2024-04-03 07:11:52 _core_eia923__boiler_fuel STEP_OUTPUT
2024-04-03 07:11:53 _core_eia923__coalmine STEP_WORKER_STARTED
2024-04-03 07:11:53 raw_eia930__balance STEP_INPUT
2024-04-03 07:11:53 raw_eia930__balance STEP_OUTPUT
2024-04-03 07:11:55 _core_eia923__coalmine STEP_START
2024-04-03 07:11:57 raw_eia930__subregion STEP_INPUT
2024-04-03 07:11:57 raw_eia930__subregion STEP_OUTPUT
2024-04-03 07:11:57 raw_eia930__subregion STEP_SUCCESS
2024-04-03 07:11:57 _core_eia923__fgd_operation_maintenance STEP_WORKER_STARTING
2024-04-03 07:11:58 _core_eia923__coalmine STEP_INPUT
```

bendnorman commented 2 months ago

If we're confident _out_ferc714__hourly_demand_matrix is the asset that is blowing up memory, we can limit concurrency when the asset is run. We do this for the core_epacems__hourly_emissions asset because using all CPUs locally destroys our computers.

Side note: I don't know if we're using all CPUs for processing EPA CEMS on the VM. Is this intentional? In gcp_pudl_etl.sh we're using the default number of workers for the --epacems-workers argument, which is 2.

zaneselvans commented 2 months ago

With 64 GB of memory in the nightly builds, we could definitely be using more cores on EPA CEMS, but we're approaching 400 assets, and it's not deterministic which ones will be running at the same time. Right now the FERC-714 imputations and the EPA CEMS are the long poles. I really wish there were some way to chunk the EPA CEMS into a smaller memory footprint. The state-year chunks we had before were very manageable and let us do a lot more parallelism.

I wish there were some way for Dagster to explicitly manage memory usage as well as CPU usage.

zaneselvans commented 2 months ago

Short-term, reducing the concurrency for the internally parallelized work of the FERC-714 asset seems like the simplest fix!
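
The "internally parallelized work" here is parallelism inside the asset itself rather than across Dagster steps. A hypothetical sketch of the kind of change being discussed (the helper names and chunking are invented; this is not the actual FERC-714 imputation code):

```python
from concurrent.futures import ProcessPoolExecutor


def _impute_one_chunk(chunk):
    """Hypothetical stand-in for the real per-chunk imputation work."""
    return chunk


def impute_hourly_demand(chunks, max_workers: int = 2):
    """Hypothetical helper: cap the worker pool so peak memory stays bounded.

    Each worker holds its slice of the demand matrix in memory, so fewer
    workers means a smaller peak footprint at the cost of wall-clock time.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_impute_one_chunk, chunks))
```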

jdangerx commented 2 months ago

It's definitely simple, though it's possible some other task is also taking up a bunch of RAM and this just happened to get killed both times. Worth a shot while we figure out a better strategy re: memory usage.

zaneselvans commented 2 months ago

I would be tempted to blame the extremely parallelized extraction of spreadsheets, which we've just added a bunch of additional work to, but that happens right at the beginning of the ETL, and this crash isn't happening for 5 hours!

bendnorman commented 2 months ago

Sounds good! Are there other suspects in the list of tasks that we were running when we got the OOM error?

zaneselvans commented 2 months ago

TBH I am also very confused how the ETL can take 5 hours to get to this point. With 16 CPUs and 64 GB of memory, it's 5 hours into doing something that takes my laptop with 10 CPUs and 32 GB of memory 2 hours tops.

zaneselvans commented 2 months ago

The only other thing in there that I could imagine being unusually resource intensive is the FERC to FERC matching step, but I don't know if that particular merge is a big memory hog.

jdangerx commented 2 months ago

> TBH I am also very confused how the ETL can take 5 hours to get to this point. With 16 CPUs and 64 GB of memory, it's 5 hours into doing something that takes my laptop with 10 CPUs and 32 GB of memory 2 hours tops.

The ETL returns with a failure code 5 hours in, but if you look at the timestamps we see that the ETL starts at 06:08 UTC:

```
2024-04-03 06:08:39 +0000 - dagster - DEBUG - ferc_to_sqlite_job - 20634ea8-e16f-4b96-b880-c208266b0a27 - 102 - RUN_START - Started execution of run for "ferc_to_sqlite_job".
```

And then the OOM happens an hour in, at 07:11:

```
2024-04-03 07:11:46 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - _out_ferc714__hourly_demand_matrix - ENGINE_EVENT - Multiprocess executor: child process for step _out_ferc714__hourly_demand_matrix was terminated by signal 9 (SIGKILL). This usually indicates that the process was killed by the operating system due to running out of memory. Possible solutions include increasing the amount of memory available to the run, reducing the amount of memory used by the ops in the run, or configuring the executor to run fewer ops concurrently.
```

And then finally the run fails:

```
2024-04-03 11:00:06 +0000 - dagster - ERROR - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - RUN_FAILURE - Execution of run for "etl_job" failed. Steps failed: ['_out_ferc714__hourly_demand_matrix'].
```

zaneselvans commented 2 months ago

Oooh, interesting. So it's just the one asset materialization that fails, and everything else keeps running -- the VM doesn't crash. I wonder if we could just add a retry to this asset?
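
A retry could look roughly like the sketch below (the decorator arguments are illustrative, not the existing PUDL definition). The open question, raised later in the thread, is whether Dagster will re-run a step whose subprocess was SIGKILLed rather than one that raised an exception.

```python
from dagster import Backoff, RetryPolicy, asset


@asset(
    retry_policy=RetryPolicy(
        max_retries=2,
        delay=300,  # seconds to wait before retrying, to let memory pressure ease
        backoff=Backoff.EXPONENTIAL,
    )
)
def hungry_asset() -> int:
    """Stand-in for _out_ferc714__hourly_demand_matrix."""
    return 1
```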

bendnorman commented 2 months ago

On second thought, I might be misunderstanding Dagster's concurrency limit feature:

> Limits can be specified for all ops/assets with a specific tag key or key-value pair. If any limit is exceeded by launching an op/asset, then the op/asset will be queued.

I think limiting pudl.etl.epacems_assets.process_single_year using tags works because we are running this function multiple times. Our default value for CEMS is 2. Does the tag limiting capability restrict the number of processes available to the specific op, or to all ops trying to run in the job? For example, when pudl.etl.epacems_assets.process_single_year is running, can other ops be running?

jdangerx commented 2 months ago

This happened again on 2024-04-04, which is to be expected because #3541 wasn't merged yet. Just noting that down here for posterity.

zaneselvans commented 2 months ago

@jdangerx do you think that #3541 would have any effect? There's already only 1 CPU being used in the asset that's getting killed.

This problem seems to be happening reliably in the nightly builds, but I've still never seen it running the full ETL locally on a machine with half as much memory. If it's just that a couple of high-memory assets happen to be running at the same time, could we use an op retry to just keep attempting to materialize it until there's enough free memory for it to complete? Or does the fact that it's getting killed rather than raising an exception mean that won't work? Locally this asset takes ~3 minutes to run and peaks at ~4 GB of memory usage.
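
For reference, one way to sanity-check that local peak-memory number when running the asset in-process (a sketch; on Linux `ru_maxrss` is reported in kilobytes, and this won't see memory used by subprocesses the multiprocess executor spawns):

```python
import resource


def peak_rss_gb() -> float:
    """Peak resident set size of this Python process, in GB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6


# ...materialize the asset in-process here, then:
print(f"peak RSS so far: {peak_rss_gb():.1f} GB")
```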

bendnorman commented 2 months ago

I think the ETL fails more often on the VM because GCP VMs do not have swap enabled by default:

> Caution: You must ensure that tmpfs volumes on your instance do not use up all of your instance memory. By default, Compute Engine instances do not have any swap space. If you use all of the available instance memory in a RAM disk, the instance can crash and potentially cause you to lose your data.

I don't think it's trivial to enable it.