jdangerx closed this issue 2 months ago
If we're confident `_out_ferc714__hourly_demand_matrix` is the asset that is blowing up memory, we can limit concurrency when the asset is run. We do this for the `core_epacems__hourly_emissions` asset because using all CPUs locally destroys our computers.
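For reference, Dagster's multiprocess executor lets you cap concurrency by tag via run config. A sketch, assuming we tag the heavy asset with a hypothetical `memory-use: high` tag (the tag names here are illustrative, not what our jobs actually use):

```yaml
# Run config for the multiprocess executor (hypothetical tag key/value).
execution:
  config:
    multiprocess:
      max_concurrent: 8
      tag_concurrency_limits:
        # At most one op/asset tagged memory-use=high runs at a time;
        # ops without this tag are unaffected and keep running concurrently.
        - key: "memory-use"
          value: "high"
          limit: 1
```

The asset itself would need the matching tag, e.g. `@asset(op_tags={"memory-use": "high"})`.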
Side note: I don't know whether we're using all CPUs for processing EPA CEMS on the VM. Is this intentional? In `gcp_pudl_etl.sh` we're using the default number of workers for the `--epacems-workers` argument, which is 2.
With 64 GB of memory in the nightly builds, we could definitely be using more cores on EPA CEMS, but we're approaching 400 assets, and it's not deterministic which ones will be running at the same time. Right now the FERC-714 imputations and EPA CEMS are the long poles. I really wish there were some way to chunk the EPA CEMS processing into a smaller memory footprint. The state-year chunks we had before were very manageable and let us do a lot more parallelism.
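The state-year chunking idea can be sketched generically: stream one (state, year) group through processing at a time, so peak memory is bounded by the largest chunk rather than the whole dataset. This is a minimal plain-Python illustration, not the actual CEMS code (all names here are hypothetical):

```python
from itertools import groupby
from operator import itemgetter

def process_in_chunks(records, process_chunk):
    """Process records one (state, year) group at a time.

    Only a single chunk is materialized in memory at once, so peak
    usage is bounded by the largest group, not the full dataset.
    """
    keyfunc = itemgetter("state", "year")
    results = {}
    for key, group in groupby(sorted(records, key=keyfunc), key=keyfunc):
        # list(group) is the only fully materialized chunk at this point.
        results[key] = process_chunk(list(group))
    return results

# Hypothetical usage: sum emissions within each state-year chunk.
records = [
    {"state": "CO", "year": 2020, "co2_tons": 10.0},
    {"state": "CO", "year": 2020, "co2_tons": 5.0},
    {"state": "TX", "year": 2021, "co2_tons": 7.0},
]
totals = process_in_chunks(
    records, lambda chunk: sum(r["co2_tons"] for r in chunk)
)
```

Because each chunk is independent, this shape also makes it easy to fan the chunks out across workers when memory allows.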
I wish there were some way for Dagster to explicitly manage memory usage as well as CPU usage.
Short-term, reducing the concurrency for the internally parallelized work of the FERC-714 asset seems like the simplest fix!
It's definitely simple, though it's possible some other task is also taking up a bunch of RAM and this just happened to get killed both times. Worth a shot while we figure out a better strategy re: memory usage.
I would be tempted to blame the extremely parallelized extraction of spreadsheets, which we've just added a bunch of additional work to, but that happens right at the beginning of the ETL, and this crash isn't happening for 5 hours!
Sounds good! Are there other suspects in the list of tasks that we were running when we got the OOM error?
TBH I am also very confused how the ETL can take 5 hours to get to this point. With 16 CPUs and 64 GB of memory, it's 5 hours into doing something that takes my laptop with 10 CPUs and 32 GB of memory 2 hours tops.
The only other thing in there that I could imagine being unusually resource intensive is the FERC to FERC matching step, but I don't know if that particular merge is a big memory hog.
> TBH I am also very confused how the ETL can take 5 hours to get to this point. With 16 CPUs and 64 GB of memory, it's 5 hours into doing something that takes my laptop with 10 CPUs and 32 GB of memory 2 hours tops.
The ETL returns with a failure code 5 hours in, but if you look at the timestamps we see that the ETL starts at 06:08 UTC:
```
2024-04-03 06:08:39 +0000 - dagster - DEBUG - ferc_to_sqlite_job - 20634ea8-e16f-4b96-b880-c208266b0a27 - 102 - RUN_START - Started execution of run for "ferc_to_sqlite_job".
```
And then the OOM happens an hour in, at 07:11:
```
2024-04-03 07:11:46 +0000 - dagster - DEBUG - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - _out_ferc714__hourly_demand_matrix - ENGINE_EVENT - Multiprocess executor: child process for step _out_ferc714__hourly_demand_matrix was terminated by signal 9 (SIGKILL). This usually indicates that the process was killed by the operating system due to running out of memory. Possible solutions include increasing the amount of memory available to the run, reducing the amount of memory used by the ops in the run, or configuring the executor to run fewer ops concurrently.
```
And then finally the run fails:
```
2024-04-03 11:00:06 +0000 - dagster - ERROR - etl_job - be4a0899-56fa-429f-831a-93cd28c6648b - 1779 - RUN_FAILURE - Execution of run for "etl_job" failed. Steps failed: ['_out_ferc714__hourly_demand_matrix'].
```
Oooh, interesting. So it's just the one asset materialization that fails, and everything else keeps running -- the VM doesn't crash. I wonder if we could just add a retry to this asset?
On second thought, I might be misunderstanding dagster's concurrency limit feature:
> Limits can be specified for all ops/assets with a specific tag key or key-value pair. If any limit is exceeded by launching an op/asset, then the op/asset will be queued.
I think limiting `pudl.etl.epacems_assets.process_single_year` using tags works because we run that function many times; our default limit for CEMS is 2. Does the tag-limiting capability restrict the number of processes available only to ops with that specific tag, or to all ops trying to run in the job? For example, while `pudl.etl.epacems_assets.process_single_year` is running, can other ops be running?
This happened again on 2024-04-04, which is to be expected because #3541 wasn't merged yet. Just noting that down here for posterity.
@jdangerx do you think that #3541 would have any effect? There's already only 1 CPU being used in the asset that's getting killed.
This problem seems to be happening reliably in the nightly builds, but I've still never seen it running the full ETL locally on a machine with half as much memory. If it's just that a couple of high-memory assets happen to be running at the same time, could we use an op retry to keep attempting to materialize it until there's enough free memory for it to complete? Or does the fact that it's getting killed rather than raising an exception mean that won't work? Locally this asset takes ~3 minutes to run and peaks at ~4 GB of memory usage.
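Dagster does support per-op retries natively via `RetryPolicy(max_retries=..., delay=...)` on the asset, though I'm not certain a SIGKILLed child process is treated the same as a raised exception. The underlying retry-with-backoff idea, as a plain-Python sketch (all names here are illustrative):

```python
import time

def retry(fn, max_retries=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on any exception.

    Re-raises the last exception once max_retries is exhausted.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise
            sleep(delay)
            delay *= backoff

# Hypothetical usage: a flaky step that succeeds on the third attempt,
# standing in for an asset that fails while memory is scarce.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise MemoryError("simulated OOM")
    return "ok"

result = retry(flaky, max_retries=3, delay=0.0)
```

Note that a real OOM kill terminates the process rather than raising `MemoryError`, which is why the retry would need to live in the orchestrator (Dagster) rather than inside the step itself.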
I think the ETL fails more often on the VM because GCP VMs do not have swap enabled by default:
> Caution: You must ensure that tmpfs volumes on your instance do not use up all of your instance memory. By default, Compute Engine instances do not have any swap space. If you use all of the available instance memory in a RAM disk, the instance can crash and potentially cause you to lose your data.
I don't think it's trivial to enable it.
Overview
What do you think happened, if anything?
The `_out_ferc714__hourly_demand_matrix` job ran into an OOM kill. From the logs we can see that it started at 07:10:43 UTC and was killed at 07:11:46 UTC.
Here are the VM instance RAM metrics over the 07:00-07:20 timeframe. You can see that when the job started we were already using ~35 GB of RAM, then quickly jumped to 60+ GB before getting killed at 07:12.
We also saw this same issue on 2024-04-02, but didn't make an issue for it.
Next steps
What next steps do we need to do to understand or remediate the issue?
- We could maybe force this asset to run non-concurrently.
- We could maybe reduce the RAM usage of this asset.
Verify that everything is fixed!
Once you've applied any necessary fixes, make sure that the nightly build outputs are all in their right places.
Relevant logs
link to build logs from internal distribution bucket
There are several other concurrent tasks running during these two minutes:
all the dagster steps
```
> ag --no-numbers '07:1[01].*STEP' 2024-04-03-0602-c28abb2be-main-pudl-etl.log | cut -d ' ' -f1 -f2 -f15 -f17
2024-04-03 07:07:11 core_demand_side_management_eia861 STEP_WORKER_STARTED
2024-04-03 07:07:11 core_eia861__yearly_dynamic_pricing STEP_WORKER_STARTING
2024-04-03 07:10:00 extract_phmsagas STEP_WORKER_STARTING
2024-04-03 07:10:00 _core_eia860__ownership STEP_WORKER_STARTED
2024-04-03 07:10:02 _core_eia860__plants STEP_WORKER_STARTED
2024-04-03 07:10:03 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTING
2024-04-03 07:10:03 _core_eia860__ownership STEP_START
2024-04-03 07:10:03 _core_eia860__ownership STEP_INPUT
2024-04-03 07:10:04 _core_eia860__utilities STEP_WORKER_STARTED
2024-04-03 07:10:04 _core_eia860__plants STEP_START
2024-04-03 07:10:05 _core_eia860__plants STEP_INPUT
2024-04-03 07:10:06 _core_eia860__utilities STEP_START
2024-04-03 07:10:07 _core_eia860__utilities STEP_INPUT
2024-04-03 07:10:11 _core_eia860__utilities STEP_OUTPUT
2024-04-03 07:10:17 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTING
2024-04-03 07:10:28 _core_eia860__plants STEP_OUTPUT
2024-04-03 07:10:29 _core_eia860__ownership STEP_OUTPUT
2024-04-03 07:10:29 _core_eia860__ownership STEP_SUCCESS
2024-04-03 07:10:30 extract_phmsagas STEP_WORKER_STARTED
2024-04-03 07:10:31 _core_eia860__plants STEP_SUCCESS
2024-04-03 07:10:36 raw_eia930__balance STEP_WORKER_STARTING
2024-04-03 07:10:37 raw_eia930__interchange STEP_WORKER_STARTING
2024-04-03 07:10:37 extract_phmsagas STEP_START
2024-04-03 07:10:40 _out_ferc714__hourly_demand_matrix STEP_WORKER_STARTED
2024-04-03 07:10:41 extract_phmsagas STEP_INPUT
2024-04-03 07:10:41 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:42 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 _core_eia860__boilers STEP_OUTPUT
2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_START
2024-04-03 07:10:43 _out_ferc714__hourly_demand_matrix STEP_INPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:43 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:44 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:45 extract_phmsagas STEP_OUTPUT
2024-04-03 07:10:47 raw_eia930__subregion STEP_WORKER_STARTING
2024-04-03 07:10:50 _out_ferc1__yearly_steam_plants_sched402_with_plant_ids.ferc_to_ferc.merge_steam_fuel_dfs STEP_WORKER_STARTED
2024-04-03 07:10:51 core_eia861__yearly_balancing_authority STEP_WORKER_STARTING
2024-04-03 07:10:57 raw_eia930__balance STEP_WORKER_STARTED
2024-04-03 07:10:57 raw_eia930__interchange STEP_WORKER_STARTED
2024-04-03 07:10:59 raw_eia930__balance STEP_START
2024-04-03 07:10:59 raw_eia930__interchange STEP_START
2024-04-03 07:11:00 _core_eia923__boiler_fuel STEP_WORKER_STARTING
2024-04-03 07:11:05 raw_eia930__subregion STEP_WORKER_STARTED
2024-04-03 07:11:08 raw_eia930__subregion STEP_START
2024-04-03 07:11:11 core_eia861__yearly_balancing_authority STEP_WORKER_STARTED
2024-04-03 07:11:18 core_eia861__yearly_balancing_authority STEP_START
2024-04-03 07:11:19 core_eia861__yearly_balancing_authority STEP_INPUT
2024-04-03 07:11:22 _core_eia923__boiler_fuel STEP_WORKER_STARTED
2024-04-03 07:11:23 _core_eia923__coalmine STEP_WORKER_STARTING
2024-04-03 07:11:24 _core_eia923__boiler_fuel STEP_START
2024-04-03 07:11:26 _core_eia923__boiler_fuel STEP_INPUT
2024-04-03 07:11:46 _out_ferc714__hourly_demand_matrix STEP_FAILURE
2024-04-03 07:11:46 _core_eia923__cooling_system_information STEP_WORKER_STARTING
2024-04-03 07:11:52 raw_eia930__interchange STEP_INPUT
2024-04-03 07:11:52 raw_eia930__interchange STEP_OUTPUT
2024-04-03 07:11:52 _core_eia923__boiler_fuel STEP_OUTPUT
2024-04-03 07:11:53 _core_eia923__coalmine STEP_WORKER_STARTED
2024-04-03 07:11:53 raw_eia930__balance STEP_INPUT
2024-04-03 07:11:53 raw_eia930__balance STEP_OUTPUT
2024-04-03 07:11:55 _core_eia923__coalmine STEP_START
2024-04-03 07:11:57 raw_eia930__subregion STEP_INPUT
2024-04-03 07:11:57 raw_eia930__subregion STEP_OUTPUT
2024-04-03 07:11:57 raw_eia930__subregion STEP_SUCCESS
2024-04-03 07:11:57 _core_eia923__fgd_operation_maintenance STEP_WORKER_STARTING
2024-04-03 07:11:58 _core_eia923__coalmine STEP_INPUT
```