Closed forsyth2 closed 1 year ago
@chengzhuzhang Is there any reason it would be taking so long to run E3SM Diags?
@forsyth2 When I tested on perlmutter and compy, I didn't see long run times. compy is about 10% slower than perlmutter. I didn't test on chrysalis. How about your tests on the other two machines? Do they also take significantly longer?
@chengzhuzhang E3SM Diags runs in the expected amount of time on both Perlmutter and Compy.
Could this just be because Chrysalis is being hit pretty hard right now and the disk is slow?
@xylar I suppose that's a possibility. It would be good if it's nothing on our end. Unfortunately though, it is delaying testing since I need to get E3SM Diags results to run the tests on in the first place...
It looks like the E3SM Diags tasks completed successfully when given between 4 and 5 hours to run.
Unfortunately, I can reproduce this problem with the complete zppy test. When setting a 2-hour time limit for e3sm_diags runs, jobs were cancelled due to the time limit.
Do you have individual timings for the different steps in e3sm_diags? If so, can you compare a run with rc9 or rc10 to rc12 to see what changed?
I did a set of e3sm_diags runs comparing timing between rc9 and rc12.
rc9:
real 7m33.804s
user 45m46.510s
sys 1m23.407s
rc12:
real 19m48.713s
user 96m8.258s
sys 10m17.319s
I don't see an obvious problem, but there is a [WARNING] yaksa: 10 leaked handle pool objects shown in the rc12 run, which is also reported in https://github.com/xCDAT/xcdat/issues/513 and is thought to come from esmf>=8.4.
I will try smaller tests for timing. Even using just a minimal example:
time e3sm_diags lat_lon --no_viewer --reference_data_path '//lcrc/group/e3sm/public_html/diagnostics/observations/Atm/climatology' --test_data_path '/lcrc/group/e3sm/public_html/e3sm_diags_test_data/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/climatology/rgr' --results_dir '/lcrc/group/e3sm/public_html/diagnostic_output/ac.zhang40/tests/eu190rc9' --case_id 'GPCP_v3.2' --run_type 'model_vs_obs' --sets 'lat_lon' --variables 'PRECT' --seasons 'ANN' --ref_name 'GPCP_v3.2'
rc9:
real 0m2.348s
user 0m1.775s
sys 0m0.485s
rc12:
[WARNING] yaksa: 10 leaked handle pool objects
real 0m15.513s
user 0m5.678s
sys 0m14.101s
@chengzhuzhang, this suggests a pretty big change in behavior in ESMPy. Do you have a way to time just the ESMF regridding? (Is that what you already did?)
I don't think going back to ESMF 8.2.0 is a good option because that just isn't built with the dependencies that a lot of our packages rely on. I don't even think going back to 8.3.1 is an option for the same reason.
It's odd that this is only happening on Chrysalis. I suppose it could be an incompatibility of something in ESMPy with Chrysalis specifically but I can't think of what that would be.
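In case it helps, one generic way to time just one call (e.g., the regridding) without modifying the library is to wrap the function and accumulate wall-clock time. A minimal sketch, where `fake_regrid` is a stand-in and not the real ESMPy API:

```python
import functools
import time

def timed(fn, acc):
    """Wrap fn so each call adds its wall-clock duration to acc[0]."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            acc[0] += time.perf_counter() - start
    return wrapper

total = [0.0]

def fake_regrid(data):
    # stand-in for the real regridding call
    time.sleep(0.01)
    return data

fake_regrid = timed(fake_regrid, total)
result = fake_regrid("field")
print(f"time spent in regrid: {total[0]:.3f} s")
```

The same wrapper could be applied to the actual regrid entry point once it is identified, to isolate its share of the total runtime.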
@xylar I will try to find a way to time just ESMF regridding.
I was also curious why this only happens on Chrysalis, and went ahead and tested on Perlmutter. With rc12, the complete e3sm_diags run failed prematurely with:
2023-08-20 15:37:48,537 [ERROR]: run.py(run_diags:36) >> Error traceback:
Traceback (most recent call last):
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/run.py", line 34, in run_diags
main(final_params)
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 419, in main
parameters_results = _run_with_dask(parameters)
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 365, in _run_with_dask
results = bag.map(run_diag).compute(num_workers=num_workers)
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 310, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 595, in compute
results = schedule(dsk, keys, **kwargs)
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/multiprocessing.py", line 233, in get
result = get_async(
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/local.py", line 500, in get_async
for key, res_info, failed in queue_get(queue).result():
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2023-08-20 15:37:48,592 [INFO]: logger.py(move_log_to_prov_dir:106) >> Log file saved in /global/cfs/cdirs/e3sm/www/chengzhu//v2_9_0_all_sets/eu190rc12/prov/e3sm_diags_run.log
real 6m39.466s
user 72m18.622s
sys 13m8.988s
But testing the same run in the e3sm_diags rc3 standalone environment was fine.
I also need to confirm with @forsyth2 about the zppy test results on pm-cpu.
My guess would be that there's some sort of bad interaction going on between dask and esmpy to do with multiprocessing, but I don't really have the time (at the moment) or the expertise to help debug it. Is there anyone else on the team we can turn to?
@chengzhuzhang Thank you for benchmarking the runtime issue.
concurrent.futures.process.BrokenProcessPool
This is happening in rc12 on Perlmutter? I thought this was an rc10 issue (https://github.com/E3SM-Project/zppy/pull/475#issuecomment-1670412094)
I also need to confirm with @forsyth2 for the zppy test results on pm-cpu.
I don't have any issues with E3SM Diags on pm-cpu. The only issue I have on Perlmutter at the moment is https://github.com/E3SM-Project/zppy/issues/484, which concerns the global time series.
Thanks for confirming. Could you update permissions for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/? I want to rerun the zppy complete run on pm-cpu.
Yes, this is happening in rc12 on Perlmutter with my e3sm_diags testing.
Could you update permission for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/, I wanted to rerun the zppy complete run on pm-cpu.
Yes, that should be updated now. See points 3,9 on https://github.com/E3SM-Project/zppy/issues/484#issuecomment-1687182581
That is very strange. I'm not sure what's happening.
I still ran into a permission error. I think changing the owner to e3sm recursively for the directory should work.
Ok I just ran chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/.
Somehow I still can't access it. I think chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth/E3SMv2 should do it?
@chengzhuzhang Were you able to access this? I ran that command.
I am able to get everything working on Chrysalis given sufficient time, so I don't think this is a zppy issue. And E3SM Diags does in fact still appear to work; it just takes longer.
Yes, I have access to the data and updated my test results on the pm-cpu issue.
Closing this issue, in favor of https://github.com/E3SM-Project/e3sm_diags/issues/720
Re-opening this issue: https://github.com/E3SM-Project/e3sm_diags/issues/720 is resolved with e3sm_diags rc4. However, in my zppy complete test with unified rc14, the e3sm_diags run still exceeds the 2-hour limit.
@chengzhuzhang, can you add a bunch of timers to e3sm_diags so we can find out where the slow performance is? I suspect the remapping but it could be dask. We need to narrow it down.
From Unified RC14 testing of zppy on Chrysalis:
$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ tail -n 3 e3sm_diags_atm_monthly_180x360_aave_*.o*
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o379989 <==
==============================================
Elapsed time: 47 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.o379991 <==
==============================================
Elapsed time: 48 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.o379990 <==
==============================================
Elapsed time: 47 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o379986 <==
==============================================
Elapsed time: 14658 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o379988 <==
==============================================
Elapsed time: 14688 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o379987 <==
==============================================
Elapsed time: 14662 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o379993 <==
==============================================
Elapsed time: 4130 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o379992 <==
==============================================
Elapsed time: 50 seconds
==============================================
Times range from 47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)
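For the record, the range above can be pulled straight out of the tail output rather than by hand. A small sketch, assuming the `Elapsed time: N seconds` lines shown above:

```python
import re

# a few of the tail lines from the job logs above
tail_output = """
Elapsed time: 47 seconds
Elapsed time: 14688 seconds
Elapsed time: 4130 seconds
"""
secs = [int(s) for s in re.findall(r"Elapsed time: (\d+) seconds", tail_output)]
print(f"range: {min(secs)} to {max(secs)} seconds ({max(secs) / 3600:.2f} hours)")
# → range: 47 to 14688 seconds (4.08 hours)
```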
@xylar yes, I plan to do some timing profiling. Right now I'm running the e3sm_diags bash script taken from zppy, but in the e3sm_diags standalone conda environment. If the timings are similar to running in unified rc14, then we can use the e3sm_diags env for profiling. Otherwise, we may need a spack env on Chrysalis like the one you created for Perlmutter to do the profiling. I will keep you updated.
@chengzhuzhang, there is always a spack environment on each machine to correspond with each conda environment in E3SM-Unified. Just take a look at the contents of the load script and you can find the 2 commands for activating the spack environment.
@xylar thanks for the instruction. I parsed the time stamps from the e3sm_diags logs, comparing the time taken for zppy e3sm_diags runs between the e3sm_diags conda env and the unified rc14 env. It is obvious that with unified rc14, the processes run much slower. Using the e3sm_diags conda env, it completed within 60 minutes.
I will find the spack command for performance profiling.
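For anyone repeating this comparison, the elapsed time can be computed by differencing log timestamps. A minimal sketch, assuming the log format shown elsewhere in this thread (e.g. `2023-08-20 15:37:48,537 [INFO]: ...`); the two example lines are made up:

```python
from datetime import datetime

def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")

first = "2023-08-20 15:30:00,100 [INFO]: first diagnostic started"
last = "2023-08-20 15:37:48,537 [INFO]: last diagnostic finished"
elapsed = (parse_ts(last) - parse_ts(first)).total_seconds()
print(f"elapsed: {elapsed:.1f} s")  # → elapsed: 468.4 s
```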
I'm using cProfile to look at smaller runs on a Chrysalis compute node. Running in unified rc14 does show longer timings for many types of processes (e.g. saving provenance). I also ran into intermittent MPI errors when doing profiling tests, which I think is worth noting as well.
(e3sm_unified_1.9.0rc14_chrysalis) [ac.zhang40@chr-0485 e3sm_diags_parameter]$ python -m cProfile run_lat_lon.py -d lat_lon_ANN.cfg > timing_eurc14_ANN_2
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 14
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[chr-0485:1617174] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
I'm uploading the cProfile results: with tests in the e3sm_diags conda env and the unified rc14 env.
e3sm_diags conda env cProfile results sorted by cumulative time:
11900861 function calls (11594276 primitive calls) in 38.465 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
153 0.005 0.000 43.071 0.282 __init__.py:1(<module>)
1366/1 0.014 0.000 38.469 38.469 {built-in method builtins.exec}
1 0.000 0.000 38.469 38.469 run_lat_lon.py:1(<module>)
1 0.000 0.000 26.105 26.105 run.py:23(run_diags)
1 0.000 0.000 26.096 26.096 e3sm_diags_driver.py:400(main)
1378/14 0.006 0.000 13.753 0.982 <frozen importlib._bootstrap>:1022(_find_and_load)
1351/14 0.005 0.000 13.752 0.982 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
1285/15 0.005 0.000 13.742 0.916 <frozen importlib._bootstrap>:664(_load_unlocked)
1182/15 0.003 0.000 13.741 0.916 <frozen importlib._bootstrap_external>:877(exec_module)
1696/14 0.001 0.000 13.701 0.979 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
744/33 0.001 0.000 13.451 0.408 {built-in method builtins.__import__}
1781/524 0.003 0.000 12.659 0.024 <frozen importlib._bootstrap>:1053(_handle_fromlist)
1 0.000 0.000 10.094 10.094 e3sm_diags_driver.py:137(save_provenance)
1 0.000 0.000 10.089 10.089 e3sm_diags_driver.py:44(_save_env_yml)
1 0.000 0.000 10.039 10.039 subprocess.py:1110(communicate)
1 0.000 0.000 10.039 10.039 subprocess.py:1968(_communicate)
2 0.000 0.000 10.038 5.019 selectors.py:403(select)
2 10.038 5.019 10.038 5.019 {method 'poll' of 'select.poll' objects}
1 0.000 0.000 9.620 9.620 main.py:114(create_viewer)
1 0.000 0.000 9.609 9.609 default_viewer.py:44(create_viewer)
1 0.001 0.001 8.708 8.708 lat_lon_viewer.py:372(generate_lat_lon_cmip6_comparison)
unified rc14 cProfile results sorted by cumulative time:
12416679 function calls (12058492 primitive calls) in 53.285 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1339/1 0.013 0.000 53.289 53.289 {built-in method builtins.exec}
1 0.000 0.000 53.289 53.289 run_lat_lon.py:1(<module>)
152 0.004 0.000 51.361 0.338 __init__.py:1(<module>)
1 0.000 0.000 38.861 38.861 run.py:23(run_diags)
1 0.000 0.000 38.851 38.851 e3sm_diags_driver.py:400(main)
1 0.000 0.000 22.114 22.114 e3sm_diags_driver.py:137(save_provenance)
1 0.000 0.000 22.091 22.091 e3sm_diags_driver.py:44(_save_env_yml)
1 0.000 0.000 22.055 22.055 subprocess.py:1110(communicate)
1 0.000 0.000 22.055 22.055 subprocess.py:1968(_communicate)
2 0.000 0.000 22.055 11.027 selectors.py:403(select)
2 22.055 11.027 22.055 11.027 {method 'poll' of 'select.poll' objects}
1366/14 0.006 0.000 15.746 1.125 <frozen importlib._bootstrap>:1022(_find_and_load)
1338/14 0.005 0.000 15.746 1.125 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
1275/15 0.005 0.000 15.737 1.049 <frozen importlib._bootstrap>:664(_load_unlocked)
1172/15 0.003 0.000 15.736 1.049 <frozen importlib._bootstrap_external>:877(exec_module)
1691/14 0.001 0.000 15.690 1.121 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
747/33 0.001 0.000 15.438 0.468 {built-in method builtins.__import__}
1730/522 0.003 0.000 14.639 0.028 <frozen importlib._bootstrap>:1053(_handle_fromlist)
2 0.000 0.000 10.936 5.468 dataset.py:1(<module>)
1160 0.010 0.000 9.832 0.008 <frozen importlib._bootstrap_external>:950(get_code)
1 0.000 0.000 9.784 9.784 main.py:114(create_viewer)
1 0.000 0.000 9.773 9.773 default_viewer.py:44(create_viewer)
I'm not very experienced in interpreting cProfile results. The full results are uploaded. Maybe @tomvothecoder, @xylar and @forsyth2 can help take a look as well. I'm also making a larger run to see if we can draw more conclusive results from those. timing_edvrc4_ANN_3.txt timing_eurc14_ANN_3.txt
@chengzhuzhang I think the cProfile results should be sorted by tottime (total time) in descending order. UPDATE: Sorting by cumulative time helps too for knowing the order of the call stack.
I don't have experience with cProfile (yet), and it seems a bit hard to parse where the lower-level library calls are being made from (e.g., select.poll). A few log lines that stand out in the unified rc14 log include:
ncalls tottime percall cumtime percall filename:lineno(function)
2 22.055 11.027 22.055 11.027 {method 'poll' of 'select.poll' objects}
1292 9.014 0.007 9.014 0.007 {method 'read' of '_io.BufferedReader' objects}
13 6.772 0.521 6.772 0.521 {method 'acquire' of '_thread.lock' objects}
86/83 2.592 0.030 2.603 0.031 {built-in method _imp.create_dynamic}
These calls are I/O and thread related. I can see how scaling up the data size might multiply the runtime for these processes.
Python Profiling: What does "method 'poll' of 'select.poll' objects"?
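As a side note, a cProfile run can be re-sorted after the fact with pstats, so the same profile can be inspected by both tottime and cumulative time. A minimal sketch; `work` is a toy stand-in for the real workload:

```python
import cProfile
import io
import pstats

def work():
    # toy workload standing in for an e3sm_diags run
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(5)      # top 5 by internal time
stats.sort_stats("cumulative").print_stats(5)   # top 5 by cumulative time
print(stream.getvalue())
```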
@tomvothecoder thank you for inspecting the cProfile results!
However, after more investigation, I tend to believe the problem may not be from something internal to e3sm_diags. I did the same profiling for much larger tasks, i.e. the whole lat-lon set and all seasons. The timings are actually very similar, around 2880 seconds, between the e3sm_diags conda env and unified rc14. Note that I only tested by running interactively on a compute node. When I later tried to time the zppy e3sm_diags sets one at a time by submitting slurm jobs, the slowdown did occur, i.e. the lat-lon set won't finish in 2 hours. Together with the MPI error reported here, this leads me to think the slowdown is related to how Slurm and MPI are built...
@xylar I'm totally unfamiliar with the build process for SLURM and MPI... But I did some searching; I'm not sure if the solution in this link makes sense: https://lists.schedmd.com/pipermail/slurm-users/2019-March/003018.html
In the meantime, I have run out of ideas for more testing...
Sorry if there's some confusion but I'm not using Spack to build MPI itself. I am using the same MPI module that E3SM uses for Gnu compilers on each machine. If there is MPI trouble, I suspect it is something that is built with conda MPI trying to run using HPC MPI libraries.
Please track down what package or script is being run with MPI so we can see if it has been built with system MPI (via Slurm) or if it comes from conda forge, in which case it can't be run with MPI on our systems. This is the reason that I have to build all our MPI tools and libraries with Spack.
Are you seeing the MPI errors only with cProfile or also with normal runs? I could imagine cProfile itself being hard to use properly in this mixed spack and conda environment.
I had in mind to just add calls within e3sm_diags to get the current time at 2 spots in the code and take the difference. The old-fashioned and tedious but also very robust method for timing a code.
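That two-timestamps approach fits in a few lines as a context manager, which makes it easy to drop around suspect sections of the code. A sketch; the "regridding" label is just an example:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timer(label):
    """Record the wall-clock time spent in the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

with timer("regridding"):
    time.sleep(0.05)  # stand-in for the section being timed

print(f"regridding took {timings['regridding']:.3f} s")
```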
@mahf708 encountered the same issue on Chrysalis; copying his Slack message here: "I face a specific issue on a chrysalis compute node with e3sm_diags within e3sm-unified (latest RC) above: Triggering some e3sm_diags like below returns an elaborate error. Not sure what it means really. I resolve it with simply quitting the node and starting over. This suggests that some python processes are holding on to the locks or something?"
Actually I used the same solution to get by the problem. Restart a node from fresh would work fine.
In my testing, the MPI error occurred both with and without cProfile. I guess I can look into the old-fashioned timing method. Right now the e3sm_diags run log contains timestamps, but with multiprocessing it is not possible to isolate the time spent in each process.
Copying my Slack discussion with Jill
Adding the following code at the top of e3sm_diags_driver.py should make Dask run sequentially for debugging and profiling.
import dask
dask.config.set(scheduler='synchronous') # overwrite default with single-threaded scheduler
We can look at the timestamps in the logs to see the runtime. I'm hoping running Dask sequentially mimics the same slowness if it calls the same multiprocessing modules, just on one thread/process at a time.
More info on this setting:
The single-threaded synchronous scheduler executes all computations in the local thread with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes.
For example, when using IPython or Jupyter notebooks, the %debug, %pdb, or %prun magics will not work well when using the parallel Dask schedulers (they were not designed to be used in a parallel computing context). However, if you run into an exception and want to step into the debugger, you may wish to rerun your computation under the single-threaded scheduler where these tools will function properly.
https://docs.dask.org/en/stable/scheduling.html#single-thread
@chengzhuzhang, in case it's helpful, I opened /lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc14_chrysalis/lib/python3.10/site-packages/e3sm_diags up for group write permission.
I tried running via zppy with num_workers=4 (rather than the default of 24), as @golaz suggested (reasoning that the overhead of the workers may be causing an issue). Unfortunately, it looks like it's still going beyond the original 2-hour limit.
Follow-up on https://github.com/E3SM-Project/zppy/issues/485#issuecomment-1699575612 and https://github.com/E3SM-Project/zppy/issues/485#issuecomment-1703391354
With 4 workers:
tail -n 3 e3sm_diags_atm_monthly_180x360_aave_*.o*
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o381747 <==
==============================================
Elapsed time: 49 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.o381749 <==
==============================================
Elapsed time: 49 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.o381748 <==
==============================================
Elapsed time: 49 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o381744 <==
==============================================
Elapsed time: 14235 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o381746 <==
==============================================
Elapsed time: 14287 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o381745 <==
==============================================
Elapsed time: 14303 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o381751 <==
==============================================
Elapsed time: 4186 seconds
==============================================
==> e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o381750 <==
==============================================
Elapsed time: 49 seconds
==============================================
So, 49 seconds to 14,303 seconds (= 238.4 minutes = 3.97 hours). Compared to the earlier range of "47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)", that's only a saving of ~6.4 minutes on the high end.
@forsyth2 thank you for the data points. I'm also updating some of my initial investigation here. In my case, I only tested "lat_lon" because it can show the slowdown alone. I use sbatch to submit the jobs to slurm.

| | e3sm_diags rc4 env | e3sm unified rc14 env |
| -- | -- | -- |
| `scheduler='synchronous'` for Dask single-threaded scheduler | 166 mins | 166 mins |
| `multiprocessing` for Dask with 24 workers | 38 mins | 126 mins |
Even after setting walltime = "4:00:00" for the [[ atm_monthly_180x360_aave ]] subtask in tests/integration/generated/test_complete_run_chrysalis.cfg, the following jobs still run out of time. This is a change from walltime = "2:00:00" in https://github.com/E3SM-Project/zppy/blob/main/tests/integration/generated/test_complete_run_chrysalis.cfg#L93. It also took significantly longer to get compute nodes with this extended time limit.