E3SM-Project / zppy

E3SM post-processing toolchain
BSD 3-Clause "New" or "Revised" License

Unified rc12 and rc14 testing: `e3sm_diags` takes too long on Chrysalis #485

Closed (forsyth2 closed this issue 1 year ago)

forsyth2 commented 1 year ago

Even after setting walltime = "4:00:00" for the [[ atm_monthly_180x360_aave ]] subtask in tests/integration/generated/test_complete_run_chrysalis.cfg, the following jobs still run out of time:

e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853

This is a change from walltime = "2:00:00" in https://github.com/E3SM-Project/zppy/blob/main/tests/integration/generated/test_complete_run_chrysalis.cfg#L93
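
For reference, the override amounts to something like the following in the generated cfg (a sketch only; the placement under the [e3sm_diags] section and the surrounding settings are assumed from the linked file, which is otherwise omitted here):

[e3sm_diags]
  [[ atm_monthly_180x360_aave ]]
  # other subtask settings omitted; extended from the default "2:00:00" while debugging the slowdown
  walltime = "4:00:00"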

It also took significantly longer to get compute nodes with this extended time limit.

forsyth2 commented 1 year ago

@chengzhuzhang Is there any reason it would be taking so long to run E3SM Diags?

chengzhuzhang commented 1 year ago

@forsyth2 When I tested on Perlmutter and Compy, I didn't see long run times; Compy is about 10% longer than Perlmutter. I didn't test on Chrysalis. How about your tests on the other two machines? Do they also take significantly longer?

forsyth2 commented 1 year ago

@chengzhuzhang E3SM Diags runs in the expected amount of time on both Perlmutter and Compy.

xylar commented 1 year ago

Could this just be because Chrysalis is being hit pretty hard right now and the disk is slow?

forsyth2 commented 1 year ago

@xylar I suppose that's a possibility. It would be good if it's nothing on our end. Unfortunately though, it is delaying testing since I need to get E3SM Diags results to run the tests on in the first place...

forsyth2 commented 1 year ago

It looks like the E3SM Diags tasks completed successfully when given between 4 and 5 hours to run.

chengzhuzhang commented 1 year ago

Unfortunately, I can reproduce this problem with the complete zppy test. With a 2-hour time limit for the e3sm_diags runs, the jobs were cancelled due to the time limit.

xylar commented 1 year ago

Do you have individual timings for different steps in e3sm_diags? If so, can you compare a run with rc9 or rc10 to rc12 to see what changed?

chengzhuzhang commented 1 year ago

I did a set of e3sm_diags runs comparing timing between rc9 and rc12.

rc9: real 7m33.804s, user 45m46.510s, sys 1m23.407s
rc12: real 19m48.713s, user 96m8.258s, sys 10m17.319s

I don't see an obvious problem, but there is a [WARNING] yaksa: 10 leaked handle pool objects message in the rc12 run, which is also reported in https://github.com/xCDAT/xcdat/issues/513 and is thought to come from esmf>=8.4.

I will try smaller tests for timing.

chengzhuzhang commented 1 year ago

Even using just a minimal example:

time e3sm_diags lat_lon --no_viewer \
    --reference_data_path '//lcrc/group/e3sm/public_html/diagnostics/observations/Atm/climatology' \
    --test_data_path '/lcrc/group/e3sm/public_html/e3sm_diags_test_data/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/climatology/rgr' \
    --results_dir '/lcrc/group/e3sm/public_html/diagnostic_output/ac.zhang40/tests/eu190rc9' \
    --case_id 'GPCP_v3.2' --run_type 'model_vs_obs' --sets 'lat_lon' \
    --variables 'PRECT' --seasons 'ANN' --ref_name 'GPCP_v3.2'

rc9: real 0m2.348s, user 0m1.775s, sys 0m0.485s

rc12: [WARNING] yaksa: 10 leaked handle pool objects
real 0m15.513s, user 0m5.678s, sys 0m14.101s

xylar commented 1 year ago

@chengzhuzhang, this suggests a pretty big change in behavior in ESMPy. Do you have a way to time just the ESMF regridding? (Is that what you already did?)

I don't think going back to ESMF 8.2.0 is a good option because that just isn't built with the dependencies that a lot of our packages rely on. I don't even think going back to 8.3.1 is an option for the same reason.

It's odd that this is only happening on Chrysalis. I suppose it could be an incompatibility of something in ESMPy with Chrysalis specifically but I can't think of what that would be.

chengzhuzhang commented 1 year ago

@xylar I will try to find a way to time just ESMF regridding.

I was also curious why this only happens on Chrysalis, so I went ahead and tested on Perlmutter. With rc12, the complete e3sm_diags run failed prematurely with:

2023-08-20 15:37:48,537 [ERROR]: run.py(run_diags:36) >> Error traceback:
Traceback (most recent call last):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/run.py", line 34, in run_diags
    main(final_params)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 419, in main
    parameters_results = _run_with_dask(parameters)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 365, in _run_with_dask
    results = bag.map(run_diag).compute(num_workers=num_workers)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 310, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 595, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/multiprocessing.py", line 233, in get
    result = get_async(
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/local.py", line 500, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
2023-08-20 15:37:48,592 [INFO]: logger.py(move_log_to_prov_dir:106) >> Log file saved in /global/cfs/cdirs/e3sm/www/chengzhu//v2_9_0_all_sets/eu190rc12/prov/e3sm_diags_run.log

real    6m39.466s
user    72m18.622s
sys 13m8.988s

But testing the same run in the e3sm_diags rc3 standalone environment was fine.

I also need to confirm with @forsyth2 for the zppy test results on pm-cpu.

xylar commented 1 year ago

My guess would be that there's some sort of bad interaction going on between dask and esmpy to do with multiprocessing, but I don't really have the time (at the moment) or the expertise to help debug it. Is there anyone else on the team we can turn to?

forsyth2 commented 1 year ago

@chengzhuzhang Thank you for benchmarking the runtime issue.

concurrent.futures.process.BrokenProcessPool

This is happening in rc12 on Perlmutter? I thought this was a rc10 issue (https://github.com/E3SM-Project/zppy/pull/475#issuecomment-1670412094)

I also need to confirm with @forsyth2 for the zppy test results on pm-cpu.

I don't have any issues with E3SM Diags on pm-cpu. The only issue I have on Perlmutter at the moment is https://github.com/E3SM-Project/zppy/issues/484, which concerns the global time series.

chengzhuzhang commented 1 year ago

I also need to confirm with @forsyth2 for the zppy test results on pm-cpu.

I don't have any issues with E3SM Diags on pm-cpu. The only issue I have on Perlmutter at the moment is #484, which concerns the global time series.

Thanks for confirming. Could you update permissions for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/? I want to rerun the zppy complete run on pm-cpu.

chengzhuzhang commented 1 year ago

@chengzhuzhang Thank you for benchmarking the runtime issue.

concurrent.futures.process.BrokenProcessPool

This is happening in rc12 on Perlmutter? I thought this was a rc10 issue (#475 (comment))

Yes this is happening in rc12 on Perlmutter with my e3sm_diags testing.

forsyth2 commented 1 year ago

Could you update permissions for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/? I want to rerun the zppy complete run on pm-cpu.

Yes, that should be updated now. See points 3,9 on https://github.com/E3SM-Project/zppy/issues/484#issuecomment-1687182581

Yes this is happening in rc12 on Perlmutter with my e3sm_diags testing.

That is very strange. I'm not sure what's happening.

chengzhuzhang commented 1 year ago

Could you update permissions for /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/? I want to rerun the zppy complete run on pm-cpu.

Yes, that should be updated now. See points 3,9 on #484 (comment)

I still ran into a permission error. I think recursively changing the group to e3sm for the directory should work.

forsyth2 commented 1 year ago

Ok I just ran chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/.

chengzhuzhang commented 1 year ago

Ok I just ran chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth//E3SMv2/v2.LR.historical_0201/.

Somehow I still can't access it. I think chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth/E3SMv2 should do it?

forsyth2 commented 1 year ago

@chengzhuzhang Were you able to access this? I ran that command.

forsyth2 commented 1 year ago

I am able to get everything working on Chrysalis given sufficient time, so I don't think this is a zppy issue. And E3SM Diags does in fact still appear to work. It just takes longer.

chengzhuzhang commented 1 year ago

@chengzhuzhang Were you able to access this? I ran that command.

Yes, I have access to the data and have updated my test results on the pm-cpu issue.

forsyth2 commented 1 year ago

Closing this issue in favor of https://github.com/E3SM-Project/e3sm_diags/issues/720

chengzhuzhang commented 1 year ago

Re-opening this issue: https://github.com/E3SM-Project/e3sm_diags/issues/720 is resolved with e3sm_diags rc4. However, in my zppy complete test with unified rc14, the e3sm_diags run still exceeds the 2-hour limit.

xylar commented 1 year ago

@chengzhuzhang, can you add a bunch of timers to e3sm_diags so we can find out where the slow performance is? I suspect the remapping but it could be dask. We need to narrow it down.

forsyth2 commented 1 year ago

From Unified RC14 testing of zppy on Chrysalis:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ tail -n 3 e3sm_diags_atm_monthly_180x360_aave_*.o*
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o379989 <==
==============================================
Elapsed time: 47 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.o379991 <==
==============================================
Elapsed time: 48 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.o379990 <==
==============================================
Elapsed time: 47 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o379986 <==
==============================================
Elapsed time: 14658 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o379988 <==
==============================================
Elapsed time: 14688 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o379987 <==
==============================================
Elapsed time: 14662 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o379993 <==
==============================================
Elapsed time: 4130 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o379992 <==
==============================================
Elapsed time: 50 seconds
==============================================

Times range from 47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)

chengzhuzhang commented 1 year ago

@xylar Yes, I plan to do some timing profiling. Right now I'm running the e3sm_diags bash script taken from zppy, but in the e3sm_diags standalone conda environment. If the timings are similar to running in unified rc14, then we can use the e3sm_diags env for profiling. Otherwise, we may need a spack env on Chrysalis, like the one you created for Perlmutter, to do the profiling. I will keep you updated.

xylar commented 1 year ago

@chengzhuzhang, there is always a spack environment on each machine to correspond with each conda environment in E3SM-Unified. Just take a look at the contents of the load script and you can find the 2 commands for activating the spack environment.

chengzhuzhang commented 1 year ago

@xylar Thanks for the instructions. I parsed the time stamps from the e3sm_diags logs, comparing the time taken for the zppy e3sm_diags runs between the e3sm_diags conda env and the unified rc14 env. It is obvious that with unified rc14 the processes run much slower; with the e3sm_diags conda env, the run completed within 60 minutes.

I will find the spack command for performance profiling.

chengzhuzhang commented 1 year ago

I'm using cProfile to look at smaller runs on a Chrysalis compute node. Running in unified rc14 does show longer timings for many types of processes (e.g. saving provenance). I also ran into intermittent MPI errors when doing the profiling tests, which I think is worth noting as well.

(e3sm_unified_1.9.0rc14_chrysalis) [ac.zhang40@chr-0485 e3sm_diags_parameter]$ python -m cProfile run_lat_lon.py -d lat_lon_ANN.cfg > timing_eurc14_ANN_2
--------------------------------------------------------------------------
PMI2_Init failed to intialize.  Return code: 14
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[chr-0485:1617174] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

chengzhuzhang commented 1 year ago

I'm uploading the cProfile results, with tests in the e3sm_diags conda env and the unified rc14 env. The e3sm_diags conda env cProfile results, sorted by cumulative time:

         11900861 function calls (11594276 primitive calls) in 38.465 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      153    0.005    0.000   43.071    0.282 __init__.py:1(<module>)
   1366/1    0.014    0.000   38.469   38.469 {built-in method builtins.exec}
        1    0.000    0.000   38.469   38.469 run_lat_lon.py:1(<module>)
        1    0.000    0.000   26.105   26.105 run.py:23(run_diags)
        1    0.000    0.000   26.096   26.096 e3sm_diags_driver.py:400(main)
  1378/14    0.006    0.000   13.753    0.982 <frozen importlib._bootstrap>:1022(_find_and_load)
  1351/14    0.005    0.000   13.752    0.982 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
  1285/15    0.005    0.000   13.742    0.916 <frozen importlib._bootstrap>:664(_load_unlocked)
  1182/15    0.003    0.000   13.741    0.916 <frozen importlib._bootstrap_external>:877(exec_module)
  1696/14    0.001    0.000   13.701    0.979 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
   744/33    0.001    0.000   13.451    0.408 {built-in method builtins.__import__}
 1781/524    0.003    0.000   12.659    0.024 <frozen importlib._bootstrap>:1053(_handle_fromlist)
        1    0.000    0.000   10.094   10.094 e3sm_diags_driver.py:137(save_provenance)
        1    0.000    0.000   10.089   10.089 e3sm_diags_driver.py:44(_save_env_yml)
        1    0.000    0.000   10.039   10.039 subprocess.py:1110(communicate)
        1    0.000    0.000   10.039   10.039 subprocess.py:1968(_communicate)
        2    0.000    0.000   10.038    5.019 selectors.py:403(select)
        2   10.038    5.019   10.038    5.019 {method 'poll' of 'select.poll' objects}
        1    0.000    0.000    9.620    9.620 main.py:114(create_viewer)
        1    0.000    0.000    9.609    9.609 default_viewer.py:44(create_viewer)
        1    0.001    0.001    8.708    8.708 lat_lon_viewer.py:372(generate_lat_lon_cmip6_comparison)

The unified rc14 cProfile results, sorted by cumulative time:

         12416679 function calls (12058492 primitive calls) in 53.285 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1339/1    0.013    0.000   53.289   53.289 {built-in method builtins.exec}
        1    0.000    0.000   53.289   53.289 run_lat_lon.py:1(<module>)
      152    0.004    0.000   51.361    0.338 __init__.py:1(<module>)
        1    0.000    0.000   38.861   38.861 run.py:23(run_diags)
        1    0.000    0.000   38.851   38.851 e3sm_diags_driver.py:400(main)
        1    0.000    0.000   22.114   22.114 e3sm_diags_driver.py:137(save_provenance)
        1    0.000    0.000   22.091   22.091 e3sm_diags_driver.py:44(_save_env_yml)
        1    0.000    0.000   22.055   22.055 subprocess.py:1110(communicate)
        1    0.000    0.000   22.055   22.055 subprocess.py:1968(_communicate)
        2    0.000    0.000   22.055   11.027 selectors.py:403(select)
        2   22.055   11.027   22.055   11.027 {method 'poll' of 'select.poll' objects}
  1366/14    0.006    0.000   15.746    1.125 <frozen importlib._bootstrap>:1022(_find_and_load)
  1338/14    0.005    0.000   15.746    1.125 <frozen importlib._bootstrap>:987(_find_and_load_unlocked)
  1275/15    0.005    0.000   15.737    1.049 <frozen importlib._bootstrap>:664(_load_unlocked)
  1172/15    0.003    0.000   15.736    1.049 <frozen importlib._bootstrap_external>:877(exec_module)
  1691/14    0.001    0.000   15.690    1.121 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
   747/33    0.001    0.000   15.438    0.468 {built-in method builtins.__import__}
 1730/522    0.003    0.000   14.639    0.028 <frozen importlib._bootstrap>:1053(_handle_fromlist)
        2    0.000    0.000   10.936    5.468 dataset.py:1(<module>)
     1160    0.010    0.000    9.832    0.008 <frozen importlib._bootstrap_external>:950(get_code)
        1    0.000    0.000    9.784    9.784 main.py:114(create_viewer)
        1    0.000    0.000    9.773    9.773 default_viewer.py:44(create_viewer)

I'm not very experienced in interpreting cProfile results. The full results are uploaded; maybe @tomvothecoder, @xylar and @forsyth2 can help take a look as well. I'm also making a larger run to see if we can draw more conclusive results from it. timing_edvrc4_ANN_3.txt timing_eurc14_ANN_3.txt

tomvothecoder commented 1 year ago

@chengzhuzhang I think the cProfile results should be sorted by tottime (total time) in descending order. UPDATE: Sorting by cumulative time also helps for seeing the order of the call stack.
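
If it helps, a saved profile can be re-sorted both ways without re-running the diagnostics (a sketch; the -o dump step and the profile.out filename are assumptions, not what was run above):

# Dump the profile to a file first, e.g.:
#   python -m cProfile -o profile.out run_lat_lon.py -d lat_lon_ANN.cfg
import pstats

stats = pstats.Stats("profile.out")
stats.sort_stats("tottime").print_stats(20)     # hot spots: time spent inside each function itself
stats.sort_stats("cumulative").print_stats(20)  # call-stack view: inclusive time down each call chain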

I don't have experience with cProfile (yet) and it seems a bit hard to parse where the lower-level library calls are being made from (e.g., select.poll). A few log lines that stand out in the unified rc14 log include:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   22.055   11.027   22.055   11.027 {method 'poll' of 'select.poll' objects}
     1292    9.014    0.007    9.014    0.007 {method 'read' of '_io.BufferedReader' objects}
       13    6.772    0.521    6.772    0.521 {method 'acquire' of '_thread.lock' objects}
    86/83    2.592    0.030    2.603    0.031 {built-in method _imp.create_dynamic}

These calls are I/O and thread related. I can see how scaling up the data size might multiply the runtime for these processes.

Python Profiling: What does "method 'poll' of 'select.poll' objects"?


chengzhuzhang commented 1 year ago

@tomvothecoder thank you for inspecting the cProfile results!

However, after more investigation, I tend to believe the problem may not come from something internal to e3sm_diags. I did the same profiling for much larger tasks, i.e. the whole lat-lon set and all seasons. The timings are actually very similar, around 2880 seconds, between the e3sm_diags conda env and unified rc14. Note that I only tested by running interactively on a compute node. When I later tried to time the zppy e3sm_diags sets one at a time by submitting slurm jobs, the slowdown did occur, i.e. the lat-lon set wouldn't finish in 2 hours. Together with the MPI error reported here, this leads me to think the slowdown is related to how Slurm and MPI are built...

chengzhuzhang commented 1 year ago

@xylar I'm not at all familiar with the build process for SLURM and MPI... But I did some searching; I'm not sure if the solution suggested in this link makes sense: https://lists.schedmd.com/pipermail/slurm-users/2019-March/003018.html

In the meantime, I've run out of ideas for more testing...

xylar commented 1 year ago

Sorry if there's some confusion but I'm not using Spack to build MPI itself. I am using the same MPI module that E3SM uses for Gnu compilers on each machine. If there is MPI trouble, I suspect it is something that is built with conda MPI trying to run using HPC MPI libraries.

Please track down what package or script is being run with MPI so we can see if it has been built with system MPI (via Slurm) or if it comes from conda forge, in which case it can't be run with MPI on our systems. This is the reason that I have to build all our MPI tools and libraries with Spack.

xylar commented 1 year ago

Are you seeing the MPI errors only with cProfile or also with normal runs? I could imagine cProfile itself being hard to use properly in this mixed spack and conda environment.

I had in mind to just add calls within e3sm_diags to get the current time at 2 spots in the code and take the difference. The old-fashioned and tedious but also very robust method for timing a code.
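
For example, something along these lines around a suspect call (a sketch, not actual e3sm_diags code; the step being timed is a placeholder):

import time

start = time.perf_counter()
# ... call the e3sm_diags step under suspicion here (e.g., the regridding or plotting call) ...
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.1f} s")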

chengzhuzhang commented 1 year ago

@mahf708 encountered the same issue on Chrysalis; copying his Slack message here: "I face a specific issue on a chrysalis compute node with e3sm_diags within e3sm-unified (latest RC) above: Triggering some e3sm_diags like below returns an elaborate error. Not sure what it means really. I resolve it with simply quitting the node and starting over. This suggests that some python processes are holding on to the locks or something?"

Actually I used the same solution to get past the problem. Restarting from a fresh node works fine.

chengzhuzhang commented 1 year ago

In my testing, the MPI error occurred both with and without cProfile. I guess I can look into the old-fashioned timing method. Right now the e3sm_diags run log contains time stamps, but with multiprocessing it is not possible to isolate the time spent in each process.

tomvothecoder commented 1 year ago

In my testing, the MPI error occurred both with and without cProfile. I guess I can look into the old-fashioned timing method. Right now the e3sm_diags run log contains time stamps, but with multiprocessing it is not possible to isolate the time spent in each process.

Copying my Slack discussion with Jill

Adding the following code at the top of e3sm_diags_driver.py should make Dask run sequentially for debugging and profiling.

import dask
dask.config.set(scheduler='synchronous')  # overwrite default with single-threaded scheduler

We can look at the timestamps in the logs to see the runtime. I'm hoping running Dask sequentially mimics the same slowness if it calls the same multiprocessing modules, just on one thread/process at a time.

More info on this setting:

The single-threaded synchronous scheduler executes all computations in the local thread with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes.

For example, when using IPython or Jupyter notebooks, the %debug, %pdb, or %prun magics will not work well when using the parallel Dask schedulers (they were not designed to be used in a parallel computing context). However, if you run into an exception and want to step into the debugger, you may wish to rerun your computation under the single-threaded scheduler where these tools will function properly.

https://docs.dask.org/en/stable/scheduling.html#single-thread

xylar commented 1 year ago

@chengzhuzhang, in case it's helpful, I opened:

/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.0rc14_chrysalis/lib/python3.10/site-packages/e3sm_diags

up for group write permission.

forsyth2 commented 1 year ago

I tried running via zppy with num_workers=4 (rather than the default of 24), as @golaz suggested (reasoning that the overhead of the workers may be causing an issue). Unfortunately, it looks like it's still going beyond the original 2-hour limit. A sketch of the change is shown below.
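
For reference, this amounts to a one-line change in the zppy cfg, roughly as follows (a sketch; placement under the [e3sm_diags] section is an assumption based on the generated test cfg):

[e3sm_diags]
  # reduced from the default of 24 to test whether worker overhead is the bottleneck
  num_workers = 4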

forsyth2 commented 1 year ago

Follow-up on https://github.com/E3SM-Project/zppy/issues/485#issuecomment-1699575612 and https://github.com/E3SM-Project/zppy/issues/485#issuecomment-1703391354

With 4 workers:

tail -n 3 e3sm_diags_atm_monthly_180x360_aave_*.o*
==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o381747 <==
==============================================
Elapsed time: 49 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1853.o381749 <==
==============================================
Elapsed time: 49 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1852-1853.o381748 <==
==============================================
Elapsed time: 49 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1851.o381744 <==
==============================================
Elapsed time: 14235 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1850-1853.o381746 <==
==============================================
Elapsed time: 14287 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.o381745 <==
==============================================
Elapsed time: 14303 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o381751 <==
==============================================
Elapsed time: 4186 seconds
==============================================

==> e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o381750 <==
==============================================
Elapsed time: 49 seconds
==============================================

So, 49 seconds to 14,303 seconds (= 238.4 minutes = 3.97 hours). Compared with the earlier range of "47 to 14,688 seconds (= 244.8 minutes = 4.08 hours)", that's only a saving of ~6.4 minutes on the high end.

chengzhuzhang commented 1 year ago

@forsyth2 thank you for the data points. I'm also updating some of my initial investigation here. In my case, I only tested "lat_lon" because it can show the slowdown on its own. I use sbatch to submit the jobs to slurm.

|   | e3sm_diags rc4 env | e3sm unified rc14 env |
| -- | -- | -- |
| `scheduler='synchronous'` for Dask single-threaded scheduler | 166 mins | 166 mins |
| `multiprocessing` for Dask with 24 workers | 38 mins | 126 mins |

When running in serial with Dask, I did notice that some lat_lon cases take longer; the longest is the comparison with ERA5 OMEGA at 850 mb, which took about 6 mins for each season, so 30 mins to finish 5 seasons. Within the calculation for each season of this variable, the contour map plotting step can take around 5 mins.

I also suspect there is an issue in scheduling. More investigation is needed.

chengzhuzhang commented 1 year ago

While @tomvothecoder and I were doing more time profiling work and testing, it occurred to me that, running with unified rc14, each process is almost proportionally slower. Given that slurm/MPI problems also occur intermittently, I suspect this is still an environment problem: some e3sm_diags processes are trying to call MPI in the unified env, but the calls are not properly initiated. With srun -n 1, the parallel program starts and the parallel environment is set up correctly. The fix is added to zppy in https://github.com/E3SM-Project/zppy/pull/497 and should remind users who run e3sm_diags in unified to prefix srun to get rid of the PMI2_Init error reported by @mahf708 and me earlier.
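
As an illustration of the workaround, the same minimal profiling command from earlier in this thread would be launched roughly as follows (the exact change in the zppy PR may differ):

srun -n 1 python -m cProfile run_lat_lon.py -d lat_lon_ANN.cfg > timing_eurc14_ANN_2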

xylar commented 1 year ago

Yay!!!!!