E3SM-Project / e3sm_diags

E3SM Diagnostics package
https://e3sm-project.github.io/e3sm_diags
BSD 3-Clause "New" or "Revised" License

concurrent.futures.process.BrokenProcessPool testing e3sm_unified rc12 on Perlmutter #720

Closed chengzhuzhang closed 1 year ago

chengzhuzhang commented 1 year ago

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. Details are documented here: https://github.com/E3SM-Project/zppy/issues/485#issuecomment-1686792065

Some troubleshooting suggested that the newly added mp_partition set (which is not included in the zppy tests) triggered this problem. The issue was not seen in the e3sm_diags rc3 standalone environment, or in e3sm_unified rc12 on a pm-cpu login node.

xylar commented 1 year ago

It could be some sort of conflict between ESMF built with machine compilers and MPI in spack (to provide the ESMF_RegridWeightGen executable) and ESMPy from conda-forge. We need ESMF_RegridWeightGen built with system compilers and MPI to be able to use multiple nodes (or, in some cases, any MPI parallelism) on HPC. But we cannot build ESMPy (or any Python packages) with system compilers and MPI, so that still comes from conda-forge.

If that is the issue, it would explain why it works fine on login nodes and in an e3sm_diags test environment, both of which only use conda packages and no spack builds.

If that is the problem, I don't know what to do about it. We rely on ESMF_RegridWeightGen built with spack as a fundamental tool, and we need to use it across many nodes, so the conda package is not an option. I had been pleasantly surprised that ESMPy from conda-forge was working in e3sm_diags even with ESMF built from spack up to now, but it seems that's no longer the case with ESMF/ESMPy 8.4.2. If that is correct, I wish we had discovered this months ago...
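
A quick way to check for this kind of mismatch might be to compare what the conda-forge ESMPy reports against the spack-built executable on PATH. A minimal sketch, assuming the environment provides both, that ESMPy 8.4+ is importable as esmpy, and that ESMF_RegridWeightGen accepts a --version flag:

import shutil
import subprocess

import esmpy  # conda-forge package; importable as "esmpy" in 8.4+

exe = shutil.which("ESMF_RegridWeightGen")
print("ESMF_RegridWeightGen on PATH:", exe)   # spack build on compute nodes
print("esmpy version:", esmpy.__version__)    # conda-forge build
print("esmpy location:", esmpy.__file__)

if exe is not None:
    # --version is assumed to be supported here; a mismatch between this
    # output and esmpy.__version__ would point at the suspected conflict.
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    print(result.stdout or result.stderr)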

xylar commented 1 year ago

I will think about this some more tomorrow. Could you provide me with a small test that reproduces the problem?

forsyth2 commented 1 year ago

I'm encountering this when running zppy for the integration tests on Perlmutter. This error occurs, but zppy doesn't actually note the error as a failure. (If I recall correctly, this might have something to do with E3SM Diags "succeeding" if at least one plot is generated).

$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No failures
$ grep -n "concurrent.futures.process.BrokenProcessPool" *
bundle3.o14264254:34037:concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
grep: global_time_series_1850-1860_dir: Is a directory
grep: global_time_series_1850-1860_results: Is a directory

$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
# No failures
$ grep -n "concurrent.futures.process.BrokenProcessPool" *
e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o14019352:10037:concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o14019351:121:concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
grep: global_time_series_1850-1860_dir: Is a directory
grep: global_time_series_1850-1860_results: Is a directory

For reference, on the other machines:

Chrysalis:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts
$ grep -n "concurrent.futures.process.BrokenProcessPool" *
# No matches

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts
$ grep -n "concurrent.futures.process.BrokenProcessPool" *
# No matches

Compy:

$ cd /compyfs/fors729/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts/
$ grep -n "concurrent.futures.process.BrokenProcessPool" *
# No matches 

$ cd /compyfs/fors729/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/
$ grep -n "concurrent.futures.process.BrokenProcessPool" *
# No matches
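
To repeat this scan without grep tripping over the result directories, here is a small sketch that walks a post/scripts directory and reports every job output file containing the error string; the *.o* glob for Slurm output files is an assumption based on the filenames above.

from pathlib import Path

PATTERN = "concurrent.futures.process.BrokenProcessPool"

def scan(scripts_dir: str) -> None:
    # Mirror `grep -n PATTERN *`, but only over the Slurm job output files.
    for path in sorted(Path(scripts_dir).glob("*.o*")):
        for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
            if PATTERN in line:
                print(f"{path.name}:{lineno}: {line.strip()}")

scan("/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/"
     "v2.LR.historical_0201/post/scripts")
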
chengzhuzhang commented 1 year ago

@forsyth2 thanks, can you open up permissions on /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/ so that others can help troubleshoot?

forsyth2 commented 1 year ago

I just ran:

$ chmod -R o+r /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/
$ chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/
$ chmod -R o+r /global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/
$ chgrp -R e3sm /global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/
chengzhuzhang commented 1 year ago

Test to reproduce:

  1. allocate a pm-cpu node: salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu --account=e3sm

  2. activate e3sm_unified rc12: source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.9.0rc12_pm-cpu.sh

  3. run the e3sm_diags command below (remember to update the --results_dir path); a rough Python-API equivalent is sketched after the command

    e3sm_diags mp_partition --no_viewer \
    --reference_data_path '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology' \
    --test_data_path '/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/time-series/rgr/' \
    --results_dir '/global/cfs/cdirs/e3sm/www/chengzhu/v2_9_0_all_sets_tests/eu12-special-no-viewer' --case_id 'mixed-phase_partition' \
    --ref_timeseries_input \
    --test_timeseries_input \
    --run_type 'model_vs_obs' \
    --sets 'mp_partition' --variables 'LCF' \
    --regions 'global' --regrid_tool 'esmf' --regrid_method 'conservative' \
    --multiprocessing --num_workers '25' --backend 'mpl' \
    --test_name '20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis' --short_test_name 'e3sm_v2' \
    --ref_name 'McCoy' \
    --granulate 'variables' 'plevs' 'regions' --selectors 'sets' 'seasons' \
    --test_start_yr 0051 --test_end_yr 0060
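
For anyone who wants to drive the same case from Python (e.g., under a debugger), here is a rough, untested equivalent of the command above using the e3sm_diags runner API; the parameter names mirror the CLI flags, and using a plain CoreParameter for the mp_partition set is an assumption.

from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.run import runner

param = CoreParameter()
param.reference_data_path = "/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology"
param.test_data_path = (
    "/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/"
    "20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/time-series/rgr/"
)
param.results_dir = "/path/to/your/results_dir"  # update, as with --results_dir
param.run_type = "model_vs_obs"
param.ref_timeseries_input = True
param.test_timeseries_input = True
param.test_start_yr = "0051"
param.test_end_yr = "0060"
param.test_name = "20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis"
param.short_test_name = "e3sm_v2"
param.ref_name = "McCoy"
param.variables = ["LCF"]
param.multiprocessing = True
param.num_workers = 25

runner.sets_to_run = ["mp_partition"]
runner.run_diags([param])
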
chengzhuzhang commented 1 year ago

Looking at the two run logs @forsyth2 provided, the same problem happened in his zppy tests:

one with tc analysis:

/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/e3sm_diags_atm_monthly_180x360_aave_tc_analysis_model_vs_obs_1850-1851.o14019351

the other with model vs model:

/global/cfs/cdirs/e3sm/forsyth/zppy_test_complete_run_output/v2.LR.historical_0201/post/scripts/e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1852-1853_vs_1850-1851.o14019352
xylar commented 1 year ago

I'm afraid I haven't had time to make any headway on this today. I don't know when I will have time, hopefully by Friday.

chengzhuzhang commented 1 year ago

I will spend some time on this today. @tomvothecoder, let me know if you have any insights and can help troubleshoot this issue. I believe this is the last issue we need to resolve before the final release.

xylar commented 1 year ago

My suspicion is that it's an incompatibility between ESMF built with spack and ESMPy from conda-forge. I don't know how much you two will be able to do to untangle that, @chengzhuzhang and @tomvothecoder, but whatever you can discover would be helpful.

Having ESMF built with system compilers and MPI is one of the fundamental services that E3SMU provides to the team so that is pretty hard to consider taking off the table. At the same time, e3sm_diags is very fundamental to the project, too, so we can't have a configuration where it doesn't work.

I have thought about trying to rebuild ESMPy in the conda environment but linking to the ESMF in the spack environment. This may work or it may be a total disaster. It's the kind of thing that I would normally investigate over several weeks as I have time, so it feels very stressful to pursue this under time pressure.

I am swamped at the moment with trying to get E3SM v3 ocean and sea ice meshes tested in time for them to be useful before the November deadline. As urgent as E3SMU feels, the meshes are more urgent.

tomvothecoder commented 1 year ago

> I will spend some time on this today. @tomvothecoder, let me know if you have any insights and can help troubleshoot this issue. I believe this is the last issue we need to resolve before the final release.

I'll look at the stacktrace and try to step through the code to provide a minimal reproducible example that might help trace the root cause.

chengzhuzhang commented 1 year ago

@tomvothecoder thank you, that would be helpful! I tend to agree with Xylar's suggestion that a package incompatibility in Unified might be the cause, but it would still be useful to check from the e3sm_diags side and rule out whether anything in the multiprocessing with the dask scheduler is messing things up, or to find the root cause.

tomvothecoder commented 1 year ago

> I will spend some time on this today. @tomvothecoder, let me know if you have any insights and can help troubleshoot this issue. I believe this is the last issue we need to resolve before the final release.

> I'll look at the stacktrace and try to step through the code to provide a minimal reproducible example that might help trace the root cause.

> @tomvothecoder thank you, that would be helpful! I tend to agree with Xylar's suggestion that a package incompatibility in Unified might be the cause, but it would still be useful to check from the e3sm_diags side and rule out whether anything in the multiprocessing with the dask scheduler is messing things up, or to find the root cause.

I actually don't think I can do much here because this issue only occurs in E3SM Unified and is not reproducible in a local e3sm_diags rc3/dev environment (some good news at least).

Even if we can reproduce the error, I probably won't be able to step through the code to debug. I read that ProcessPoolExecutor (which throws the BrokenProcessPool error) doesn't work with interactive consoles (source).
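
For reference, here is a minimal sketch of how a worker that dies abruptly under dask's multiprocessing scheduler surfaces as BrokenProcessPool in the parent, which is the same call path e3sm_diags takes with --multiprocessing (bag.map(run_diag).compute(...)); the os._exit call only simulates a child being killed (e.g. by the OOM killer or a library-level abort) and is not a claim about the real cause here.

import os

import dask.bag as db

def fake_diag(i):
    if i == 3:
        os._exit(1)  # simulate a worker process terminated abruptly
    return i * 2

if __name__ == "__main__":
    bag = db.from_sequence(range(8), npartitions=8)
    # The "processes" scheduler drives a concurrent.futures.ProcessPoolExecutor,
    # so the parent raises BrokenProcessPool instead of a normal task exception.
    bag.map(fake_diag).compute(scheduler="processes", num_workers=4)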

> It could be some sort of conflict between ESMF built with machine compilers and MPI in spack (to provide the ESMF_RegridWeightGen executable) and ESMPy from conda-forge.

@xylar you probably already thought of this, but we can try creating a standalone environment that includes e3sm_diags (and esmpy) with MPI in spack. We can run the e3sm_diags command in this comment to see if the error is thrown, which might help isolate the issue to specific packages in E3SM Unified.

Unfortunately I'm not familiar with the MPI build process to get this going.

tomvothecoder commented 1 year ago

Also here's the complete stacktrace with the MPI output if it is helpful:

(e3sm_unified_1.9.0rc12_pm-cpu) vo13@nid004307:~/E3SM-Project/e3sm_diags> e3sm_diags mp_partition --no_viewer \
    --reference_data_path '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology' \
    --test_data_path '/global/cfs/cdirs/e3sm/e3sm_diags/postprocessed_e3sm_v2_data_for_e3sm_diags/20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis/time-series/rgr/' \
    --results_dir '/global/cfs/cdirs/e3sm/www/vo13/v2_9_0_all_sets_tests/eu12-special-no-viewer' --case_id 'mixed-phase_partition' \
    --ref_timeseries_input --test_timeseries_input --run_type 'model_vs_obs' \
    --sets 'mp_partition' --variables 'LCF' --regions 'global' --regrid_tool 'esmf' --regrid_method 'conservative' \
    --multiprocessing --num_workers '25' --backend 'mpl' \
    --test_name '20210528.v2rc3e.piControl.ne30pg2_EC30to60E2r2.chrysalis' --short_test_name 'e3sm_v2' \
    --ref_name 'McCoy' --granulate 'variables' 'plevs' 'regions' --selectors 'sets' 'seasons' \
    --test_start_yr 0051 --test_end_yr 0060
PE 0: MPICH processor detected:
PE 0:   AMD Milan (25:1:1) (family:model:stepping)
MPI VERSION    : CRAY MPICH version 8.1.24.16 (ANL base 3.4a2)
MPI BUILD INFO : Wed Jan 18 17:36 2023 (git hash 11b1c78) (CH4)
PE 0: MPICH environment settings =====================================
PE 0:   MPICH_ENV_DISPLAY                              = 1
PE 0:   MPICH_VERSION_DISPLAY                          = 1
PE 0:   MPICH_ABORT_ON_ERROR                           = 0
PE 0:   MPICH_CPUMASK_DISPLAY                          = 0
PE 0:   MPICH_STATS_DISPLAY                            = 0
PE 0:   MPICH_RANK_REORDER_METHOD                      = 1
PE 0:   MPICH_RANK_REORDER_DISPLAY                     = 0
PE 0:   MPICH_MEMCPY_MEM_CHECK                         = 0
PE 0:   MPICH_USE_SYSTEM_MEMCPY                        = 0
PE 0:   MPICH_OPTIMIZED_MEMCPY                         = 1
PE 0:   MPICH_ALLOC_MEM_PG_SZ                          = 4096
PE 0:   MPICH_ALLOC_MEM_POLICY                         = PREFERRED
PE 0:   MPICH_ALLOC_MEM_AFFINITY                       = SYS_DEFAULT
PE 0:   MPICH_MALLOC_FALLBACK                          = 0
PE 0:   MPICH_MEM_DEBUG_FNAME                          = 
PE 0:   MPICH_INTERNAL_MEM_AFFINITY                    = SYS_DEFAULT
PE 0:   MPICH_NO_BUFFER_ALIAS_CHECK                    = 0
PE 0:   MPICH_COLL_SYNC                                = 0
PE 0:   MPICH_SINGLE_HOST_ENABLED                        = 1
PE 0: MPICH/RMA environment settings =================================
PE 0:   MPICH_RMA_MAX_PENDING                          = 128
PE 0:   MPICH_RMA_SHM_ACCUMULATE                       = 0
PE 0: MPICH/Dynamic Process Management environment settings ==========
PE 0:   MPICH_DPM_DIR                                  = 
PE 0:   MPICH_LOCAL_SPAWN_SERVER                       = 0
PE 0:   MPICH_SPAWN_USE_RANKPOOL                       = 1
PE 0: MPICH/SMP environment settings =================================
PE 0:   MPICH_SMP_SINGLE_COPY_MODE                     = XPMEM
PE 0:   MPICH_SMP_SINGLE_COPY_SIZE                     = 8192
PE 0:   MPICH_SHM_PROGRESS_MAX_BATCH_SIZE              = 8
PE 0: MPICH/COLLECTIVE environment settings ==========================
PE 0:   MPICH_COLL_OPT_OFF                             = 0
PE 0:   MPICH_BCAST_ONLY_TREE                          = 1
PE 0:   MPICH_BCAST_INTERNODE_RADIX                    = 4
PE 0:   MPICH_BCAST_INTRANODE_RADIX                    = 4
PE 0:   MPICH_ALLTOALL_SHORT_MSG                       = 64-512
PE 0:   MPICH_ALLTOALL_SYNC_FREQ                       = 1-24
PE 0:   MPICH_ALLTOALLV_THROTTLE                       = 8
PE 0:   MPICH_ALLGATHER_VSHORT_MSG                     = 1024-4096
PE 0:   MPICH_ALLGATHERV_VSHORT_MSG                    = 1024-4096
PE 0:   MPICH_GATHERV_SHORT_MSG                        = 131072
PE 0:   MPICH_GATHERV_MIN_COMM_SIZE                    = 64
PE 0:   MPICH_GATHERV_MAX_TMP_SIZE                     = 536870912
PE 0:   MPICH_GATHERV_SYNC_FREQ                        = 16
PE 0:   MPICH_IGATHERV_RAND_COMMSIZE                   = 2048
PE 0:   MPICH_IGATHERV_RAND_RECVLIST                   = 0
PE 0:   MPICH_SCATTERV_SHORT_MSG                       = 2048-8192
PE 0:   MPICH_SCATTERV_MIN_COMM_SIZE                   = 64
PE 0:   MPICH_SCATTERV_MAX_TMP_SIZE                    = 536870912
PE 0:   MPICH_SCATTERV_SYNC_FREQ                       = 16
PE 0:   MPICH_SCATTERV_SYNCHRONOUS                     = 0
PE 0:   MPICH_ALLREDUCE_MAX_SMP_SIZE                   = 262144
PE 0:   MPICH_ALLREDUCE_BLK_SIZE                       = 716800
PE 0:   MPICH_GPU_ALLREDUCE_USE_KERNEL                 = 0
PE 0:   MPICH_GPU_COLL_STAGING_BUF_SIZE                = 1048576
PE 0:   MPICH_GPU_ALLREDUCE_STAGING_THRESHOLD          = 256
PE 0:   MPICH_ALLREDUCE_NO_SMP                         = 0
PE 0:   MPICH_REDUCE_NO_SMP                            = 0
PE 0:   MPICH_REDUCE_SCATTER_COMMUTATIVE_LONG_MSG_SIZE = 524288
PE 0:   MPICH_REDUCE_SCATTER_MAX_COMMSIZE              = 1000
PE 0:   MPICH_SHARED_MEM_COLL_OPT                      = 1
PE 0:   MPICH_SHARED_MEM_COLL_NCELLS                   = 8
PE 0:   MPICH_SHARED_MEM_COLL_CELLSZ                   = 256
PE 0: MPICH MPIIO environment settings ===============================
PE 0:   MPICH_MPIIO_HINTS_DISPLAY                      = 0
PE 0:   MPICH_MPIIO_HINTS                              = NULL
PE 0:   MPICH_MPIIO_ABORT_ON_RW_ERROR                  = disable
PE 0:   MPICH_MPIIO_CB_ALIGN                           = 2
PE 0:   MPICH_MPIIO_DVS_MAXNODES                       = 24
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY       = 0
PE 0:   MPICH_MPIIO_AGGREGATOR_PLACEMENT_STRIDE        = -1
PE 0:   MPICH_MPIIO_MAX_NUM_IRECV                      = 50
PE 0:   MPICH_MPIIO_MAX_NUM_ISEND                      = 50
PE 0:   MPICH_MPIIO_MAX_SIZE_ISEND                     = 10485760
PE 0:   MPICH_MPIIO_OFI_STARTUP_CONNECT                = disable
PE 0:   MPICH_MPIIO_OFI_STARTUP_NODES_AGGREGATOR        = 2
PE 0: MPICH MPIIO statistics environment settings ====================
PE 0:   MPICH_MPIIO_STATS                              = 0
PE 0:   MPICH_MPIIO_TIMERS                             = 0
PE 0:   MPICH_MPIIO_WRITE_EXIT_BARRIER                 = 1
PE 0: MPICH Thread Safety settings ===================================
PE 0:   MPICH_ASYNC_PROGRESS                           = 0
PE 0:   MPICH_OPT_THREAD_SYNC                          = 1
PE 0:   rank 0 required = multiple, was provided = multiple
Traceback (most recent call last):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/bin/e3sm_diags", line 10, in <module>
    sys.exit(main())
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 419, in main
    parameters_results = _run_with_dask(parameters)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 365, in _run_with_dask
    results = bag.map(run_diag).compute(num_workers=num_workers)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 310, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 595, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/multiprocessing.py", line 233, in get
    result = get_async(
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/site-packages/dask/local.py", line 500, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc12_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
chengzhuzhang commented 1 year ago

@tomvothecoder I'm actually struggling with the same problem: we can't reproduce this environment when installing a development e3sm_diags version for debugging...

forsyth2 commented 1 year ago

@xylar @chengzhuzhang For Unified rc14, testing zppy on Perlmutter:

$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/v2.LR.historical_0201/post_20230829/scripts/
$ grep -v "OK" *status
bundle1.status:RUNNING 14629937
e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1852-1853.status:RUNNING 14629937
$ tail -n 20 bundle1.o14629937 
Traceback (most recent call last):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/e3sm_diags/run.py", line 34, in run_diags
    main(final_params)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 419, in main
    parameters_results = _run_with_dask(parameters)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 365, in _run_with_dask
    results = bag.map(run_diag).compute(num_workers=num_workers)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 310, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/dask/base.py", line 595, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/dask/multiprocessing.py", line 233, in get
    result = get_async(
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/dask/local.py", line 500, in get_async
    for key, res_info, failed in queue_get(queue).result():
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
xylar commented 1 year ago

AAAAArgh! Okay, it seems like we need to add import e3sm_diags at the top of e3sm_diags/run.py before any other e3sm_diags imports. At least that's what I would try next.

If we need a new E3SM-Unified RC to test this, it's going to take 2 days. Can we come up with a way to do this faster? Like what I suggested in https://github.com/E3SM-Project/e3sm_diags/pull/722#issuecomment-1695578452, but for zppy instead of e3sm_diags?

xylar commented 1 year ago

I don't think there's any point in just rerunning.

xylar commented 1 year ago

Maybe I can just edit the code in rc14 on Perlmutter and we can try this. That's what I actually just did by accident (and then undid it).

But I'm not available to do that for a couple of hours and no one else has permission.

forsyth2 commented 1 year ago

> Can we come up with a way to do this faster?

Would that mean the following?

> I don't think there's any point in just rerunning.

I was hoping maybe the error was a fluke, but yeah there probably isn't much point.

> But I'm not available to do that for a couple of hours and no one else has permission.

I also can't do the steps mentioned above in the next couple of hours.

xylar commented 1 year ago

@forsyth2, yes, exactly. Try that when you have time. I will edit rc14 and ping you when I get a chance. We'll see what works.

I just can't quite believe how long this process has dragged on.

forsyth2 commented 1 year ago

Actually, re-running did eliminate the problem. This leads me to believe this error is intermittent, thus making it hard to truly debug...

$ cd /global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/v2.LR.historical_0201/post_20230830_wo_edits/scripts
$ grep -v "OK" *status
# No failures
$ tail bundle1.o14673095 
===== COPY FILES TO WEB SERVER =====

/global/cfs/cdirs/e3sm/www/forsyth/zppy_test_bundles_www/v2.LR.historical_0201/e3sm_diags/atm_monthly_180x360_aave /global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts/tmp.14673095.61yc
/global/cfs/cdirs/e3sm/forsyth/zppy_test_bundles_output/v2.LR.historical_0201/post/scripts/tmp.14673095.61yc
==============================================
Elapsed time: 398 seconds
==============================================
==============================================
Elapsed time: 2708 seconds
==============================================

In any case, I'm currently running with the edit to E3SM Diags.

xylar commented 1 year ago

Okay, that's interesting!

I went ahead and added import e3sm_diags to the run.py module:

$ vim /global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.9.0rc14_pm-cpu/lib/python3.10/site-packages/e3sm_diags/run.py
import copy

import e3sm_diags
from e3sm_diags.e3sm_diags_driver import get_default_diags_path, main
from e3sm_diags.logger import custom_logger, move_log_to_prov_dir
from e3sm_diags.parameter import SET_TO_PARAMETERS
from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.parser.core_parser import CoreParser

...

Could you run a few more times and see if it ever happens again with this change?

xylar commented 1 year ago

Frustrating if it's intermittent. That would make it very hard to debug.

forsyth2 commented 1 year ago

> Could you run a few more times and see if it ever happens again with this change?

Yeah, I will try to do a few runs of the bundles-run specifically, tomorrow.

forsyth2 commented 1 year ago

@xylar OK, I ran once after making the import e3sm_diags change myself and twice using your updated rc14, and I haven't run into this error. I would say the issue is resolved.

xylar commented 1 year ago

Except that we need to actually make this change in e3sm_diags itself...

chengzhuzhang commented 1 year ago

@xylar yes, I'm aware of this and will create a PR. I'm wondering if you could edit the source code on Chrysalis so that we have a shortcut to test. (I'm pretty desperate about the slow e3sm_diags runs on Chrysalis and wondered if this fix could help.) Update: never mind, it looks like I can use an e3sm_diags dev env + spack env to test.

chengzhuzhang commented 1 year ago

Just to update: a PR has been submitted to resolve this issue. I also confirmed that this fix doesn't help shorten the runtime on Chrysalis with Slurm submission.