Closed forsyth2 closed 3 months ago
@chengzhuzhang As mentioned on #602, this issue seems to be independent of the code changes on #602. The stack trace really doesn't give me much to go off:
NetCDF: HDF error
/var/spool/slurmd/job552617/slurm_script: line 80: 1328058 Killed GenerateCSMesh --res $res --alt --file ${result_dir}outCSne$res.g
The `main` branch seems unaffected by this. That means 2 things: 1) this isn't an issue with storage space on scratch or anything like that affecting me specifically, and 2) these two separate pull requests (#612, #602 with 2nd commit only) are somehow independently producing the TC analysis error.
test output dir | branch | base commit | conda env | Ran `pip install . && python tests/integration/utils.py`? | lessons learned |
---|---|---|---|---|---|
/lcrc/group/e3sm/ac.forsyth2/zppy_test_debug_output/test-main-613v2/v2.LR.historical_0201/post/scripts | test-main-613 | Add center times (611) | zppy_dev_n600 | y | `grep -v "OK" *status` shows no errors |
/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test-main-613v2/v2.LR.historical_0201/post/scripts | test-main-613 | Add center times (611) | zppy_dev_n600 | y | `grep -v "OK" *status` shows no errors |
I ran a debug version of #602 with the 1st commit dropped and the TC analysis tasks worked. This leads me to believe there's some sort of concurrency issue happening when more jobs are run simultaneously.
Or actually, rather than too many jobs running in one zppy run, it's possible this issue has been coming up because I've been testing multiple branches of zppy simultaneously. That would mean the TC analysis tasks from different runs could be trying to write to the same spot in scratch, resulting in race conditions.
There have been run-in-parallel issues with `tc_analysis` before. E.g., in `zppy/tc_analysis.py` we have:
# There is a `GenerateConnectivityFile: error while loading shared libraries: libnetcdf.so.11: cannot open shared object file: No such file or directory` error
# when multiple year_sets are run simultaneously. Therefore, we will wait for the completion of one year_set before moving on to the next.
Running #602 without the 1st commit does in fact produce no failures when I run `complete_run` without any other tests running. I think that pretty much confirms TC Analysis needs something more to avoid concurrency issues / race conditions. Maybe subdirectories based on a unique ID per test?
I think adding more specificity via subdirectories resolves this issue. See #615. Closing.
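For anyone reading later, the subdirectory idea can be sketched roughly as below. This is only an illustration of the general approach, not necessarily what #615 does: the function name `unique_scratch_dir`, its parameters, and the directory layout are all hypothetical and not taken from zppy.

```python
import os
import tempfile


def unique_scratch_dir(scratch_root: str, case: str, year_set: str, unique_id: str) -> str:
    """Return (and create) a scratch directory that is specific to one test run.

    Hypothetical helper: the name, parameters, and layout are assumptions
    made for illustration, not zppy's actual implementation.
    """
    # Including unique_id in the path keeps two concurrent runs of the same
    # case and year set from reading or writing the same intermediate files.
    path = os.path.join(scratch_root, f"tc_analysis_{case}_{unique_id}", year_set)
    os.makedirs(path, exist_ok=True)  # safe if the directory already exists
    return path


if __name__ == "__main__":
    # Use a throwaway root here instead of a real scratch filesystem.
    root = tempfile.mkdtemp()
    dir_a = unique_scratch_dir(root, "v2.LR.historical_0201", "1850-1851", "test-main-613")
    dir_b = unique_scratch_dir(root, "v2.LR.historical_0201", "1850-1851", "test-602")
    print(dir_a)
    print(dir_b)
```

With a layout like this, two branches tested at the same time get separate directories even for the same case and year set, so their intermediate files cannot collide.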
What happened?
`complete_run` can't finish because `tc_analysis_1850-1851` is not completing successfully. The status file says "RUNNING" but there is no corresponding job running. The `.o` file shows:
Furthermore, this is happening on two different pull requests: #607/#612 and #421/#602 (interestingly, this was only happening on one of two testing branches).
What machine were you running on?
Chrysalis
Environment
zppy_dev_n600
What command did you run?
Copy your cfg file
What jobs are failing?
What stack trace are you encountering?