E3SM-Project / e3sm_diags

E3SM Diagnostics package
https://e3sm-project.github.io/e3sm_diags
BSD 3-Clause "New" or "Revised" License

[Bug]: using zppy -c e3sm.cfg on LCRC #862

Closed keziming closed 1 month ago

keziming commented 1 month ago

What happened?

LCRC changed job submission from Slurm to PBS. Please help me! Thank you in advance!

When I submit a job that worked yesterday, I get the following error:

zppy -c ./post.v2.chemUCI.LR.amip_0101.1870-2014.cfg

Problem submitting script /lcrc/group/e3sm/ac.zke/E3SMv3_dev/20231110.uci-linoz3.1870-2014.09142022branch.t0.master.v2_like.F20TR.chrysalis/post/scripts/ts_atm_daily_180x360_aave_1870-1874-0005.bash
sbatch --export=ALL /lcrc/group/e3sm/ac.zke/E3SMv3_dev/20231110.uci-linoz3.1870-2014.09142022branch.t0.master.v2_like.F20TR.chrysalis/post/scripts/ts_atm_daily_180x360_aave_1870-1874-0005.bash
b'sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified\n'
Traceback (most recent call last):
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0_login/bin/zppy", line 10, in <module>
    sys.exit(main())
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0_login/lib/python3.10/site-packages/zppy/main.py", line 193, in main
    existing_bundles = ts(config, scriptDir, existing_bundles, job_ids_file)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0_login/lib/python3.10/site-packages/zppy/ts.py", line 113, in ts
    submitScript(scriptFile, statusFile, export, job_ids_file)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.10.0_login/lib/python3.10/site-packages/zppy/utils.py", line 242, in submitScript
    raise RuntimeError(error_str)
RuntimeError: Problem submitting script /lcrc/group/e3sm/ac.zke/E3SMv3_dev/20231110.uci-linoz3.1870-2014.09142022branch.t0.master.v2_like.F20TR.chrysalis/post/scripts/ts_atm_daily_180x360_aave_1870-1874-0005.bash

You can look at my original code at /home/ac.zke/E3SM_diag/post.v2.chemUCI.LR.amip_0101.1870-2014.cfg

What did you expect to happen? Are there any possible answers you came across?

No response

Minimal Complete Verifiable Example (MCVE)

No response

Relevant log output

No response

Anything else we need to know?

No response

Environment

e3sm_unified_1.10.0_login

chengzhuzhang commented 1 month ago

@keziming hey, are you using Chrysalis? I'm going through the LCRC support emails, and they only indicate that Improv, Bebop and Swing are switching to PBS. I don't think Chrysalis is one of the impacted machines.

xylar commented 1 month ago

@keziming, within the file:

/lcrc/group/e3sm/ac.zke/E3SMv3_dev/20231110.uci-linoz3.1870-2014.09142022branch.t0.master.v2_like.F20TR.chrysalis/post/scripts/ts_atm_daily_180x360_aave_1870-1874-0005.bash

I am seeing:

#!/bin/bash

# Running on anvil

#SBATCH  --job-name=ts_atm_daily_180x360_aave_1870-1874-0005
#SBATCH  --account=condo
#SBATCH  --nodes=1
#SBATCH  --output=/lcrc/group/e3sm/ac.zke/E3SMv3_dev/20231110.uci-linoz3.1870-2014.09142022branch.t0.master.v2_like.F20TR.chrysalis/post/scripts/ts_atm_daily_180x360_aave_1870-1874-0005.o%j
#SBATCH  --exclusive
#SBATCH  --time=0:10:00

#SBATCH  --partition=compute

Are you trying to run on Anvil? If so, these lines in your config file are not appropriate:

partition = compute
environment_commands = "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh"

They are for Chrysalis.
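
For anyone hitting the same symptom, the mismatch described above can be spotted mechanically. The following is only a minimal sketch, not part of zppy: the helper names are hypothetical, and it assumes the generated job script carries a "# Running on <machine>" comment and that the load script follows the load_latest_e3sm_unified_<machine>.sh naming seen in this thread.

```python
import re


def machine_from_job_script(script_text):
    """Return the machine named in a '# Running on <machine>' comment, if any."""
    match = re.search(r"^#\s*Running on\s+(\S+)", script_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def machine_from_load_script(config_text):
    """Return <machine> from a load_latest_e3sm_unified_<machine>.sh path, if any."""
    match = re.search(r"load_latest_e3sm_unified_(\w+)\.sh", config_text)
    return match.group(1) if match else None


# Excerpts copied from the job script and config lines quoted above.
job_script = """#!/bin/bash

# Running on anvil

#SBATCH  --account=condo
#SBATCH  --partition=compute
"""

config = '''partition = compute
environment_commands = "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh"
'''

script_machine = machine_from_job_script(job_script)
config_machine = machine_from_load_script(config)
if script_machine != config_machine:
    print(
        f"Mismatch: job script targets {script_machine}, "
        f"but the config loads the {config_machine} environment"
    )
```

With the excerpts above, the two helpers disagree (anvil versus chrysalis), which is the inconsistency that makes sbatch reject the account/partition combination.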

xylar commented 1 month ago

If you are trying to run on Chrysalis, the question is why the Anvil template is being used for the job script.

xylar commented 1 month ago

LCRC changed job submission from Slurm to PBS.

@chengzhuzhang is correct that this is not related. Neither Anvil nor Chrysalis has switched to PBS.

forsyth2 commented 1 month ago

Yes, it's a Slurm error you're encountering: "sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified". If it weren't using Slurm at all, I wouldn't expect to see this error.
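
The reasoning here is simply that the message is prefixed by the name of the submitting command. As a rough illustration (the function name is hypothetical and this is not how zppy handles the error), the prefix can be pulled out of the captured bytes like this:

```python
# Error bytes as they appear in the traceback above.
stderr = (
    b"sbatch: error: Batch job submission failed: "
    b"Invalid account or account/partition combination specified\n"
)


def submitting_command(stderr_bytes):
    """Return the command name that prefixes an error message, or None."""
    text = stderr_bytes.decode("utf-8", errors="replace").strip()
    command, separator, _ = text.partition(":")
    # Without a colon there is no recognizable "<command>: ..." prefix.
    return command if separator else None


print(submitting_command(stderr))  # sbatch
```

Since sbatch is the Slurm submission command, the message can only have come from a Slurm submission attempt.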

forsyth2 commented 1 month ago

It looks like this was run on Anvil.

$ grep -n machine /lcrc/group/e3sm/ac.zke/E3SMv3_dev/20231110.uci-linoz3.1870-2014.09142022branch.t0.master.v2_like.F20TR.chrysalis/post/scripts/ts_atm_daily_180x360_aave_1870-1874-0005.settings
25:  'machine': 'anvil',
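
The same lookup can be done from Python. This is a sketch under the assumption that the .settings file stores entries as quoted 'key': 'value' pairs, as the grep output suggests; find_setting is a hypothetical helper and the sample text (including the 'case' entry) is made up for illustration.

```python
import re


def find_setting(settings_text, key):
    """Return (line_number, value) for the first "'key': 'value'" entry, or None."""
    pattern = re.compile(r"'" + re.escape(key) + r"':\s*'([^']*)'")
    for line_number, line in enumerate(settings_text.splitlines(), start=1):
        match = pattern.search(line)
        if match:
            return line_number, match.group(1)
    return None


# Hypothetical excerpt in the style of the settings file shown above.
sample = "{\n  'case': 'example_case',\n  'machine': 'anvil',\n}"
print(find_setting(sample, "machine"))  # (3, 'anvil')
```

To use it on a real file, read the file contents first and pass them in as settings_text.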

chengzhuzhang commented 1 month ago

I did try running on Chrysalis with zppy -c /home/ac.zke/E3SM_diag/post.v2.chemUCI.LR.amip_0101.1870-2014.cfg

All jobs were submitted, but I saw errors in the ts_atm_daily_180x360_aave tasks.

keziming commented 1 month ago

@keziming hey, are you using Chrysalis? I'm going through the LCRC support emails, and they only indicate that Improv, Bebop and Swing are switching to PBS. I don't think Chrysalis is one of the impacted machines.

@chengzhuzhang I logged in to chrlogin1 on LCRC.

xylar commented 1 month ago

@keziming, is it possible that you accidentally sourced the E3SM-Unified load script for Anvil, not Chrysalis? That would identify the machine to zppy as being Anvil.
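
A quick pre-flight check along these lines could catch that before zppy is run. This is a hedged sketch only: the hostname-prefix mapping is an assumption based on the chrlogin1 login node mentioned in this thread, the Anvil load script name is assumed by analogy with the Chrysalis one, and the helper names are hypothetical.

```python
# Assumed mapping from login-node hostname prefix to machine name. Only
# "chrlogin" appears in this thread; add other machines as needed.
HOSTNAME_PREFIXES = {"chrlogin": "chrysalis"}


def machine_from_hostname(hostname):
    """Return the machine whose login-node prefix matches hostname, or None."""
    for prefix, machine in HOSTNAME_PREFIXES.items():
        if hostname.startswith(prefix):
            return machine
    return None


def check_load_script(hostname, load_script_path):
    """Return a warning if the load script does not name the host's machine."""
    expected = machine_from_hostname(hostname)
    if expected is None:
        return None  # Unknown host: nothing to compare against.
    if expected not in load_script_path:
        return (
            f"{hostname} looks like {expected}, "
            f"but {load_script_path} does not mention {expected}"
        )
    return None


# Assumed Anvil script name, by analogy with the Chrysalis one in the config.
print(check_load_script("chrlogin1", "load_latest_e3sm_unified_anvil.sh"))
```

On a real login node, socket.gethostname() from the standard library could supply the hostname argument.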

keziming commented 1 month ago

It looks like this was run on Anvil.

$ grep -n machine /lcrc/group/e3sm/ac.zke/E3SMv3_dev/20231110.uci-linoz3.1870-2014.09142022branch.t0.master.v2_like.F20TR.chrysalis/post/scripts/ts_atm_daily_180x360_aave_1870-1874-0005.settings
25:  'machine': 'anvil',

@forsyth2 @xylar thanks for pointing that out. How should I set it to chrysalis? You can look at my *.cfg file.

keziming commented 1 month ago

@xylar @forsyth2 @chengzhuzhang I found my error. I sourced the wrong E3SM-Unified load script before running zppy. Now it works. Thanks a lot!

xylar commented 1 month ago

Okay if we close this?

keziming commented 1 month ago

Yes. Thank you all!