liujake closed this issue 7 months ago.
Hi Jake, could you first copy my .cylc/flow/global.cylc (in my home directory) over your global.cylc?
Thank you @junmeiban. It seems to run after copying your global.cylc file below.
```
[platforms]
    # The localhost platform is available by default
    # [[localhost]]
    #     hosts = localhost
    #     install target = localhost
    [[pbs_cluster]]
        hosts = localhost
        job runner = pbs
        install target = localhost
[install]
    [[symlink dirs]]
        [[[localhost]]]
            run = /glade/derecho/scratch/$USER/
```
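A quick way to confirm the symlink section took effect is to check what `~/cylc-run` points at. This is a hedged sketch; `check_cylc_run_symlink` is a hypothetical helper name, not part of the workflow:

```shell
# Hypothetical helper: report whether a cylc-run directory is a symlink
# (as created by the [install][symlink dirs] section) and where it points.
check_cylc_run_symlink() {
  dir="${1:-$HOME/cylc-run}"
  if [ -L "$dir" ]; then
    printf 'symlink -> %s\n' "$(readlink -f "$dir")"
  else
    printf 'not a symlink: %s\n' "$dir"
  fi
}

# Example: check_cylc_run_symlink "$HOME/cylc-run"
```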
My previous global.cylc, copied from Tao, was missing the symlink section, so cylc-run ended up under my home directory instead of scratch space. But it was consistent with the workflow's online README.
@jim-p-w Can you update the online README for the global.cylc file? The README should also mention the step to source machine.csh/machine.sh. The old README had a step to source cheyenne.csh; it is missing from the current one.
Does anyone have a newly compiled obs2ioda.x somewhere on Derecho to fix issue https://github.com/NCAR/MPAS-Workflow/issues/261?
Please find my obs2ioda-v2.x under /glade/work/taosun/Derecho/MPAS/Obs2IODA/src
If you want to build your own executable, you can refer to the Makefile for building with GNU and Makefile.intel for building with Intel. One thing I want to emphasize: I made some modifications to the code so that the output IODA file does not contain observation sites with bad QC flags.
In cylc 8, we can use `cylc tui <workflow-id>` to check the status of a running suite.
@mos3r3n We should use the standard version of obs2ioda instead of your modified version. @zhumingying may have a build of obs2ioda, as she recently ran some obs conversion.
@liujake Regarding the README file in the develop branch, I think the steps are correct.
It is no longer necessary to source env-setup/machine.csh or env-setup/machine.sh.
The symlink section of the $HOME/.cylc/flow/global.cylc is optional. If you don't have that section, data files will be created in $HOME, but the workflows should run (assuming there's room in $HOME).
Note the cylc configuration file in $HOME/.cylc now needs to be in a subdirectory called flow and is now named global.cylc, so the config file has changed from $HOME/.cylc/global.rc to $HOME/.cylc/flow/global.cylc.
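That rename can be scripted as a small migration step. This is a minimal sketch: `migrate_cylc_config` is a hypothetical helper name, and since the cylc 8 syntax also differs from cylc 7's global.rc, the copied file usually still needs hand-editing afterwards:

```shell
# Hypothetical helper: copy a cylc 7 global.rc into the cylc 8 location.
# The contents still need manual conversion to cylc 8 syntax afterwards.
migrate_cylc_config() {
  base="${1:-$HOME}"
  mkdir -p "$base/.cylc/flow"
  if [ -f "$base/.cylc/global.rc" ] && [ ! -f "$base/.cylc/flow/global.cylc" ]; then
    cp "$base/.cylc/global.rc" "$base/.cylc/flow/global.cylc"
  fi
}
```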
@jim-p-w Yes. I have global.cylc under $HOME/.cylc/flow. But without the 'optional' symlink section, my test.csh failed. With that section, it worked OK, with apparently 6 of 9 tests successful. The 2 failed ColdStart tests below are due to the old obs2ioda.x being used:
3dvar_O30kmIE60km_ColdStart_TEST
liuz_3dvar_OIE120km_ColdStart_TEST
Another failing test is 3denvar_O30kmIE60km_WarmStart_TEST, in the DA step, but there is no error message in /glade/derecho/scratch/liuz/pandac/liuz_3denvar_O30kmIE60km_WarmStart_TEST/CyclingDA/2018041500/run/mem001/jedi.log
So you can run test.csh successfully (e.g., 6 of 9) without the symlink section in your global.cylc?
@jim-p-w 'source env-setup' activates the python environment with the cylc-8 installation. I do not understand why it is optional. Do you mean we can use the standard CISL python env or spack-stack's python env? If so, the README still needs to explain what to load, and how, for those python/cylc-8 environments before running test.csh, right?
@liujake The scripts and python code set up the environment; they load the cylc 8 environment installed in /glade/work/jwittig/conda-envs/my-cylc8.2
I am running a bash shell, so I needed to retest under tcsh (it can take a day for a new shell to get set up on Derecho using SAM). I ran test.csh using tcsh and all the workflows ran, except two workflows that fail, regardless of environment, in the Variational1 step.
I am running without the symlink section in ~/.cylc/flow/global.cylc, and the workflows are running. Results will also take a while, since all of my jobs submitted to compute nodes are stuck in the queue.
I have not compiled obs2ioda-v2 on Derecho. I processed observations on Casper using the old executable obs2ioda-v2.x, which was compiled on Cheyenne and worked well on Casper.
Hi @liujake, my global.cylc (in /glade/u/home/ivette/.cylc/flow) does not contain the symlink section and the experiments are running successfully:
```
[platforms]
    # The localhost platform is available by default
    # [[localhost]]
    #     hosts = localhost
    #     install target = localhost
    [[pbs_cluster]]
        hosts = localhost
        job runner = pbs
        install target = localhost
```
One thing I should point out is that I added the lines below to my .bashrc, which seems to be enough to be able to submit the suites, given that it loads the right python environment. We may need to revisit this in the Readme and the machine.sh/machine.csh script.
```shell
export CYLC_ENV=/glade/work/jwittig/conda-envs/my-cylc8.2
source /etc/profile.d/z00_modules.sh
module purge
module load ncarenv/23.09
module load conda/latest
conda activate $CYLC_ENV
```
I ran the test 3denvar_O30kmIE60km_WarmStart_TEST and it seems to fail because of memory issues. In the scenario yaml (test/testinput/3denvar_O30kmIE60km_WarmStart.yaml) we specify memory: 45GB, which overrides the default value (235GB) in scenarios/defaults/variational.yaml. I ran another test removing that part from the test scenario yaml file and it completed successfully. We should open a PR to fix this.
Regarding the ColdStart tasks, I made a new compilation on Cheyenne (to solve issue https://github.com/NCAR/MPAS-Workflow/issues/261) that seems to be working on Derecho. I just ran 3dvar_OIE120km_ColdStart_TEST and it worked correctly using the obs2ioda executable in /glade/campaign/mmm/parc/ivette/pandac/fork_obs2ioda/obs2ioda/obs2ioda-v2/src. You can take a look at my results in /glade/derecho/scratch/ivette/pandac/ivette_3dvar_OIE120km_ColdStart_TEST_2/Observations/2022020106.
@ibanos90 Thanks for testing and fixing the issues. Feel free to make a PR with your fixes.
And it is a mystery to me whether we need the symlink section in global.cylc.
Yeah, I don't understand it either. I think it should be fine as long as we have space in the home directory.
Yes, I tried to run the O30kmIE60km experiment with CrIS, and the forecast failed with the default setting of 1x128 or 2x128 processors. We need to set:
```yaml
forecast:
  job:
    30km:
      nodes: 1
      PEPerNode: 128
      memory: 235GB
```
```
Job ID   User     Queue  Nodes  NCPUs  NGPUs  Finish Time  Req Mem  Used Mem(GB)  Avg CPU (%)  Elapsed (h)  Job Name
2855797  zhuming  cpu    1      128    0      01-19T12:02  45.0     45.0          26.6         0.01         Forecast1.20180424T1200Z.MPAS-Worflow
2885135  zhuming  cpu    2      256    0      01-22T16:56  90.0     90.0          43.5         0.01         Forecast1.20180425T0000Z.MPAS-Workflow-zhuming_3dhybrid-60-60-iter_O30kmI60km_VarBC_bnd_iasi_cris
2881537  zhuming  cpu    1      128    0      01-22T11:49  235.0    133.9         100.0        0.28         Forecast1.20180424T1800Z.MPAS-Workflow-zhuming_3dhybrid-60-60-iter_O30kmI60km_VarBC_bnd_iasi_cris
2887419  zhuming  cpu    2      256    0      01-22T21:26  470.0    140.4         100.0        0.10         Forecast1.20180425T0600Z.MPAS-Workflow-zhuming_3dhybrid-60-60-iter_O30kmI60km_VarBC_bnd_iasi_cris
```
Zhuming
@ibanos90 When I do a grep, I see other places with 45GB. You may want to fix them all at once.
```
initialize/applications/EnKF.py: 'memory': {'def': '45GB', 't': str},
initialize/applications/Forecast.py: 'memory': {'def': '45GB', 'typ': str},
initialize/applications/HofX.py: 'memory': {'def': '45GB', 'typ': str},
initialize/applications/RTPP.py: 'memory': {'def': '45GB', 'typ': str},
initialize/applications/Variational.py: 'memory': {'def': '45GB', 'typ': str},
initialize/config/Task.py: maxMemPerNode = "45GB"
initialize/post/VerifyModel.py: 'memory': {'def': '45GB', 'typ': str},
initialize/post/VerifyObs.py: 'memory': {'def': '45GB', 'typ': str},
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/enkf.yaml: # 16 x 8 PE x 2 omp : 90.5 min., 480 GB, 45GB/node
scenarios/defaults/enkf.yaml: # 16 x 8 PE x 4 omp : 92.5 min., 480 GB, 45GB/node
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/enkf.yaml: memory: 45GB
scenarios/defaults/rtpp.yaml: memory: 45GB
scenarios/defaults/variational.yaml: memory: 45GB
scenarios/defaults/variational.yaml: ##memory: 45GB
test/testinput/3denvar_O30kmIE60km_WarmStart.yaml: memory: 45GB
test/testinput/3dvar_O30kmIE60km_ColdStart.yaml: memory: 45GB
```
And I do not know whether specifying memory is still necessary on Derecho. Remember that we specified 45GB or 109GB on Cheyenne because that requested either the regular nodes or the 'large-memory' nodes for a job. But all of Derecho's nodes have the same 235GB of memory. Perhaps we should simply comment out or remove those lines.
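Auditing the hard-coded requests before removing them can be done in one pass. This is a sketch: `find_memory_requests` is a hypothetical helper name, and the literal '45GB' pattern will miss other sizes such as 109GB:

```shell
# Hypothetical helper: list every YAML/Python line under a source tree that
# hard-codes a 45GB memory request, for review before commenting out/removal.
find_memory_requests() {
  root="${1:-.}"
  grep -rn --include='*.yaml' --include='*.py' '45GB' "$root"
}
```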
Hi @liujake, I just ran the ColdStart tests and they are completing successfully for me. I asked @junmeiban to run one of the ColdStart tests, and the Observations generation worked without issues, using the obs2ioda.x executable that is already specified in initialize/framework/Build.py. However, just to make sure, would you try running one of the ColdStart tests one more time using the current develop branch?
Ok. Then I am closing this issue.
Just double checked that all three 3dvar tests (out of 9 tests) pass. They are 3dvar_OIE120km_WarmStart, 3dvar_OIE120km_ColdStart, and 3dvar_O30kmIE60km_ColdStart.
Great, thanks for letting us know!
I tried to run the workflow for the standard tests on derecho, but all failed for me! Here is what I did:
```
CondaError: Run 'conda init' before 'conda activate'
./submit.csh cylc version: 8.2.2
./submit.csh (INFO): checking if a suite with the same name is already running
./submit.csh (INFO): confirmed that a cylc suite named liuz_ForecastFromGFSAnalysesMPT_TEST is not already running
./submit.csh (INFO): starting a new suite...
./submit.csh cylcWorkDir /glade/derecho/scratch/liuz/cylc-run
./submit.csh SuiteName liuz_ForecastFromGFSAnalysesMPT_TEST
./submit.csh mainScriptDir /glade/derecho/scratch/liuz/pandac/liuz_ForecastFromGFSAnalysesMPT_TEST/MPAS-Workflow
./submit.csh cylc install --run-name=liuz_ForecastFromGFSAnalysesMPT_TEST
WorkflowFilesError: Failed to install from /glade/derecho/scratch/liuz/pandac/liuz_ForecastFromGFSAnalysesMPT_TEST/MPAS-Workflow: previous installations were from /glade/derecho/scratch/liuz/pandac/liuz_3dvar_OIE120km_WarmStart_TEST/MPAS-Workflow
./submit.csh cylc validate MPAS-Workflow/liuz_ForecastFromGFSAnalysesMPT_TEST
WARNING - deprecated items were automatically upgraded in "workflow definition"
WARNING - (8.0.0) [visualization] - DELETED (OBSOLETE)
WARNING - (8.0.0) [scheduling]max active cycle points -> [scheduling]runahead limit - "4" -> "P3"
WARNING - * (8.0.0) [runtime][Clean][job]execution retry delays -> [runtime][Clean]execution retry delays - value unchanged
```