NCAR / MPAS-Workflow

Scripts for controlling DA workflows with MPAS-Model and mpas-bundle
Apache License 2.0

Need to make all 9 tests in test1.yaml work on derecho #278

Closed liujake closed 7 months ago

liujake commented 8 months ago

I tried to run the workflow for the standard tests on derecho, but they all failed for me. Here is what I did:

  1. I cloned the workflow and am on the develop branch of MPAS-Workflow.
  2. I ran 'source env-setup/machine.csh', as I use tcsh. This puts me in Jim's Cylc-8 python env.
  3. I ran './test.csh'. It produced a lot of warning and error messages during submission (below), but these should be OK according to Tao.
    
    Directory: /glade/work/liuz/pandac2024/MPAS-Workflow
    setenv: Variable name must contain alphanumeric characters.
    ./submit.csh (INFO): Generating the scenario-specific MPAS-Workflow directory
    ./submit.csh cd /glade/derecho/scratch/liuz/pandac/liuz_ForecastFromGFSAnalysesMPT_TEST/MPAS-Workflow
    Directory: /glade/derecho/scratch/liuz/pandac/liuz_ForecastFromGFSAnalysesMPT_TEST/MPAS-Workflow
    ./submit.csh conda activate /glade/work/jwittig/conda-envs/my-cylc8.2

    CondaError: Run 'conda init' before 'conda activate'

    ./submit.csh cylc version: 8.2.2
    ./submit.csh (INFO): checking if a suite with the same name is already running
    ./submit.csh (INFO): confirmed that a cylc suite named liuz_ForecastFromGFSAnalysesMPT_TEST is not already running
    ./submit.csh (INFO): starting a new suite...
    ./submit.csh cylcWorkDir /glade/derecho/scratch/liuz/cylc-run
    ./submit.csh SuiteName liuz_ForecastFromGFSAnalysesMPT_TEST
    ./submit.csh mainScriptDir /glade/derecho/scratch/liuz/pandac/liuz_ForecastFromGFSAnalysesMPT_TEST/MPAS-Workflow
    ./submit.csh cylc install --run-name=liuz_ForecastFromGFSAnalysesMPT_TEST
    WorkflowFilesError: Failed to install from /glade/derecho/scratch/liuz/pandac/liuz_ForecastFromGFSAnalysesMPT_TEST/MPAS-Workflow: previous installations were from /glade/derecho/scratch/liuz/pandac/liuz_3dvar_OIE120km_WarmStart_TEST/MPAS-Workflow
    ./submit.csh cylc validate MPAS-Workflow/liuz_ForecastFromGFSAnalysesMPT_TEST
    WARNING - deprecated items were automatically upgraded in "workflow definition"
    WARNING - (8.0.0) [visualization] - DELETED (OBSOLETE)
    WARNING - (8.0.0) [scheduling]max active cycle points -> [scheduling]runahead limit - "4" -> "P3"
    WARNING - * (8.0.0) [runtime][Clean][job]execution retry delays -> [runtime][Clean]execution retry delays - value unchanged


  4. One problem I see with the cold-start test cases is the obs2ioda.x issue noted in https://github.com/NCAR/MPAS-Workflow/issues/261, but I do not know why all the other tests failed or stopped, judging from the cylc log files in /glade/u/home/liuz/cylc-run/MPAS-Workflow. Note that I did not change anything in the workflow; I am just using the current path settings.

Has anyone recently tried the workflow with test.csh? Am I missing something in the steps I listed?
We'd better make all standard tests work on derecho with the 2.1 code.

junmeiban commented 8 months ago

Hi Jake, would you please first copy the .cylc/flow/global.cylc from my home dir to your global.cylc?

liujake commented 8 months ago

Thank you @junmeiban. It seems to be running after I copied your global.cylc file, shown below.

[platforms]
    # The localhost platform is available by default
    # [[localhost]]
    #     hosts = localhost
    #     install target = localhost
    [[pbs_cluster]]
        hosts = localhost
        job runner = pbs
        install target = localhost
[install]
    [[symlink dirs]]
        [[[localhost]]]
            run = /glade/derecho/scratch/$USER/

My previous global.cylc was copied from Tao and was missing the symlink part, so cylc-run was under 'home' instead of 'scratch' space. But it was consistent with the online README of the workflow.

@jim-p-w Can you update the online README for the global.cylc file? The README should also mention the step to 'source machine.csh/sh'. The old README had a step to 'source cheyenne.csh', but it is missing from the current README.

liujake commented 8 months ago

Does anyone have a newly-compiled obs2ioda.x somewhere on derecho to fix the issue https://github.com/NCAR/MPAS-Workflow/issues/261?

mos3r3n commented 8 months ago

Please find my obs2ioda-v2.x under /glade/work/taosun/Derecho/MPAS/Obs2IODA/src

If you want to build your own executable, you can refer to Makefile for building with gnu and Makefile.intel for building with intel. One thing I want to emphasize is that I have modified the code so that the output IODA file does not contain observation sites that have bad QC flags.

mos3r3n commented 8 months ago

In cylc 8, we can use cylc tui $workflow-id to check the status of a running suite.

liujake commented 8 months ago

@mos3r3n We should use the standard version of obs2ioda instead of your modified version. @zhumingying may have a build of obs2ioda, as she recently ran some obs conversion.

jim-p-w commented 8 months ago

@liujake Regarding the README file in the develop branch, I think the steps are correct.

It is no longer necessary to source env-setup/machine.csh or env-setup/machine.sh.

The symlink section of the $HOME/.cylc/flow/global.cylc is optional. If you don't have that section, data files will be created in $HOME, but the workflows should run (assuming there's room in $HOME).

Note the cylc configuration file in $HOME/.cylc now needs to be in a subdirectory called flow, and is now named global.cylc, so the config file has changed from $HOME/.cylc/global.rc to $HOME/.cylc/flow/global.cylc
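
To see which of these situations applies to a given account, a small check along the following lines may help. This is only a sketch: the function name `describe_global_config` is made up for illustration, and it simply looks for the two file locations mentioned above and for an uncommented `[[symlink dirs]]` heading. It does not parse the Cylc configuration.

```python
from pathlib import Path


def describe_global_config(home: Path) -> dict:
    """Report which Cylc global config files exist under `home`.

    Hypothetical helper: it only checks file locations and looks for an
    uncommented `[[symlink dirs]]` heading; it does not validate the file.
    """
    legacy = home / ".cylc" / "global.rc"              # old location
    current = home / ".cylc" / "flow" / "global.cylc"  # Cylc 8 location
    text = current.read_text() if current.is_file() else ""
    has_symlink = any(
        line.strip().startswith("[[symlink dirs]]") for line in text.splitlines()
    )
    return {
        "legacy_exists": legacy.is_file(),
        "current_exists": current.is_file(),
        "has_symlink_dirs": has_symlink,
    }


if __name__ == "__main__":
    print(describe_global_config(Path.home()))
```

If `current_exists` is false while `legacy_exists` is true, the config is still in the old location and Cylc 8 will not read it.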

liujake commented 8 months ago

@jim-p-w Yes, I have global.cylc under $HOME/.cylc/flow. But without the 'optional' symlink section, my test.csh failed. With this section, it worked OK, with apparently 6 of 9 successful. The 2 failed ColdStart tests below are due to the old obs2ioda.x being used:

3dvar_O30kmIE60km_ColdStart_TEST
liuz_3dvar_OIE120km_ColdStart_TEST

Another failing one is

3denvar_O30kmIE60km_WarmStart_TEST

in the DA step, but without an error message in

/glade/derecho/scratch/liuz/pandac/liuz_3denvar_O30kmIE60km_WarmStart_TEST/CyclingDA/2018041500/run/mem001/jedi.log

So you can run test.csh successfully (e.g., 6 of 9) without the symlink section in your global.cylc?

liujake commented 8 months ago

@jim-p-w 'source env-setup' activates your python environment with the cylc-8 installation. I do not understand why it is optional. Do you mean we can use the standard CISL python env or spack-stack's python env? If so, we still need to explain in the README what python/cylc-8 environment to load, and how, before running test.csh, right?

jim-p-w commented 8 months ago

@liujake The scripts and python code set up the environment. It is loading the cylc 8 environment installed in /glade/work/jwittig/conda-envs/my-cylc8.2

I am running a bash shell, so I will need to retest with tcsh. That won't happen till tomorrow (it can take a day for a new shell to get set up on derecho using SAM). I ran test.csh using tcsh, and all the workflows ran (I am seeing two workflows fail in the Variational1 step, regardless of environment).

I am running without the symlink section in ~/.cylc/flow/global.cylc, and the workflows are running. Results will also take a while, as all of my jobs submitted to compute nodes are stuck in the queue.

zhumingying commented 8 months ago

I have not compiled obs2ioda-v2 on Derecho. I processed observations on Casper using the old executable obs2ioda-v2.x, which was compiled on Cheyenne and worked well on Casper.

ibanos90 commented 8 months ago

Hi @liujake, my global.cylc (in /glade/u/home/ivette/.cylc/flow) does not contain the symlink section and the experiments are running successfully:

    # The localhost platform is available by default
    # [[localhost]]
    #     hosts = localhost
    #     install target = localhost
    [[pbs_cluster]]
        hosts = localhost
        job runner = pbs
        install target = localhost

One thing I should point out is that I added the lines below to my .bashrc, which seems to be enough to be able to submit the suites, given that it loads the right python environment. We may need to revisit this in the Readme and the machine.sh/machine.csh script.

export CYLC_ENV=/glade/work/jwittig/conda-envs/my-cylc8.2
source /etc/profile.d/z00_modules.sh
module purge
module load ncarenv/23.09
module load conda/latest

conda activate $CYLC_ENV
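
A quick way to confirm that a shell setup like the one above really put cylc on the PATH before submitting anything is sketched below. The helper name `cylc_version` is an assumption made for illustration; the only external command it runs is `cylc version`, the same one whose output appears in the submit.csh log earlier in this thread.

```python
import shutil
import subprocess
from typing import Optional


def cylc_version(executable: str = "cylc") -> Optional[str]:
    """Return the version string printed by `<executable> version`.

    Returns None when the executable is not on PATH, which usually means
    the conda environment was not activated in this shell.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = cylc_version()
    print("cylc not found on PATH" if version is None else f"cylc {version}")
```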

I ran the test 3denvar_O30kmIE60km_WarmStart_TEST and it seems to fail because of memory issues. In the scenario yaml (test/testinput/3denvar_O30kmIE60km_WarmStart.yaml) we specify memory: 45GB, which overrides the default value (235GB) in scenarios/defaults/variational.yaml. I ran another test after removing that part from the test scenario yaml file, and it completed successfully. We should open a PR to fix this.
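
The override behaviour described here can be illustrated with a generic recursive merge. This is a sketch only, not the workflow's actual configuration code: the `merge` function and the key layout (`variational` / `job` / `memory`) are assumptions chosen to mirror the description, while the 45GB and 235GB values come from the comment above.

```python
import copy


def merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)  # descend into sections
        else:
            merged[key] = copy.deepcopy(value)       # scenario value wins
    return merged


# Hypothetical key layout; only the memory values are taken from the thread.
defaults = {"variational": {"job": {"memory": "235GB"}}}
scenario_with_memory = {"variational": {"job": {"memory": "45GB"}}}
scenario_without_memory = {"variational": {"job": {}}}

print(merge(defaults, scenario_with_memory)["variational"]["job"]["memory"])     # 45GB
print(merge(defaults, scenario_without_memory)["variational"]["job"]["memory"])  # 235GB
```

Dropping the `memory` key from the scenario is what lets the default come through, which matches the test result reported above.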

Regarding the ColdStart tasks, I made a new compilation on Cheyenne (to solve the issue https://github.com/NCAR/MPAS-Workflow/issues/261) that seems to be working on Derecho. I just ran 3dvar_OIE120km_ColdStart_TEST and it worked correctly using the obs2ioda executable in /glade/campaign/mmm/parc/ivette/pandac/fork_obs2ioda/obs2ioda/obs2ioda-v2/src. You can take a look at my results in /glade/derecho/scratch/ivette/pandac/ivette_3dvar_OIE120km_ColdStart_TEST_2/Observations/2022020106.

liujake commented 8 months ago

@ibanos90 Thanks for testing and fixing the issues. Feel free to make a PR with your fixes.

liujake commented 8 months ago

And it is a mystery to me whether we need the symlink section in global.cylc.

ibanos90 commented 8 months ago

> And it is a mystery to me whether we need the symlink section in global.cylc.

Yeah, I don't understand it either. I think it should be fine as long as we have space in the home directory.

zhumingying commented 8 months ago

Yes, I tried to run the O30kmIE60km experiment with CrIS. It failed in the forecast step with the default setting of 1x128 or 2x128 processors, so we need to set:

    forecast:
      job:
        30km:
          nodes: 1
          PEPerNode: 128
          memory: 235GB

    Job ID   User     Queue  Nodes  NCPUs  NGPUs  Finish Time  Req Mem  Used Mem(GB)  Avg CPU (%)  Elapsed (h)  Job Name
    2855797  zhuming  cpu    1      128    0      01-19T12:02  45.0     45.0          26.6         0.01         Forecast1.20180424T1200Z.MPAS-Worflow
    2885135  zhuming  cpu    2      256    0      01-22T16:56  90.0     90.0          43.5         0.01         Forecast1.20180425T0000Z.MPAS-Workflow-zhuming_3dhybrid-60-60-iter_O30kmI60km_VarBC_bnd_iasi_cris
    2881537  zhuming  cpu    1      128    0      01-22T11:49  235.0    133.9         100.0        0.28         Forecast1.20180424T1800Z.MPAS-Workflow-zhuming_3dhybrid-60-60-iter_O30kmI60km_VarBC_bnd_iasi_cris
    2887419  zhuming  cpu    2      256    0      01-22T21:26  470.0    140.4         100.0        0.10         Forecast1.20180425T0600Z.MPAS-Workflow-zhuming_3dhybrid-60-60-iter_O30kmI60km_VarBC_bnd_iasi_cris
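
One way to read the table above is to flag the jobs whose used memory reached the requested amount, since those are the ones most likely stopped by the memory limit. The sketch below does that with the four rows copied from the table. The function name `memory_capped` and the 1% tolerance are assumptions made for illustration.

```python
# (job_id, requested_GB, used_GB, avg_cpu_percent), copied from the table above
jobs = [
    (2855797, 45.0, 45.0, 26.6),
    (2885135, 90.0, 90.0, 43.5),
    (2881537, 235.0, 133.9, 100.0),
    (2887419, 470.0, 140.4, 100.0),
]


def memory_capped(rows, tolerance=0.01):
    """Return IDs of jobs whose used memory is within `tolerance` of the request."""
    return [
        job_id
        for job_id, requested, used, _cpu in rows
        if used >= requested * (1.0 - tolerance)
    ]


print(memory_capped(jobs))  # [2855797, 2885135]
```

The two flagged jobs are the ones that requested 45GB per node, and they also show low average CPU use and a very short elapsed time. The runs that requested 235GB per node peaked at about 134 to 140GB.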

Zhuming

liujake commented 7 months ago

@ibanos90 When I do a grep, I see other places with 45GB. You may want to fix them all at once.

initialize/applications/EnKF.py:      'memory': {'def': '45GB', 't': str},
initialize/applications/Forecast.py:      'memory': {'def': '45GB', 'typ': str},
initialize/applications/HofX.py:      'memory': {'def': '45GB', 'typ': str},
initialize/applications/RTPP.py:        'memory': {'def': '45GB', 'typ': str},
initialize/applications/Variational.py:      'memory': {'def': '45GB', 'typ': str},
initialize/config/Task.py:  maxMemPerNode = "45GB"
initialize/post/VerifyModel.py:      'memory': {'def': '45GB', 'typ': str},
initialize/post/VerifyObs.py:      'memory': {'def': '45GB', 'typ': str},
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/enkf.yaml:          # 16 x 8 PE x 2 omp : 90.5 min., 480 GB, 45GB/node
scenarios/defaults/enkf.yaml:          # 16 x 8 PE x 4 omp : 92.5 min., 480 GB, 45GB/node
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/enkf.yaml:          memory: 45GB
scenarios/defaults/rtpp.yaml:      memory: 45GB
scenarios/defaults/variational.yaml:          memory: 45GB
scenarios/defaults/variational.yaml:          ##memory: 45GB
test/testinput/3denvar_O30kmIE60km_WarmStart.yaml:          memory: 45GB
test/testinput/3dvar_O30kmIE60km_ColdStart.yaml:          memory: 45GB

liujake commented 7 months ago

And I do not know whether specifying memory is still necessary on derecho. Remember that we specified it as 45GB or 109GB on cheyenne because that requests the regular nodes or the 'large-memory' nodes for a job. But all of Derecho's nodes have the same 235GB memory. Perhaps we should simply comment out or remove those lines.
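
Before commenting out or removing those lines, it may help to list every occurrence with its line number. The sketch below does roughly what the grep above did, restricted to .py and .yaml files. The function name `find_literal` is made up for illustration, and it only reports matches; it does not modify any file.

```python
from pathlib import Path


def find_literal(root: Path, needle: str = "45GB", suffixes=(".py", ".yaml")):
    """Return (relative_path, line_number, stripped_line) for each match."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not (path.is_file() and path.suffix in suffixes):
            continue
        lines = path.read_text(errors="replace").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Run from the top of a checkout of the workflow.
    for rel_path, number, line in find_literal(Path(".")):
        print(f"{rel_path}:{number}: {line}")
```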

ibanos90 commented 7 months ago

Hi @liujake, I just ran the ColdStart tests and they are completing successfully for me. I asked @junmeiban to run one of the ColdStart tests, and the Observations generation worked without issues, using the obs2ioda.x executable that is already specified in initialize/framework/Build.py. However, just to make sure, would you try running one of the ColdStart tests one more time using the current develop branch?

liujake commented 7 months ago

Ok. Then I am closing this issue.

byoung-joo commented 7 months ago

Just double checked that all three 3dvar tests (out of 9 tests) pass. They are 3dvar_OIE120km_WarmStart, 3dvar_OIE120km_ColdStart, and 3dvar_O30kmIE60km_ColdStart.

ibanos90 commented 7 months ago

> Just double checked that all three 3dvar tests (out of 9 tests) pass. They are 3dvar_OIE120km_WarmStart, 3dvar_OIE120km_ColdStart, and 3dvar_O30kmIE60km_ColdStart.

Great, thanks for letting us know!