benchcab not working with latest MAIN(?)

har917 commented 2 months ago

@ccarouge @SeanBryan51 @abhaasgoyal As of 17/7/2024 - I'm having difficulty getting benchcab to run (anything).

First issue - following the recent updates to check%ranges, the current default namelist (the one that is supposed to be used for regression testing) still has check%ranges = .false. when created via git clone, so the runs fail.

Using

realisations:
  - repo:
      git:
        branch: main
    patch:
      cable:
        check:
          ranges: 0
  - repo: 
      git: 
        branch: 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base
    patch:
      cable:
        check:
          ranges: 0

modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]

as the benchcab.yaml file appears to successfully create cable.nml files with the correct entries.

However benchcab then throws (in the qsub.sh.o*** file)

/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/bin/benchcab fluxsite-run-tasks --config=config.yaml
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/bin/benchcab", line 10, in <module>
    sys.exit(main())
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/main.py", line 42, in main
    parse_and_dispatch(parser)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/main.py", line 32, in parse_and_dispatch
    func(**args)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/benchcab.py", line 279, in fluxsite_run_tasks
    config = self._get_config(config_path)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/benchcab.py", line 136, in _get_config
    self._config = read_config(config_path)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/config.py", line 165, in read_config
    config = read_config_file(config_path)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/site-packages/benchcab/config.py", line 139, in read_config_file
    with Path.open(Path(config_path), "r", encoding="utf-8") as file:
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.04/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'

Interestingly the spatial runs appear to have completed successfully. I don't see a .yaml file in the runs/fluxsite directory which is consistent with the error message.
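
The last frames of the traceback show the bare relative path 'config.yaml' being opened, and Python resolves a relative path against the working directory of the process. Below is a minimal sketch of that behaviour. It is not benchcab's own code, and the helper names `resolve_config` and `describe_config` are hypothetical:

```python
from pathlib import Path
import tempfile


def resolve_config(config_path: str, cwd: str) -> Path:
    """Return the file that opening config_path from cwd would actually target."""
    path = Path(config_path)
    # Absolute paths are used as given; relative ones depend on the working directory.
    return path if path.is_absolute() else Path(cwd) / path


def describe_config(config_path: str, cwd: str) -> str:
    """Report whether the resolved config file exists (hypothetical helper)."""
    resolved = resolve_config(config_path, cwd)
    state = "found" if resolved.is_file() else "missing"
    return f"{state}: {resolved}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as other:
        # Stand-in for the directory that holds the user's config.yaml.
        (Path(root) / "config.yaml").write_text("realisations: []\n", encoding="utf-8")
        print(describe_config("config.yaml", root))
        print(describe_config("config.yaml", other))
```

A "missing" result for the directory the PBS job actually starts in would be consistent with the FileNotFoundError above, even when config.yaml exists in the directory where benchcab was launched.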

Any thoughts?

abhaasgoyal commented 2 months ago

@har917 not sure, but I think the issue is that the config file should be named config.yaml instead of benchcab.yaml (also, the namelist files have been updated in bench_example and cable so no patch is needed - this is why the namelist file contents could be correct).

har917 commented 2 months ago

Perhaps to add more detail.

This testing is based on a fresh git clone (as of today) - the cable.nml that is downloaded into the benchcab_example/namelists directory has the old check%ranges = .false. line.

In the above (and this wasn't clear - apologies) the file that I refer to as benchcab.yaml is the config.yaml file that the user edits in the benchcab_example root directory (I named it that because there are other config.yaml files created elsewhere in the structure)

Is there supposed to be a config.yaml file created in the benchcab_example/runs/fluxsite/ directory (equivalent to the .yaml files created in the spatial/crujraaccess* directories)?

abhaasgoyal commented 2 months ago

I see, regarding bench_example we still have to merge the (approved) PR https://github.com/CABLE-LSM/bench_example/pull/23 (~so it will be done soon~ edit: we still need to see how to manage namelist compatibility)

Now, the following set of commands seem to work for me

$ git clone git@github.com:CABLE-LSM/bench_example.git
$ cd bench_example
$ vim config.yaml
# Following lines go in this file
realisations:
  - repo:
      git:
        branch: main
    patch:
      cable:
        check:
          ranges: 0
  - repo: 
      git: 
        branch: 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base
    patch:
      cable:
        check:
          ranges: 0

modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]
$ benchcab run -v

By any chance, was benchcab run from another directory?

Also, there shouldn't be any config.yaml file in benchcab_example/runs/fluxsite

har917 commented 2 months ago

> By any chance, was benchcab run from another directory?

I don't think so though - there is a possibility that I ran it from one layer too high but I thought it would completely fail if I did that (I have a /benchcab directory on scratch into which git clone creates the /benchcab_example directory and think I ran from /benchcab_example). All this was run via a VS code terminal.

I didn't use the -v option - is that important?

abhaasgoyal commented 2 months ago

> I didn't use the -v option - is that important?

Not really, it is for verbose output (just to check whether there were any warnings/issues before submitting job)

Maybe it detected a config.yaml on top/environment path (little chance but just in case). It seems to work well for me, but maybe somebody else (@SeanBryan51 @ccarouge) can recreate this issue. Meanwhile @har917 maybe run the above set of commands from /scratch and if you could recheck that'd be great.

har917 commented 2 months ago

I've just completed a completely fresh run using the commands above - with the only thing different being that I used the VS code editor not vim (since I'm not a vim user).

It's failed in the same way - it has:

- compiled the two branches successfully,
- created the expected output directory structure,
- written namelists into the runs/fluxsite/tasks/ directory with the correct check%ranges entry,
- done something in the spatial section (it's created an N9.o** file in each cru_access/task directory),

but failed with the same error as above.

Likely contradicting my earlier thinking - I'm not sure it's done anything in the payu section, in that there's nothing in the work directory (only in the archive directory).

One thing I've just thought of - is there a project dependence somewhere in here? I've been running these tests from p66 - should I try from a different project (e.g. x45, rp23)?

har917 commented 2 months ago

@AlisonBennett Could you have a go at following the instructions (4^) from @abhaasgoyal above to see whether you can get this to run?

Just trying to figure out whether this is at my end or somewhere else.

ps. you'll get to see how quickly the updated compilation/build is - only takes a couple of minutes in contrast to 15+ with BLAZE_9814

AlisonBennett commented 2 months ago

@har917 yes - I have done this and it seems to have run (i.e. it built some stuff and then submitted a PBS job which took a while to run through, and now there is a bunch of extra stuff in some new directories). I'm not really sure what output to expect though, so perhaps it's best for you to have a look at scratch/x45/ab7412/benchcab_test to see if that is what it is meant to do.

There were a few errors before I got this far. To overcome these, I had to:

a) copy @abhaasgoyal's code for the .yaml file into my .yaml file (before I did that I got lots of errors very similar to your initial post). I think the yaml syntax is very fussy. You could try taking a copy of my .yaml file to see if that solves your problem?
b) follow instructions to load benchcab modules here (before I did that my environment didn't know about benchcab)
c) start a new arc session with adding access to both projects gdata/hh5 and gdata/ks32 (before I did that it said it didn't have access to the meteorology for one of the flux sites).

Hope this helps.

ccarouge commented 2 months ago

@har917 mind sharing the path where you are running from?

ccarouge commented 2 months ago

Actually @har917 what's the -l storage line in the qsub job and where do you run? Are you running from /g/data/p66 and it isn't in the -l storage line for example?

har917 commented 2 months ago

@ccarouge I've been running from /scratch/x45 but likely submitted the job under p66 (as that's my default project).

@AlisonBennett has successfully run the regression test (under x45) this morning.

I'm trying again (but ensuring that I'm under x45) - and this is certainly behaving differently (in that it's produced fluxsite outputs) however it hasn't produced a benchmark_cable_qsub.sh.o*** file even though the job has apparently finished (via qstat)

The -l storage line in both sets of runs is #PBS -l storage=gdata/ks32+gdata/hh5+gdata/wd9

Basically I think the problem is that I've been essentially asking a job under p66 to write to scratch under x45 and it's said no (understandable) - but the error message is a bit odd.

On further thought - what's likely happened is that benchcab tries to copy the config.yaml file from its root directory to somewhere else as part of the workflow (that fails because of the gadi permissions requirement), then benchcab tries to read the copy of the config.yaml file (which doesn't exist) and you get the error above.

Perhaps a note in the benchcab 'how to' about matching project with the PBS storage and/or matching project with calling point is needed

EDIT: it's now produced a .o*** file so all good.

ccarouge commented 2 months ago

@har917 When you run the job using p66, Gadi will automatically mount /scratch/p66 but not /scratch/x45. If you run using x45 resources, Gadi will mount /scratch/x45 (and not /scratch/p66). In config.yaml, it's possible to give additional projects to mount: https://benchcab.readthedocs.io/en/latest/user_guide/config_options/#+pbs.storage

You may want to add scratch/x45 so it works no matter what resources are used.

Edit: I'm assuming you run switchproj before running benchcab, since we haven't provided a way to run benchcab under a project other than the user's current project.
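
The mounting rule described in this comment can be written down as a small check. This is a sketch only: the function names are hypothetical, it assumes paths of the form /scratch/<project>/... and /g/data/<project>/..., and it encodes just the rule stated here (the job's own /scratch/<project> is mounted automatically, anything else has to appear in the storage list).

```python
from pathlib import PurePosixPath
from typing import Optional


def required_mount(path: str) -> Optional[str]:
    """Map a path to a storage token such as 'scratch/x45' or 'gdata/hh5'.

    Returns None for paths outside /scratch and /g/data (hypothetical helper).
    """
    parts = PurePosixPath(path).parts  # '/scratch/x45/u' -> ('/', 'scratch', 'x45', 'u')
    if len(parts) >= 3 and parts[1] == "scratch":
        return f"scratch/{parts[2]}"
    if len(parts) >= 4 and parts[1] == "g" and parts[2] == "data":
        return f"gdata/{parts[3]}"
    return None


def is_visible(path: str, job_project: str, storage: str) -> bool:
    """Check whether a job charged to job_project with this storage string sees path."""
    token = required_mount(path)
    if token is None:
        # Not a project filesystem; this sketch makes no claim and treats it as visible.
        return True
    mounted = {item for item in storage.split("+") if item}
    # Assumption taken from the comment above: scratch of the job's project is automatic.
    mounted.add(f"scratch/{job_project}")
    return token in mounted


if __name__ == "__main__":
    storage = "gdata/ks32+gdata/hh5+gdata/wd9"
    run_dir = "/scratch/x45/some_user/bench_example"  # illustrative path
    print(is_visible(run_dir, "p66", storage))
    print(is_visible(run_dir, "x45", storage))
    print(is_visible(run_dir, "p66", storage + "+scratch/x45"))
```

With the storage line quoted earlier in the thread, the check fails for a job charged to p66 that works in /scratch/x45 and passes once the job runs under x45 or scratch/x45 is added to the storage list.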
