oyvindseland opened 4 months ago
@oyvindseland - can you summarize the exact version of the diagnostic package you were using. I plan to contact the NCAR folks to see how they handle this.
I did not create the set-up, but as far as I can see it is Script Version: 140804. I checked the svn site and it looks like the most recent svn release: https://svn-ccsm-release.cgd.ucar.edu/model_diagnostics/atm/cam/ revision 231.
Did you run a simulation with original SE output as well? @mvertens
Checked the version on Nird and it is the same as on Betzy.
@oyvindseland - I have not run a simulation with just SE output yet. We are still moving and everything is totally chaotic today. I'll start one tomorrow.
No worries, I am not sitting around waiting for it.
@gold2718 - could you please help with this as well?
Information about diagnostics can be found at https://noresm-docs.readthedocs.io/en/noresm2/diagnostics/diagnostics.html
On betzy the command is /cluster/shared/noresm/diagnostics/noresm/bin/diag_srun
The default amwg script is at /cluster/shared/noresm/diagnostics/noresm/packages/CAM_DIAG. Before it actually runs, the scripts are by default copied to /cluster/work/users/$user/diagnostics/out/CAM_DIAG/config/$CASENAME/run_scripts. The path can be changed by the script, and it can also create the scripts without running them.
@oyvindseland - @gold2718 has forked the repository and I have downloaded it to /cluster/shared/noresm/diagnostics/noresm_dev on betzy. I would like first to reproduce your error. What was your command to diag_srun that resulted in this failure?
Command that failed
/cluster/shared/noresm/diagnostics/noresm/bin/diag_srun -m cam -i /cluster/work/users/mvertens/archive -c NB1850proto01 -s 2 -e 11
So I changed the variable name from w to gw in all the cam history files. Now it is dying with the following error:

nco_err_exit(): ERROR Short NCO-generated message (usually name of function that triggered error): nco_get_var1()
nco_err_exit(): ERROR Error code is 12. Translation into English with nc_strerror(12) is "Cannot allocate memory"
ERROR: nco_get_var1() failed to nc_get_var1() variable "time_bnds"
nco_err_exit(): ERROR NCO will now exit with system call exit(EXIT_FAILURE)

(Each of these lines appeared twice in the log.) I believe that the version of the CAM diagnostic package we are using is no longer compatible with the CAM history output for the development code.
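For the record, the renaming described above can be done with NCO's ncrename. A dry-run sketch (the glob pattern is an assumption about the local file layout; the echo prints each command instead of executing it):

```shell
# Dry-run sketch: rename the latitude-weights variable from "w" to "gw"
# in every CAM history file matched by the glob. The pattern below is
# illustrative; point it at the real archive directory. Remove the
# leading "echo" to actually run ncrename (requires NCO to be loaded).
for f in ./*.cam.h0*.nc; do
  [ -e "$f" ] || continue          # skip when the glob matches nothing
  echo ncrename -v w,gw "$f"
done
```

Dropping the echo applies the rename in place, so it is worth testing on a copy of one file first.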
So I scrubbed everything and tried again, and got totally different errors. See /cluster/work/users/mvertens/diagnostics/logs/-diagsrun-240213-194000.log. @oyvindseland - can you try running the script again and see if you get anything different?
I reran the script and also got an OOM error. I do not think I have seen an out-of-memory issue in the diagnostics before, so I do not understand why this happens. Do I just need to ask for more memory in the script? I should add, though, that I rarely use the script on Betzy; I mostly use it on Nird.
I copied years 2 and 3 of your output files, renamed w to gw, and ran the amwg script without the wrapper.
In this case the script runs but produces only relatively limited output. The output claims that the variable hyam is missing:

fatal:["Execute.c":6394]:variable (hyam) is not in file (inptr)

I also did the same for your set-up and got the same result: some plots and the "hyam error".
Plots: https://ns2345k.web.sigma2.no/diagnostics/noresm/oyvinds/NB1850proto01/
For comparison, 20 years of CMIP6 piControl: https://ns2345k.web.sigma2.no/diagnostics/noresm/oyvinds/N1850frc2_f09_tn14_20191001/CAM_DIAG/
@oyvindseland - I think the problem is that on betzy the wrapper is submitted to the preproc queue, which is a shared-memory batch node. Depending on who else is using it, the memory available will be limited. I think this explains why the OOM appeared in different places each time the wrapper was submitted on betzy. When you just run the script itself interactively, you are using the shared memory of the login node. I think running on Nird is probably better. BTW - I changed the variable from w to gw in all of the files. The fact that the variable is denoted as missing, i.e. is not on the input file, is problematic. @gold2718 - where are the latest version(s) of the CAM diagnostic packages? Is anything available on github at this point?
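If the wrapper has to go through the batch system anyway, one possible workaround (an assumption about the generated run_scripts, not something verified against diag_srun) would be to request dedicated memory in the Slurm header of the generated batch script:

```shell
# Illustrative Slurm directives only; the real header lives in the generated
# run_scripts, and the value below is a guess, not tested on betzy.
#SBATCH --mem=64G    # request dedicated memory instead of sharing the node
```

This would only help if the queue actually allows larger memory requests; otherwise running on Nird, as suggested above, remains the simpler option.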
On nird the script runs without OOM but the hyam problem is still the same.
A test with native-grid output created the same plots as the coupled simulation. The vertical-level definitions, hyam and hybm, are still missing from the averaged files. https://ns2345k.web.sigma2.no/diagnostics/noresm/oyvinds/NF2000proto01/
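As a possible workaround (untested here; all file names are illustrative), NCO's ncks in append mode (-A) can copy the hybrid coefficients from a raw history file into each averaged file. A dry-run sketch:

```shell
# Dry-run sketch: append hyam, hybm and the reference pressure P0 from one
# raw history file into each climo file that lacks them. The file names
# below are assumptions; remove the leading "echo" to actually run ncks.
hist=NB1850proto01.cam.h0.0002-01.nc      # any raw history file (assumed name)
for climo in ./*climo*.nc; do
  [ -e "$climo" ] || continue             # skip when the glob matches nothing
  echo ncks -A -v hyam,hybm,P0 "$hist" "$climo"
done
```

If the averaging step drops these variables, patching the climo files this way might at least let the vertical-interpolation plots run while the package itself is fixed.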
The interpolation of SE onto a lat-lon grid in the diagnostics fails; compare e.g. https://ns2345k.web.sigma2.no/diagnostics/noresm/oyvinds/NB1850proto01/yrs2to3-obs/set5_6/set5_ANN_LWCF_ERBE_obsc.png with https://ns2345k.web.sigma2.no/diagnostics/noresm/oyvinds/NF2000proto01/yrs1to1-obs/set5_6/set5_ANN_LWCF_ERBE_obsc.png
I looked around on the amwg website and found some diagnostics plots with SE and 48 levels, so it should be possible if we need to use the ncl diagnostics. The simulations are relatively old (2021): https://webext.cgd.ucar.edu/FWscHIST/f.e21.FWscHIST_BGC.ne30_ne30_mg17_L48_revert-J.001/atm/
The table linked in the simulations did not say who created the plots or who ran the simulations: https://docs.google.com/spreadsheets/d/1nSTQ9tscsqeLhy3fhytW_ko1wLjydYqa5ZRGThLP2K8/edit#gid=1338712341
Issue Type
Other (please describe below)
Issue Description
I have tried to run cam-diagnostics on the simulation found on Betzy: /cluster/work/users/mvertens/archive/NB1850proto01. The simulation uses the SE dycore with output regridded to the FV 0.9x1.25 degree grid.
The diagnostics run fails with the error message:

(0) unstructured_to_ESMF: latitude and longitude must have the same number of elements

See /cluster/work/users/oyvinds/diagnostics/out/CAM_DIAG/config/NB1850proto01/logs/out_240208_153621.log. Prior to the failure, the averaged files are given an SE name, e.g. /cluster/work/users/oyvinds/diagnostics/out/CAM_DIAG/climo/NB1850proto01/sav_se/NB1850proto01_01_000201_001101_climo_SE.nc. Another possible point of failure is that the variable name used for the latitude weights is w, not gw, which used to be the standard in FV.
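To check which weights variable a given history file actually carries, dumping the header is enough. A dry-run sketch (the file name is an assumption; drop the echo to run the command on a real file):

```shell
# Illustrative: print the header-inspection command for the weights variable.
# "echo" makes this a dry run; on a real file, drop the echo so that
# ncdump -h (header only) runs and grep -w reports whether "w" or "gw" exists.
f=NB1850proto01.cam.h0.0002-01.nc
echo "ncdump -h $f | grep -Ew 'gw|w'"
```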
Possible test: an SE simulation of 14 months or more, to see if the diagnostics tool can manage SE grid output.
Will this change answers?
No
Will you be implementing this yourself?
No