FESOM / fesom2

Multi-resolution ocean general circulation model.
http://fesom.de/
GNU General Public License v3.0

FESOM2 standalone simulation from documentation #496

Closed fernandadialzira closed 2 months ago

fernandadialzira commented 1 year ago

I need to run a FESOM standalone simulation on a new mesh representing mid-Pliocene ocean conditions. This issue reports the errors I encountered while following the documentation, as discussed with @koldunovn, @pgierz, @patrickscholz, and Sesh (I don't know his username).

- Machine: levante
- Branch: refactoring
- Mesh: `/work/ab0246/a270179/runtime/awicm3-v3.1/input/fesom2/midpli/`
- Model directory: `/work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom-standalone/fesom2/`

First step: build model executable

After doing this, the mesh partitioning was easy to perform following documentation.

Second step: running the model

&clockinit              ! the model starts at
timenew=0.0
daynew=1
yearnew=1990
/

&paths
MeshPath='/work/ab0246/a270179/runtime/awicm3-v3.1/input/fesom2/midpli/'
ClimateDataPath='/pool/data/AWICM/FESOM2/INITIAL/phc3.0/'
ResultPath='/work/ab0246/a270179/runtime/awicm3-v3.1/experiments_testing/fesom-standalone/'
/

***Notes:*** The documentation should state more explicitly that a result path has to be defined in `namelist.config`, and what exactly the run length is.

- changes made to `/work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom-standalone/fesom2/work/job_levante`

#SBATCH --job-name=midpli_test1
#SBATCH -p compute
#SBATCH --ntasks-per-node=108
#SBATCH --ntasks=432
#SBATCH --time=00:30:00
#SBATCH -o slurm-out.out
#SBATCH -e slurm-err.out
#SBATCH -A ab0246

***Notes:*** None of the above-mentioned lines are described in the documentation, except the number of tasks, which depends on the distribution chosen. 

The job does not run, and the Slurm messages and other files end up in the work folder instead of the results folder. Is that OK?
Error message in `slurm-err.out`:

/var/spool/slurmd/job6868133/slurm_script: line 14: /work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom_standalone/fesom2/env/levante.dkrz.de/shell.intel: No such file or directory
217: fesom.x: error while loading shared libraries: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory

suvarchal commented 1 year ago

@fernandadialzira sorry for your troubles, and thanks for the feedback. The documentation was written mostly for the master branch, and a few features of the development branch are not documented yet, so what you report helps.

A quick comment on the run part: the missing `libmkl_intel_lp64.so.2` is related to Slurm not finding `/work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom_standalone/fesom2/env/levante.dkrz.de/shell.intel`. This is puzzling, as this file is also needed and used for compilation. Did you compile the model directly in your fesom2 directory using `./configure.sh`, and if so, did the configure script find the same shell file? Or did you compile the model using esm-tools?

Slurm logs are written by default to the directory from which you submit the job. If you want them in the results directory, an easy trick is to copy the Slurm batch script into the results directory and submit it from there.
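As an alternative to copying the batch script, Slurm's `-o`/`-e` options accept a full path, and `%j` in the file name is replaced by the job ID. The sketch below only prints the two header lines for a given directory; `slurm_log_directives` is a hypothetical helper invented for illustration and is not part of FESOM2 or of `job_levante`.

```shell
# Sketch: print Slurm header lines that send job logs to a chosen directory.
# "slurm_log_directives" is a made-up helper name, not part of FESOM2.
slurm_log_directives() (
    result_dir="${1:?usage: slurm_log_directives RESULT_DIR}"
    # Slurm does not create missing directories for its log files,
    # so make sure the target exists before the job is submitted.
    mkdir -p "$result_dir"
    # "%%" prints a literal "%"; Slurm later replaces %j with the job ID.
    printf '#SBATCH -o %s/slurm-%%j.out\n' "${result_dir%/}"
    printf '#SBATCH -e %s/slurm-%%j.err\n' "${result_dir%/}"
)
```

The printed lines would replace the `#SBATCH -o slurm-out.out` and `#SBATCH -e slurm-err.out` lines in the job script; the script itself can then stay in the work folder.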

fernandadialzira commented 1 year ago

Hi @suvarchal!

I compiled the model directly in the fesom2 directory using `bash -l ./configure.sh`, as in the documentation, and it worked fine once I was on the refactoring branch. Do you also have any clue about the other questions in the issue?

Thank you for the advice on the slurm logs!

patrickscholz commented 1 year ago

Hi @fernandadialzira & @mandresm ... so you use ESM-Tools, right? There is a typo in the path name `/work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom_standalone/fesom2/env/levante.dkrz.de/shell.intel`; your real path to that file seems to be `/work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom-standalone/fesom2/env/levante.dkrz.de/shell.intel`. That's why it can't be found. In that case I guess the problem is somewhere in ESM-Tools, when it builds its directory tree!
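A mismatch like this (`fesom_standalone` versus `fesom-standalone`) can be narrowed down by walking up the missing path until a directory that does exist is reached; listing that directory then shows the correctly spelled component. This is a minimal sketch, and `first_existing_parent` is a hypothetical helper name, not something shipped with FESOM2.

```shell
# Sketch: print the deepest part of PATH that actually exists.
# If PATH itself exists, it is printed unchanged.
first_existing_parent() (
    path="${1:?usage: first_existing_parent PATH}"
    # dirname eventually yields "/" or ".", both of which exist,
    # so the loop always terminates.
    while [ ! -e "$path" ]; do
        path="$(dirname "$path")"
    done
    printf '%s\n' "$path"
)
```

For the path in the error message, running `ls "$(first_existing_parent <path>)"` would list the contents of the last directory that exists, which is where the misspelled name should stand out.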

mandresm commented 1 year ago

I don't think she is using ESM-Tools, otherwise she wouldn't be using the job_levante script

fernandadialzira commented 1 year ago

Right, I am not using esm-tools. But it is good that I am only stuck because of a typo. I am going to fix it and continue, and I will let you know if there are any other errors.

koldunovn commented 1 year ago

It looks like you downloaded FESOM with esm-tools. For a "clear experiment" it would probably be better to clone it directly from the repo?

fernandadialzira commented 1 year ago

> It looks like you downloaded FESOM with esm-tools. For a "clear experiment" it would probably be better to clone it directly from the repo?

Hi, I did try with esm-tools before, but inside /work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom_standalone/fesom2 I was trying the clear experiment, cloning from the repo and following the instructions.

As I wanted to fix the typo shown by @patrickscholz, I deleted the repo and tried to clone and compile the model again. The compilation now did not work with the message:

ld: cannot find -lFALSE
make[2]: *** [src/CMakeFiles/fesom.dir/build.make:1642: src/fesom] Error 1
make[2]: Leaving directory '/work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom2/build'
make[1]: *** [CMakeFiles/Makefile2:140: src/CMakeFiles/fesom.dir/all] Error 2
make[1]: Leaving directory '/work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom2/build'
make: *** [Makefile:139: all] Error 2

As a result, `fesom.x` is not created and I can't try to run the model again. My only guess is that simply deleting the folder does not do the trick. Is there a better way to uninstall the model and start clean again?

suvarchal commented 1 year ago

@fernandadialzira sorry for the late response.

Can you please try `./configure.sh -DBLA_VENDOR=Intel10_64lp`? (My suspicion is that discovering BLAS from newer versions of imkl is harder, or works differently, than it used to.)
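On the question of starting clean: the make output above shows CMake working in a `build` directory inside the checkout, and CMake keeps its cached settings there. The sketch below removes that directory before reconfiguring. It assumes that `build` is the only stale state, which this thread does not confirm, and `reset_build_dir` is a hypothetical helper name.

```shell
# Sketch: remove a stale CMake build tree so the next configure starts fresh.
# The directory name "build" is taken from the make output in this thread;
# the helper name is made up for illustration.
reset_build_dir() (
    model_dir="${1:?usage: reset_build_dir MODEL_DIR}"
    build_dir="${model_dir%/}/build"
    if [ -d "$build_dir" ]; then
        rm -rf -- "$build_dir"
        echo "removed $build_dir"
    else
        echo "nothing to remove at $build_dir"
    fi
)

# Intended use inside the fesom2 checkout (not executed here):
#   reset_build_dir "$PWD"
#   bash -l ./configure.sh -DBLA_VENDOR=Intel10_64lp
```

If the link error persists after this, a fresh clone into a new directory remains the fallback.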

fernandadialzira commented 1 year ago

@suvarchal thank you for your comment!

It worked, but only with a fully clean installation, so now, my fesom standalone directory is /work/ab0246/a270179/runtime/awicm3-v3.1/model_codes/fesom_standalone2/fesom2. With that, I was able to run 10 model years, and the output looks reasonable for a non-equilibrated run (in /work/ab0246/a270179/runtime/awicm3-v3.1/experiments_testing/mesh_sln_003)

[Figure: Mid-Pliocene SST (new mesh)]

[Figure: Pre-Industrial SST (280 ppmv)]

However, to get to this point I had to make some changes, and this is perhaps my contribution to the documentation:

  1. To simulate more than 1 year in one go, one needs to raise `#SBATCH --time=00:30:00` in `fesom2/work/job_levante`. @JanStreffing taught me that I could uncomment the last lines so that the job resubmits itself, but we did not know how to set an end date for the simulation. Therefore, for 10 years, I set the time to `#SBATCH --time=04:30:00`.

  2. I had to add the copying of `namelist.tra`, `namelist.io`, `namelist.dyn` and `namelist.cvmix` to `job_levante`:

cp -n ../config/namelist.tra     .
cp -n ../config/namelist.io      .
cp -n ../config/namelist.cvmix   .
cp -n ../config/namelist.dyn     .

Otherwise I would get errors like:

  3. I also had to change the paths in `namelist.forcing`. At first, based on the documentation, I had only changed `namelist.config` to:
&timestep
step_per_day=32 !96 !96 !72 !72 !45 !72 !96
run_length=10 !62 !62 !62 !28
run_length_unit='y'             ! y, m, d, s
/

&clockinit              ! the model starts at
timenew=0.0
daynew=1
yearnew=1990
/

&paths
MeshPath='/work/ab0246/a270179/runtime/awicm3-v3.1/input/fesom2/midpli/'
ClimateDataPath='/pool/data/AWICM/FESOM2/INITIAL/phc3.0/'
ResultPath='/work/ab0246/a270179/runtime/awicm3-v3.1/experiments_testing/mesh_sln_003/'
/

For namelist.forcing, I did:

&nam_sbc
   nm_xwind_file = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/uas.'        ! name of file with wind speeds x
   nm_ywind_file = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/vas.'        ! name of file with wind speeds y
   nm_xstre_file = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/uas.'        ! name of file with wind stress x
   nm_ystre_file = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/vas.'        ! name of file with wind stress y
   nm_humi_file  = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/huss.'        ! name of file with humidity
   nm_qsr_file   = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/rsds.'    ! name of file with solar heat
   nm_qlw_file   = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/rlds.'    ! name of file with Long wave
   nm_tair_file  = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/tas.'        ! name of file with 2m air temperature
   nm_prec_file  = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/prra.' ! name of file with total precipitation
   nm_snow_file  = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/prsn.' ! name of file with snow  precipitation
   nm_mslp_file  = '/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/psl.'         ! air_pressure_at_sea_level
   nm_xwind_var  = 'uas'   ! name of variable in file with wind
   nm_ywind_var  = 'vas'   ! name of variable in file with wind
   nm_xstre_var  = 'uas'   ! name of variable in file with wind
   nm_ystre_var  = 'vas'   ! name of variable in file with wind
   nm_humi_var   = 'huss'   ! name of variable in file with humidity
   nm_qsr_var    = 'rsds'   ! name of variable in file with solar heat
   nm_qlw_var    = 'rlds'   ! name of variable in file with Long wave
   nm_tair_var   = 'tas'   ! name of variable in file with 2m air temperature
   nm_prec_var   = 'prra'       ! name of variable in file with total precipitation
   nm_snow_var   = 'prsn'       ! name of variable in file with total precipitation
   nm_mslp_var   = 'psl'        ! name of variable in file with air_pressure_at_sea_level
   nm_nc_iyear   = 1900
   nm_nc_imm     = 1            ! initial month of time axis in netCDF
   nm_nc_idd     = 1            ! initial day of time axis in netCDF
   nm_nc_freq    = 1            ! data points per day (i.e. 86400 if the time axis is in seconds)
   nm_nc_tmid    = 0            ! 1 if the time stamps are given at the mid points of the netcdf file, 0 otherwise (i.e. 1 in CORE1, CORE2; 0 in JRA55)
   l_xwind=.true.
   l_ywind=.true.
   l_xstre=.false.
   l_ystre=.false.
   l_humi=.true.
   l_qsr=.true.
   l_qlw=.true.
   l_tair=.true.
   l_prec=.true.
   l_mslp=.false.
   l_cloud=.false.
   l_snow=.true.
   runoff_data_source ='CORE2'  !Dai09, CORE2
   nm_runoff_file     ='/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/CORE2_runoff.nc'
   !nm_runoff_file     ='/work/ollie/qwang/FESOM2_input/mesh/CORE2_finaltopo_mean/forcing_data_on_grid/runoff_clim.nc'
   !runoff_data_source ='Dai09'  !Dai09, CORE2, JRA55
   !runoff_climatology =.true.
   sss_data_source    ='CORE2'
   nm_sss_data_file   ='/pool/data/AWICM/FESOM2/FORCING/JRA55-do-v1.4.0/PHC2_salx.nc'
   chl_data_source    ='None' !'Sweeney' monthly chlorophyll climatology or 'NONE' for constant chl_const (below). Make use_sw_pene=.TRUE. in namelist.config!
   nm_chl_data_file   ='/pool/data/AWICM/FESOM2/FORCING/Sweeney/Sweeney_2005.nc'
   chl_const          = 0.1
/

Of course, these paths are specific to Levante. For other machines, the paths can be found in `fesom2/setups/paths.yml`.
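Because a wrong path in a namelist only shows up once the job is running, it may help to check the directories beforehand. The sketch below pulls every single-quoted value starting with `/` out of a namelist, ignores anything after a Fortran `!` comment marker, and reports whether the containing directory exists. `check_namelist_dirs` is a hypothetical helper, and it assumes the paths contain no spaces or `!` characters.

```shell
# Sketch: verify that directories referenced in a Fortran namelist exist.
# "check_namelist_dirs" is a made-up helper, not part of FESOM2.
check_namelist_dirs() (
    namelist="${1:?usage: check_namelist_dirs NAMELIST}"
    status=0
    set -f   # no glob expansion while splitting the extracted values
    # Strip comments, then keep quoted values that start with "/".
    for value in $(sed 's/!.*$//' "$namelist" | grep -o "'/[^']*'" | tr -d "'"); do
        case "$value" in
            */) dir="${value%/}" ;;           # value is a directory
            *)  dir="$(dirname "$value")" ;;  # value is a file or file prefix
        esac
        if [ -d "$dir" ]; then
            echo "ok      $dir"
        else
            echo "MISSING $dir"
            status=1
        fi
    done
    # The exit status is non-zero if any directory was missing.
    [ "$status" -eq 0 ]
)
```

Calling it as `check_namelist_dirs namelist.forcing` and `check_namelist_dirs namelist.config` on the target machine, before submitting the job, would flag paths that were carried over from another system.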

I believe that this information should be clearly stated in documentation, and that job_levante needs to include the copying of these other namelists.
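One way the extra copy step could look in a job script is a loop that copies each of the four namelists only when the work directory does not already hold a copy, which is the same intent as `cp -n` above but with explicit reporting. This is a sketch of the suggestion, not the actual content of `job_levante`; `copy_missing_namelists` is a hypothetical helper name.

```shell
# Sketch: copy the four additional namelists into the work directory
# without overwriting copies that were already edited there.
copy_missing_namelists() (
    config_dir="${1:?usage: copy_missing_namelists CONFIG_DIR WORK_DIR}"
    work_dir="${2:?usage: copy_missing_namelists CONFIG_DIR WORK_DIR}"
    status=0
    for name in namelist.tra namelist.io namelist.cvmix namelist.dyn; do
        if [ -e "$work_dir/$name" ]; then
            echo "kept    $name"     # existing copy is left untouched
        elif cp "$config_dir/$name" "$work_dir/$name"; then
            echo "copied  $name"
        else
            echo "FAILED  $name" >&2
            status=1
        fi
    done
    # Non-zero exit status if any namelist could not be copied.
    [ "$status" -eq 0 ]
)
```

From `fesom2/work`, the call would be `copy_missing_namelists ../config .`, placed before the model is launched.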

koldunovn commented 1 year ago

Hi @fernandadialzira. Thanks a lot for sharing your experience and your suggestions with us. Why don't you have a go at improving the docs yourself? This is usually best done by people with fresh experience. I added you to the repo, so you just have to make a branch from refactoring and edit the docs. Most of the things you mention should probably go to:

https://github.com/FESOM/fesom2/blob/refactoring/docs/getting_started/getting_started.rst

If you think you have the time and willingness to do it, please give it a try and open a PR. I will be happy to help you with that.

fernandadialzira commented 1 year ago

Hi @koldunovn! I will try it on Monday.

I think it is a good idea, also so that I can practice those things.