JMMP-Group / SEVERN-SWOT

Severn estuary 500m ocean model
MIT License

Run tide only #6

Closed jpolton closed 3 years ago

jpolton commented 3 years ago

Run the Severn domain with tide-only forcing.

Instructions:

jpolton commented 3 years ago

[Screen capture, 27 May, 1:41 pm]

Tide-only forcing (M2+S2) runs for 10 days. Hooray!

Over to you @mpayopayo

mpayopayo commented 3 years ago

The run stops at time step 529 (2.93 h of simulated time) without writing any output when the time in submit.slurm is 30 minutes.

I compared my namelist_cfg, submit.slurm and file_def*.xml to yours and they're the same. I had also done a git pull beforehand. I'm changing the time in submit.slurm to 12 h; that should be it.

jpolton commented 3 years ago

No, the submit.slurm time units are the time permitted on the computer, not simulation time. I notice you don't have a RESTART directory; it might be that. You could also try running with my executable files, in case they have somehow ended up different from yours (to track the problem down).

mpayopayo commented 3 years ago

I understand that, but it states that the job is cancelled due to the time limit, which seems to be more about the time allocated to the job than about the time I want to simulate.

mpayopayo commented 3 years ago

@jpolton I created a RESTART folder and also used your executables. I'm getting a similar error (see screenshot); in any case it stops at time step 529.

jpolton commented 3 years ago

Do you need to change the billing id in submit.slurm from n01-ACCORD? I can't recall now whether you were using CLASS or not.

mpayopayo commented 3 years ago

I hadn't paid attention to the billing id. I should run with CLASS. Trying that now.

mpayopayo commented 3 years ago

@jpolton Well, that didn't work either. I tried with your executables and also with mine. I've also double-checked that I've been added to the CLASS billing account (I have been, otherwise it would not get past the queueing stage).

Whatever the case, it "lags" at time step 529 after a couple of minutes. The job appears as running in squeue -u $USER until it reaches the time in #SBATCH --time, but ocean.output, time.step etc. are no longer modified.

I checked back, and the messages in ocean.output and slurm*.out for the unforced case (where I was not getting any outputs written) are similar to what I'm getting here.

I've also checked for differences in the .xml files but I couldn't find any.

mpayopayo commented 3 years ago

@jpolton None of the previous attempts worked, and I don't get any E R R O R in ocean.output.

I may have found it: your and my MPI subdomains in ocean.output differ.

The number of mpi processes:  jeff  960  (marta 960)
exceeds the maximum number of ocean subdomains =  jeff  921 (marta **941**)
we suppressed jeff 1352 (marta **1692**) land subdomains
BUT we had to keep  jeff  39  (marta **19**) land subdomains that are useless...

 --- YOU ARE WASTING CPU... ---

                iom_close ~~~ close file: domain_cfg.nc ok

MPI Message Passing MPI - domain lay out over processors

defines mpp subdomains
   jpni = jeff  68 (marta 68)
   jpnj = jeff 34 (marta **39**)

   sum ilci(i,1) = jeff  476 (marta 476)  jpiglo = jeff 342 (marta 342)
   sum ilcj(1,j) =  jeff 339 (marta **349**) jpjglo = jeff 273 (marta 273)
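
As a sanity check, the counts above are internally consistent: jpni x jpnj minus the suppressed land subdomains gives the 960 MPI processes in both cases. A minimal sketch in Python, with the values copied from the two ocean.output excerpts:

    # Check the decomposition arithmetic using the values reported in ocean.output above.
    runs = {
        "jeff":  {"jpni": 68, "jpnj": 34, "suppressed_land": 1352, "max_ocean": 921},
        "marta": {"jpni": 68, "jpnj": 39, "suppressed_land": 1692, "max_ocean": 941},
    }
    for name, r in runs.items():
        total = r["jpni"] * r["jpnj"]            # all subdomains in the regular decomposition
        kept = total - r["suppressed_land"]      # subdomains actually handed to MPI processes
        useless = kept - r["max_ocean"]          # land-only subdomains that could not be dropped
        print(name, total, kept, useless)        # jeff: 2312 960 39, marta: 2652 960 19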

When I use your domain_cfg.nc, TIDES and coordinates.bdy.nc it gets past the 2-minute point where it was breaking. I've set a longer run (#SBATCH --time=00:15:00) with fewer cores (128), since the subdomains only have 7*9 points.

For reference, this was with #SBATCH --time=00:30:00.

jpolton commented 3 years ago

Hmm. So do I have a different domain_cfg.nc from you? coordinates.bdy.nc and TIDES/* are subsequently made using PyNEMO with domain_cfg.nc as input, so it wouldn't be a surprise if they also differed.

mpayopayo commented 3 years ago

@jpolton The distribution of subdomains differs, so I guess the domain does too. I followed your steps, so I'm not quite sure where the glitch is. It works with your domain_cfg.nc, coordinates.bdy.nc and TIDES (see screenshot).

mpayopayo commented 3 years ago

@jpolton I compared your and my coordinates.bdy.nc and domain_cfg.nc.

Comparison of domain_cfg.nc files: the dimensions are the same ('x'=342; 'y'=273; 'z'=31; 't'=1). For the variables, your file has more fields than mine (47 vs 43; my file does not include gdept_1d, gdepw_1d, gdept_0, gdepw_0).

Comparison of coordinates.bdy.nc files:

            Jeff   Marta
    'xbT'   2059   1350
    'xbU'   2047   1347
    'xbV'   2050   1341
    'yb'       1      1

So maybe the problem is in the generation of the boundary conditions and not in the domain as such? I'll check that.
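
For anyone repeating this comparison, a minimal netCDF4 sketch (the two file paths are placeholders for the copies being compared):

    import netCDF4

    # Placeholder paths: point these at the two domain_cfg.nc files being compared.
    with netCDF4.Dataset('jeff/domain_cfg.nc') as a, netCDF4.Dataset('marta/domain_cfg.nc') as b:
        print('dims A:', {k: len(v) for k, v in a.dimensions.items()})
        print('dims B:', {k: len(v) for k, v in b.dimensions.items()})
        va, vb = set(a.variables), set(b.variables)
        print('only in A:', sorted(va - vb))   # e.g. gdept_1d, gdepw_1d, gdept_0, gdepw_0
        print('only in B:', sorted(vb - va))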

mpayopayo commented 3 years ago

@jpolton. I followed the recipe to create the boundary conditions again. The tide files have the correct number of nodes (i.e. the same as yours), but coordinates.bdy.nc has a different number of nodes from yours (the same numbers I was getting before). I've looked at other wikis, and at the bottom of https://github.com/NOC-MSM/NEMO-RELOC/wiki/generate_NEMO_obc it says not to use the coordinates.bdy.nc just generated because it has the wrong rimwidth:

[screenshot of the wiki note]

and I wonder whether the different values in coordinates.bdy.nc are because my rimwidth in namelist_FES14.bdy differs from yours. I have nn_rimwidth = 9 ! width of the relaxation zone. What value do you have?

jpolton commented 3 years ago

> I followed the recipe to create the boundary conditions again. The tide files have the correct number of nodes (i.e. the same as yours), but coordinates.bdy.nc has a different number of nodes from yours [...] I wonder whether the different values in coordinates.bdy.nc are because my rimwidth in namelist_FES14.bdy differs from yours. I have nn_rimwidth = 9 ! width of the relaxation zone. What value do you have?

I haven't yet updated the notes for open boundary conditions in the NEMO-RELOC repository (I've only got as far as building the domain). It is odd that you have the correct number of grid points in the tides output files but not in the coordinates.bdy.nc file, unless, as you suggest, the number of grid points in coordinates.bdy.nc is rimwidth times larger than you expected. Or are they as posted previously? James tells me that setting rimwidth=1 is not necessary when running PyNEMO for tides only, though I often did it for peace of mind.

Regarding what value I used for rimwidth, you should be able to check. I've changed the permissions so MPOC can peek: /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES (Spoiler alert: assuming this is the correct path, I used rimwidth=9).

mpayopayo commented 3 years ago

@jpolton It would seem that the smaller the rimwidth, the smaller the dimensions in coordinates.bdy.nc. The dimensions in the tide files stay the same regardless of the rimwidth value.

Dimensions in coordinates.bdy.nc for rimwidth=1 in namelist_FES14.bdy:

        xbT = 151 ;
        xbU = 151 ;
        xbV = 150 ;
        yb = 1 ;

dimensions in coordinates.bdy.nc for rimwidth=9

        xbT = 1350 ;
        xbU = 1347 ;
        xbV = 1341 ;
        yb = 1 ;

U, Z tide files dimensions when rimwidth=1

        xb = 151 ;
        yb = 1 ;
        x = 342 ;
        y = 273 ;

V tide files with rimwidth=1

        xb = 150 ;
        yb = 1 ;
        x = 342 ;
        y = 273 ;

U, Z tide files dimensions when rimwidth=9

        xb = 151 ;
        yb = 1 ;
        x = 342 ;
        y = 273 ;

V tide files when rimwidth=9

        xb = 150 ;
        yb = 1 ;
        x = 342 ;
        y = 273 ;

I'm trying with larger rimwidths. Also, I cannot access your folder.
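
The counts above can be read straight from the file headers; a minimal sketch of that kind of check, with the PyNEMO output file names as placeholders:

    import netCDF4

    # File names are placeholders; substitute the coordinates and tide files being inspected.
    for fname in ['coordinates.bdy.nc',
                  'tide_M2_grid_T.nc', 'tide_M2_grid_U.nc', 'tide_M2_grid_V.nc']:
        with netCDF4.Dataset(fname) as d:
            print(fname, {k: len(v) for k, v in d.dimensions.items()})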

jpolton commented 3 years ago

> It would seem that the smaller the rimwidth, the smaller the dimensions in coordinates.bdy.nc. The dimensions in the tide files stay the same regardless of the rimwidth value. [...] I'm trying with larger rimwidths. Also, I cannot access your folder.

To be clear, these are not dimensions but numbers of boundary points in the file. This is why the tides files do not change size as the rim width varies: tides are only imposed on the outer boundary.

The point of the rim width variable is to allow the option of a smoother transition of boundary values from the edge into the domain. This is possible for the U, V, Z fields, which is why there are approximately 9 times more values in the U, V, Z boundary files when rimwidth=9 compared to when rimwidth=1. (I imagine that the number is not exactly "Nx9" because the open boundaries can go around the box corners, making the inner rims successively shorter.) Also, the numbers of points in the U, V and Z files can be different because they are on different grids (this is a C-grid model). So I think things seem OK.
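
As a quick arithmetic check on the counts posted above: nine rims of the 151-point outer boundary would give 1359 T points, and the reported 1350 is just short of that, consistent with the inner rims losing a few points at the corners.

    # Quick check using the boundary-point counts posted above.
    xbT_rim1 = 151    # T-grid boundary points with rimwidth=1 (outer rim only)
    xbT_rim9 = 1350   # T-grid boundary points with rimwidth=9
    print(9 * xbT_rim1)             # 1359: naive "N x 9" estimate
    print(9 * xbT_rim1 - xbT_rim9)  # 9: points lost because the inner rims shorten at corners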

If you can clarify which directories you would like to be able to access I will change the permissions.

mpayopayo commented 3 years ago

@jpolton if I'm using the same rimwidth value as you, shouldn't I be getting the same number of boundary points in coordinates.bdy.nc? That's not the case:

            Jeff   Marta
    'xbT'   2059   1350
    'xbU'   2047   1347
    'xbV'   2050   1341
    'yb'       1      1

I think that's why my run lags with my files while with yours it is fine.

The file I don't have access to is the one you mentioned before in /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES

jpolton commented 3 years ago

> if I'm using the same rimwidth value as you, shouldn't I be getting the same number of boundary points in coordinates.bdy.nc? That's not the case [...] The file I don't have access to is the one you mentioned before, in /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES

Hmm. 1) I have made the folder readable and executable. If you cannot read the file, send me the output from the command:

 ls -l /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/namelist_FES14.bdy

and I will follow up with IT.

2) Point me to the directory where you generate your coordinates.bdy.nc file. (Maybe run chmod a+rx -R parent_directory so I can read it, but we might have the same issue with MSM/MPOC permissions.) I'll try to have a look tonight/tomorrow and also see if I can regenerate my files.

mpayopayo commented 3 years ago

@jpolton

  1. I don't have permission:

         ls -l /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/namelist_FES14.bdy
         ls: cannot access /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/namelist_FES14.bdy: Permission denied

  2. To generate coordinates.bdy.nc I run PyNEMO in /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES and the files are generated in /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/OUTPUT. I've given you permission:

    drwxr-xr-x. 8 marpay mpoc 32768 Jul 6 17:36 OPEN_BOUNDARIES
    drwxr-xr-x. 2 marpay mpoc 32768 Jul 6 15:40 OUTPUT

jpolton commented 3 years ago

@mpayopayo There is something not right happening:

livljobs8 ~ $ ls -l /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES
ls: cannot access /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES: Permission denied
livljobs8 ~ $ ls -l /work/marpay/SWOT/SEVERN-SWOT/
ls: cannot access /work/marpay/SWOT/SEVERN-SWOT/: Permission denied
livljobs8 ~ $ ls -l /work/marpay/SWOT/
ls: cannot access /work/marpay/SWOT/: Permission denied
livljobs8 ~ $ ls -l /work/marpay/
ls: cannot open directory /work/marpay/: Permission denied
livljobs8 ~ $ ls -l /work/marpay
ls: cannot open directory /work/marpay: Permission denied
livljobs8 ~ $ ls -l /work/marpay/
ls: cannot open directory /work/marpay/: Permission denied

I will raise with IT. Perhaps I am being daft or something.

mpayopayo commented 3 years ago

@jpolton, could the source of the differences be the mask file? I can now see your folder, but I cannot find your bdy_mask.nc. I'm assuming the folder it should be in is /work/jelt/SEVERN-SWOT/BUILD_CFG/DOMAIN/, as per the wiki.

mpayopayo commented 3 years ago

@jpolton I checked your bdy_mask.nc against my bdy_mask.nc (I think PyNEMO reads this file to generate the boundary conditions). They do differ: the dimensions are the same but the values differ at 3403 locations. For example, mine has no -1 values, while yours has 236 of them.

So maybe this is the source of the problem?
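
To reproduce the comparison, a minimal sketch (the file paths are placeholders, and the variable is assumed to be called 'mask'):

    import netCDF4

    # Placeholder paths for the two mask files being compared.
    with netCDF4.Dataset('jeff/bdy_mask.nc') as da, netCDF4.Dataset('marta/bdy_mask.nc') as db:
        a = da.variables['mask'][:]
        b = db.variables['mask'][:]
    print('differing cells:', int((a != b).sum()))                     # 3403 in the check above
    print('-1 count (jeff, marta):', int((a == -1).sum()), int((b == -1).sum()))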

jpolton commented 3 years ago

Ah ha. It sounds like your python fix is not working.

This is the bit on the wiki:

import netCDF4
import numpy as np
dset = netCDF4.Dataset('bdy_mask.nc','a')
[ny,nx] = np.shape(dset.variables['mask'][:])
for i in range(ny):
  if dset.variables['mask'][i,1] == 1:
    dset.variables['mask'][i,0] = -1
  else:
    dset.variables['mask'][i,0] = 0

dset.variables['mask'][248::,0:20] = 0 # Mask out rogue 'lake'.
dset.close()
mpayopayo commented 3 years ago

@jpolton Found it: my mask is wrong because my domain is wrong because my bathymetry file is wrong. I'm missing the SW bit of the sea. So no need for you to run the tide generation with my scripts. I'll go back to the domain generation... (see screenshots)


jpolton commented 3 years ago

> Found it: my mask is wrong because my domain is wrong because my bathymetry file is wrong. I'm missing the SW bit of the sea. [...]

Well done for spotting it. Perhaps the notes for making the bathymetry could have been clearer? Or perhaps it was a workflow issue whereby a slicker "build-all" script might have flushed the problem away? Have a think - surely some aspect of the build process can be improved from this experience.

mpayopayo commented 3 years ago

@jpolton, we may need to do something with the bathy; @micdom is encountering the same issue:

[Screenshot, 2021-07-08 15:41]

jpolton commented 3 years ago

@mpayopayo @micdom Hmm. In the 3rd code block of https://github.com/JMMP-Group/SEVERN-SWOT/wiki/2.-Build-domain-configuration-file it looks like the southern chunk of the domain is masked out:

import netCDF4
import numpy as np
dset = netCDF4.Dataset('gebco_in.nc','r')
dout = netCDF4.Dataset('fixed_bathy.nc','a')

dout.variables['elevation'][0:99,:] = 0
dout.variables['elevation'][0:200,300::] = 0

dset.close()
dout.close()

For some reason my files do not show this piece of water missing. You could either remove this chunk of water or not. I suspect I did it as an 'upgrade' to make the domain smaller (maybe based on an updated idea of the domain of interest), but never followed through with implementing it in my tests.

If everything is kept consistent this shouldn't physically matter... (indeed @mpayopayo, I think your no-tides run worked).

If @micdom gets the same odd ARCHER2 issues when the tides job is submitted, I'll have another go too to try and iron out these oddities.

mpayopayo commented 3 years ago

@jpolton, looking at what I have for the unforced run, I wonder whether it was hitting the same issue. I was running with 10 minutes in the sbatch time, and time.step was last written 10 minutes before slurm*.out. There was no error in ocean.output either, but the last line was the same.

-rw-r--r-- 1 marpay n01   77051 Jun  9 15:54 layout.dat
-rw-r--r-- 1 marpay n01  307361 Jun  9 15:54 ocean.output
-rw-r--r-- 1 marpay n01      10 Jun  9 15:54 time.step
-rw-r--r-- 1 marpay n01     292 Jun  9 15:54 run.stat.nc
-rw-r--r-- 1 marpay n01       0 Jun  9 15:54 run.stat
-rw-r--r-- 1 marpay n01  913346 Jun  9 15:54 output.namelist.dyn
-rw-r--r-- 1 marpay n01       0 Jun  9 15:54 communication_report.txt
-rw-r--r-- 1 marpay n01  151068 Jun  9 16:04 slurm-320871.out
jpolton commented 3 years ago

@micdom has run with tides for 24 hrs, so the wiki and scripts are sufficient? In that case this ticket, "Run tide only", is done and can be closed. (@mpayopayo, re-open the ticket if you disagree.)

New challenges:

These can go on separate tickets on the project board (https://github.com/JMMP-Group/SEVERN-SWOT/projects/1)

mpayopayo commented 3 years ago

@jpolton, reopening because I get a segmentation fault.

mpayopayo commented 3 years ago

Starting from scratch works fine; there was something wrong in my SEVERN-SWOT setup.