[Closed] jpolton closed this issue 3 years ago
Tide-only forcing (M2+S2) runs for 10 days. Hooray!
Over to you @mpayopayo
Run stops at time step 529 (2.93 h) without writing any output when the time in submit.slurm is 30'.
I compared my namelist_cfg, submit.slurm and file_def*.xml to yours and they're the same. I had also done a git pull beforehand.
Changing the time in submit.slurm to 12h; that should be it.
No, the submit.slurm time units are the wall-clock time permitted on the computer, not simulation time. I notice you don't have a RESTART directory. It might be that. You could also try running with my executable files, in case they have somehow ended up different from yours (to track the problem down).
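For illustration, a minimal submit.slurm header might look like the sketch below. The values, account, partition and QOS names are placeholders, not the actual SEVERN-SWOT settings; the simulation length itself is controlled from the NEMO namelist (nn_itend), not from here:

```shell
#!/bin/bash
#SBATCH --job-name=severn_tides
#SBATCH --time=00:30:00        # wall-clock limit for the job, NOT simulation time
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --account=n01-ACCORD   # billing id: change to your own project code
#SBATCH --partition=standard   # placeholder partition name
#SBATCH --qos=standard         # placeholder QOS

srun ./nemo                    # run length is set by nn_itend in namelist_cfg
```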
I understand that, but it states that the job is cancelled due to the time limit, which seems to be more about the time allocated to the job than the time I want to simulate.
@jpolton I created a RESTART folder and also used your executables. I'm getting a similar error: in any case it stops at time step 529.
Do you need to change the billing id in submit.slurm from n01-ACCORD? I can't recall if you were using CLASS or not now.
I hadn't paid attention to the billing id. I should run with CLASS. Trying that now.
@jpolton Well, that didn't work either. I tried with your executables and also with mine. I've also double-checked that I've been added to the CLASS billing account (I have; otherwise it would not get past the queueing stage).
Whatever the case, it "lags" at time step 529 after a couple of minutes. The job appears as running in squeue -u $USER until it reaches the time in #SBATCH --time, but ocean.output, time.step etc. are no longer modified.
I checked back on the messages in ocean.output and slurm*.out for the unforced case (where I was not getting any outputs written) and they're similar to what I'm getting here.
I've also checked for differences in the .xml files but I couldn't find any.
@jpolton All of the previous didn't work, and I don't get any E R R O R in ocean.output.
I may have found it. Your and my MPI subdomains in ocean.output differ.
- The number of mpi processes: jeff 960 (marta 960)
- ...exceeds the maximum number of ocean subdomains: jeff 921 (marta **941**)
- suppressed land subdomains: jeff 1352 (marta **1692**)
- land subdomains kept but useless ("--- YOU ARE WASTING CPU... ---"): jeff 39 (marta **19**)

Then, after "iom_close ~~~ close file: domain_cfg.nc ok", in the "MPI Message Passing - domain layout over processors / defines mpp subdomains" section:

- jpni: jeff 68 (marta 68)
- jpnj: jeff 34 (marta **39**)
- sum ilci(i,1): jeff 476 (marta 476); jpiglo: jeff 342 (marta 342)
- sum ilcj(1,j): jeff 339 (marta **349**); jpjglo: jeff 273 (marta 273)
When I use your domain_cfg.nc, TIDES and coordinates.bdy.nc, it passes the 2-minute point where it was breaking.
I've set a longer run (#SBATCH --time=00:15:00) with fewer cores (128) since the areas only have 7*9 points.

For reference, with #SBATCH --time=00:30:00:
- #SBATCH --nodes=4, #SBATCH --ntasks-per-node=32 gets to a time.step of 10609
- #SBATCH --nodes=8, #SBATCH --ntasks-per-node=120 gets to a time.step of 57073

Hmm. So do I have a different domain_cfg.nc to you? coordinates.bdy.nc and TIDES/* are subsequently made using PyNEMO with the domain_cfg.nc as input, so it wouldn't be a surprise if they also differed.
@jpolton The distribution of subdomains differs, so I guess so does the domain. I followed your steps, so I'm not quite sure where the glitch is.
It works with your domain_cfg.nc, coordinates.bdy.nc and TIDES:
@jpolton I compared your and my coordinates.bdy.nc and domain_cfg.nc.

Comparison of domain_cfg.nc files:
- Dimensions are the same: 'x'=342; 'y'=273; 'z'=31; 't'=1
- Variables: your file has more fields than mine (47 vs 43; my file does not include gdept_1d, gdepw_1d, gdept_0, gdepw_0)

Comparison of coordinates.bdy.nc files:

| Dimension | Jeff | Marta |
| --- | --- | --- |
| 'xbT' | 2059 | 1350 |
| 'xbU' | 2047 | 1347 |
| 'xbV' | 2050 | 1341 |
| 'yb' | 1 | 1 |
So maybe the problem is in the generation of the boundary conditions and not in the domain as such? I'll check that.
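A header comparison like this can be scripted. The sketch below is a hypothetical helper (not part of the repo) that diffs two {name: size} dimension mappings and two lists of variable names; in real use the inputs would be read with netCDF4, e.g. `{k: len(v) for k, v in netCDF4.Dataset('domain_cfg.nc').dimensions.items()}` and `list(ds.variables)`:

```python
def diff_netcdf_headers(dims_a, dims_b, vars_a, vars_b):
    """Report dimension-size mismatches and variables present in only
    one file, given {name: size} dims and iterables of variable names."""
    return {
        "dim_mismatch": {k: (dims_a.get(k), dims_b.get(k))
                         for k in set(dims_a) | set(dims_b)
                         if dims_a.get(k) != dims_b.get(k)},
        "only_in_a": sorted(set(vars_a) - set(vars_b)),
        "only_in_b": sorted(set(vars_b) - set(vars_a)),
    }

# toy check mirroring the comparison above
a_dims = {"xbT": 2059, "xbU": 2047, "xbV": 2050, "yb": 1}
b_dims = {"xbT": 1350, "xbU": 1347, "xbV": 1341, "yb": 1}
r = diff_netcdf_headers(a_dims, b_dims, ["nav_lon", "gdept_0"], ["nav_lon"])
print(r["dim_mismatch"])  # xbT/xbU/xbV differ; yb agrees
print(r["only_in_a"])     # ['gdept_0']
```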
@jpolton I followed the recipe to create the boundary conditions again. The tide files have the correct number of nodes (i.e. the same as yours), but the coordinates.bdy.nc has a different number of nodes to yours (the same as I was getting before). I've looked at other wikis, and https://github.com/NOC-MSM/NEMO-RELOC/wiki/generate_NEMO_obc (at the bottom) says not to use the coordinates.bdy.nc just generated because of a wrong rimwidth. I wonder if the different values in coordinates.bdy.nc have to do with me having a different rimwidth value in namelist_FES14.bdy than the one you have. I have nn_rimwidth = 9 ! width of the relaxation zone. What do you have?
I haven't yet updated the notes for open boundary conditions for the NEMO-RELOC repository. (I've only got as far as building the domain)
It is odd that you have the correct number of grid points in the tides output files but not in the coordinates.bdy.nc file, unless, as you suggest, the number of grid points in the coordinates.bdy.nc file is rim-width times larger than you expected? Or are they as posted previously?
James tells me that setting rimwidth=1 is not necessary when running PyNEMO for tides only, though I often did it for peace of mind.
Regarding what value I used for rimwidth, you should be able to check. I've changed permissions so MPOC can peek: /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES (Spoiler alert: assuming this is the correct path, I used rimwidth=9.)
@jpolton It would seem that the smaller the rimwidth, the smaller the dimensions in coordinates.bdy.nc. The dimensions in the tide files stay the same no matter the rimwidth value.

Dimensions in coordinates.bdy.nc for rimwidth=1 in namelist_FES14.bdy:
xbT = 151 ;
xbU = 151 ;
xbV = 150 ;
yb = 1 ;
Dimensions in coordinates.bdy.nc for rimwidth=9:
xbT = 1350 ;
xbU = 1347 ;
xbV = 1341 ;
yb = 1 ;
U, Z tide files dimensions when rimwidth=1
xb = 151 ;
yb = 1 ;
x = 342 ;
y = 273 ;
V tide files with rimwidth=1
xb = 150 ;
yb = 1 ;
x = 342 ;
y = 273 ;
U, Z tide files dimensions when rimwidth=9
xb = 151 ;
yb = 1 ;
x = 342 ;
y = 273 ;
V tide files when rimwidth=9
xb = 150 ;
yb = 1 ;
x = 342 ;
y = 273 ;
I'm now trying with larger rimwidths. Also, I cannot access your folder.
To be clear these are not dimensions but numbers of boundary points in the file. This is why the tides files do not change size with rim width varying, because tides are only imposed on the outer boundary. The point of the rim width variable is to allow the option of a smoother transition, of boundary values, from the edge into the domain. This is possible for U,V,Z fields. This is why there are approximately 9 times more values in the U,V,Z boundary files when rimwidth=9 compared to when rimwidth=1. (I imagine that the number is not exactly "Nx9" because the open boundaries can go around the box corners, making the inner rims successively shorter). Also the numbers of points in the U, V and Z files can be different because they are on different grids (this is a C-grid model). So I think things seem OK.
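The corner argument above can be sketched numerically. Assuming (purely for illustration) an L-shaped open boundary along two edges of a rectangular grid, each successive inner rim loses two points as it tucks inside the corner, so the total comes out a little under outer_length × rimwidth; the grid size and function below are invented for the sketch, not taken from the real domain:

```python
def rim_points(nx, ny, rimwidth):
    """Count boundary points within `rimwidth` cells of an L-shaped open
    boundary along the west and south edges of an nx-by-ny grid.
    Rim k sits k cells in from the boundary, so it is 2 points shorter
    than the previous rim."""
    return sum((nx - k) + (ny - k) - 1 for k in range(rimwidth))

outer = rim_points(100, 52, 1)   # outer rim alone: 151 points
total = rim_points(100, 52, 9)   # nine nested rims: 1287 points
print(outer, total, 9 * outer)   # total < 9 * outer, as Jeff describes
```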
If you can clarify which directories you would like to be able to access I will change the permissions.
@jpolton If I'm using the same rimwidth values that you do, shouldn't I be getting the same number of boundary points in coordinates.bdy.nc? That's not the case (Jeff vs Marta): 'xbT' 2059 vs 1350; 'xbU' 2047 vs 1347; 'xbV' 2050 vs 1341; 'yb' 1 vs 1.
I think that's why my run with my files lags while with yours it's fine.
The folder I don't have access to is the one you mentioned before: /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES
Hmm. 1) I have made the folder readable and executable. If you cannot read the file, send me the output of the command ls -l /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/namelist_FES14.bdy and I will follow up with IT.
2) Point me to the directory where you generate your coordinates.bdy.nc file. (Maybe run a chmod a+rx -R parent_directory so I can read it. But we might have the same issue with MSM/MPOC permissions.) I'll try to have a look tonight/tomorrow and also see if I can regenerate my files.
@jpolton I don't have permission:

```
ls -l /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/namelist_FES14.bdy
ls: cannot access /login/jelt/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/namelist_FES14.bdy: Permission denied
```

To generate the coordinates.bdy.nc I run PyNEMO in /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES and the files are generated in /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES/OUTPUT. I've given you permission:

```
drwxr-xr-x. 8 marpay mpoc 32768 Jul 6 17:36 OPEN_BOUNDARIES
drwxr-xr-x. 2 marpay mpoc 32768 Jul 6 15:40 OUTPUT
```
@mpayopayo There is something not right happening:
```
livljobs8 ~ $ ls -l /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES
ls: cannot access /work/marpay/SWOT/SEVERN-SWOT/BUILD_CFG/OPEN_BOUNDARIES: Permission denied
livljobs8 ~ $ ls -l /work/marpay/SWOT/SEVERN-SWOT/
ls: cannot access /work/marpay/SWOT/SEVERN-SWOT/: Permission denied
livljobs8 ~ $ ls -l /work/marpay/SWOT/
ls: cannot access /work/marpay/SWOT/: Permission denied
livljobs8 ~ $ ls -l /work/marpay/
ls: cannot open directory /work/marpay/: Permission denied
livljobs8 ~ $ ls -l /work/marpay
ls: cannot open directory /work/marpay: Permission denied
```
I will raise it with IT. Perhaps I am being daft or something.
@jpolton, could the source of the differences be the mask file? I can now see your folder, but I cannot find your bdy_mask.nc. I'm assuming the folder it should be in is /work/jelt/SEVERN-SWOT/BUILD_CFG/DOMAIN/, as per the wiki.
@jpolton I checked your bdy_mask.nc vs my bdy_mask (I think PyNEMO reads this file to generate the boundary conditions).
They do differ: same dimensions, but different values in 3403 locations. E.g. mine has no -1 values, while yours has 236 of them.
So maybe this is the source of the problem?
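A mask comparison like this takes only a few lines with numpy. The sketch below uses synthetic arrays; in real use the two inputs would be the `mask` variable read from each bdy_mask.nc via netCDF4:

```python
import numpy as np

def compare_masks(mask_a, mask_b):
    """Count disagreeing cells and list the distinct values in each mask
    (same shape assumed)."""
    diff = mask_a != mask_b
    return {
        "n_diff": int(diff.sum()),
        "values_a": np.unique(mask_a).tolist(),
        "values_b": np.unique(mask_b).tolist(),
    }

# synthetic example: mask_b flags a column of open-boundary points (-1)
mask_a = np.ones((6, 4), dtype=int)
mask_b = mask_a.copy()
mask_b[:, 0] = -1
print(compare_masks(mask_a, mask_b))  # n_diff = 6; only mask_b contains -1
```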
Ah ha. It sounds like your python fix is not working. This is the relevant bit on the wiki:
```python
import netCDF4
import numpy as np

dset = netCDF4.Dataset('bdy_mask.nc', 'a')
[ny, nx] = np.shape(dset.variables['mask'][:])
for i in range(ny):
    # flag western-edge wet points as open boundary (-1); mask the rest
    if dset.variables['mask'][i, 1] == 1:
        dset.variables['mask'][i, 0] = -1
    else:
        dset.variables['mask'][i, 0] = 0
dset.variables['mask'][248::, 0:20] = 0  # Mask out rogue 'lake'.
dset.close()
```
@jpolton found it, my mask is wrong because my domain is wrong because my bathymetry file is wrong. I'm missing the SW bit of the sea. So no need for you to run tide generation with my scripts. I'll go back to the domain generation...
Well done for spotting it. Perhaps the notes for making the bathymetry could have been clearer? Or perhaps it was a workflow issue whereby a slicker "build-all" script might have flushed the problem away? Have a think - surely some aspect of the build process can be improved from this experience.
@jpolton, we may need to do something with the bathy; @micdom is encountering the same issue.
@mpayopayo @micdom Hmm. In the 3rd code block of https://github.com/JMMP-Group/SEVERN-SWOT/wiki/2.-Build-domain-configuration-file it looks like the southern chunk of domain is masked out:
```python
import netCDF4

# source GEBCO bathymetry (opened read-only; the edits go to the copy below)
dset = netCDF4.Dataset('gebco_in.nc', 'r')
dout = netCDF4.Dataset('fixed_bathy.nc', 'a')
dout.variables['elevation'][0:99, :] = 0       # zero out the southern strip
dout.variables['elevation'][0:200, 300::] = 0  # zero out a further chunk in the south-east
dset.close()
dout.close()
```
For some reason my files do not show this piece of water missing. You could remove or not remove this chunk of water. I suspect I did it as an 'upgrade' to make the domain smaller (maybe based on an updated idea of the domain of interest), but never followed it through with implementing it in my tests.
If everything is kept consistent this shouldn't physically matter.... (indeed @mpayopayo, I think your no-tides worked)
If @micdom gets the same odd ARCHER2 issues when the tides job is submitted, I'll have another go too to try and iron out these oddities.
@jpolton, looking at what I have for the unforced case, I wonder if it was having the same issue. I was running with 10' in the sbatch time, and time.step was last written 10' before slurm*.out. There was no error in ocean.output either, but the last line was the same.
```
-rw-r--r-- 1 marpay n01  77051 Jun 9 15:54 layout.dat
-rw-r--r-- 1 marpay n01 307361 Jun 9 15:54 ocean.output
-rw-r--r-- 1 marpay n01     10 Jun 9 15:54 time.step
-rw-r--r-- 1 marpay n01    292 Jun 9 15:54 run.stat.nc
-rw-r--r-- 1 marpay n01      0 Jun 9 15:54 run.stat
-rw-r--r-- 1 marpay n01 913346 Jun 9 15:54 output.namelist.dyn
-rw-r--r-- 1 marpay n01      0 Jun 9 15:54 communication_report.txt
-rw-r--r-- 1 marpay n01 151068 Jun 9 16:04 slurm-320871.out
```
@micdom has run with tides for 24 hrs. So the wiki and scripts are sufficient? So this ticket "Run tide only" is done and can be closed? (@mpayopayo Re-open ticket if you disagree)
New challenges:
These can go on separate tickets on the project board (https://github.com/JMMP-Group/SEVERN-SWOT/projects/1)
@jpolton, reopen because I get a segmentation fault.
Starting from scratch works fine; there was something wrong in my SEVERN-SWOT.
Run the Severn domain with tides only forcing.
Instructions: