Open ashjbarnes opened 5 months ago
OK, good to raise this, but as it reads now it's only a note to self?
If you don't figure it out, please add an MWE here that shows the error you get, so that it's documented. E.g., what is the "simple domain"?
(Why does the figure for #100 need bathymetry?)
Because I only have the bathy regridded for the high res domain. To expand the domain and have a border of low res I need more bathy (which also gives the land mask)
Simple domain:
yextent = [-56,-26]
xextent = [142,180]
resolution = 0.05
Everything else set to defaults. This exact code worked fine on 24 cores on a previous version but I'm yet to figure out which changes caused the issue
Thanks!
The kwargs are called longitude_extent and latitude_extent now; not sure if that's your mistake.
If you post actual code you run + error I might be able to help
But don't worry about figuring out the historical thread of things -- just try to make this work on current version
Still I don't understand why you need bathymetry for the figure proposed in #100 but that seems like something we should discuss in #100.
No, it wasn't to do with wrong variable names. Everything is input with the updated argument names, but something breaks ("numpy can't allocate 2 EiB of data") or hangs forever. It used to take 2 min. More investigation needed.
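For scale, here is a rough back-of-envelope sketch of how large the arrays for this domain should be, compared with the 2 EiB in the error message. It assumes 8-byte float64 values and that GEBCO_2022 sits on its published 15 arc-second grid (240 points per degree); neither assumption comes from this thread, and the helper `n_points` is purely illustrative. It does not locate the fault, it only shows how far the request is from any plausible array size.

```python
# Back-of-envelope array sizes for the domain quoted in this thread.
# Assumptions (not from the thread): float64 values (8 bytes each) and a
# 15 arc-second GEBCO grid, i.e. 240 points per degree.

def n_points(lon_extent, lat_extent, points_per_degree):
    """Number of grid points covering a lon/lat box at a given density."""
    nx = round((lon_extent[1] - lon_extent[0]) * points_per_degree)
    ny = round((lat_extent[1] - lat_extent[0]) * points_per_degree)
    return nx * ny

lon_extent, lat_extent = (142, 180), (-56, -26)
BYTES_PER_VALUE = 8

target_points = n_points(lon_extent, lat_extent, 20)    # 1/20 degree model grid
source_points = n_points(lon_extent, lat_extent, 240)   # GEBCO subset, same box

target_bytes = target_points * BYTES_PER_VALUE
source_bytes = source_points * BYTES_PER_VALUE
reported_bytes = 2 * 2**60                               # "2 EiB" from the error

print(f"target grid  : {target_points:,} points, {target_bytes / 2**20:.1f} MiB")
print(f"GEBCO subset : {source_points:,} points, {source_bytes / 2**20:.1f} MiB")
print(f"reported request is {reported_bytes / source_bytes:.1e} x the GEBCO subset")
```

Under these assumptions both arrays fit comfortably in memory on a single node, so an exbibyte-scale request suggests a mis-sized dimension or an unintended dense intermediate rather than a domain that is simply too big.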
Could you provide a code snippet that gives me the error when I copy-paste it into Python or a Jupyter notebook?
I made an MWE.
import regional_mom6 as rmom6
import os
import xarray as xr
from pathlib import Path
from dask.distributed import Client
scratch = "/scratch/v45/nc3020"
gdata = "/g/data/v45/nc3020"
home = "/home/552/nc3020"
expt_name = "bathymetry_mwe"
input_dir = f"{scratch}/regional_mom6_configs/{expt_name}/"
run_dir = f"{home}/mom6_rundirs/{expt_name}/"
toolpath_dir = "/home/157/ahg157/repos/mom5/src/tools/"
tmp_dir = f"{gdata}/{expt_name}"
for path in (run_dir, tmp_dir, input_dir):
    os.makedirs(str(path), exist_ok=True)
expt = rmom6.experiment(
    longitude_extent = (142, 180),
    latitude_extent = (-56, -26),
    resolution = 1/20,
    date_range = ["2003-01-01 00:00:00", "2003-01-05 00:00:00"],
    number_vertical_layers = 75,
    layer_thickness_ratio = 10,
    depth = 4500,
    mom_run_dir = run_dir,
    mom_input_dir = input_dir,
    toolpath_dir = toolpath_dir
)
expt.setup_bathymetry(
    bathymetry_path='/g/data/ik11/inputs/GEBCO_2022/GEBCO_2022.nc',
    longitude_coordinate_name='lon',
    latitude_coordinate_name='lat',
    vertical_coordinate_name='elevation',
    minimum_layers=1
)
expt.bathymetry.depth.plot()
The above gives
Begin regridding bathymetry...
If this process hangs it means that the chosen domain might be too big to handle this way. After ensuring access to appropriate computational resources, try calling ESMF directly from a terminal in the input directory via
mpirun ESMF_Regrid -s bathymetry_original.nc -d bathymetry_unfinished.nc -m bilinear --src_var elevation --dst_var elevation --netcdf4 --src_regional --dst_regional
For details see https://xesmf.readthedocs.io/en/latest/large_problems_on_HPC.html
Aftewards, we run 'tidy_bathymetry' method to skip the expensive interpolation step, and finishing metadata, encoding and cleanup.
Regridding in parallel: True
and hangs there for at least 10-15 min, after which I lost patience and killed the kernel.
However, if I change to
longitude_extent = (142, 144),
latitude_extent = (-56, -52),
resolution = 1/4,
I get this plot after a few seconds...
I don't see the claimed bug!
On the contrary, I see that the code warns the user: "If this process hangs it means ..."
So not only is there no bug, but the code also helps users who want to wait less.
Thanks. The point, though, is that the code used to work with the same-sized example and the same-sized compute, just in the Jupyter notebook. So something has still hurt the code's efficiency.
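One way to turn "it hangs" into something that can be posted here is to record where Python is sitting while the call is stuck. The sketch below uses the standard-library `faulthandler` module to write the tracebacks of all threads to a log file at a fixed interval. The name `hang_watchdog`, the log file names, and the 120 s interval are illustrative choices, not part of regional_mom6.

```python
import faulthandler
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def hang_watchdog(log_path, dump_after_s=120):
    """Log thread tracebacks every `dump_after_s` seconds while the block runs.

    The elapsed wall time is appended to the same log when the block exits.
    """
    with open(log_path, "w") as log:
        # faulthandler writes straight to the file descriptor, so the dumps
        # are on disk even if the kernel is killed afterwards.
        faulthandler.dump_traceback_later(dump_after_s, repeat=True, file=log)
        start = time.perf_counter()
        try:
            yield
        finally:
            faulthandler.cancel_dump_traceback_later()
            log.write(f"elapsed: {time.perf_counter() - start:.1f} s\n")

# Quick self-check with a block that finishes immediately.
demo_log = os.path.join(tempfile.gettempdir(), "watchdog_demo.log")
with hang_watchdog(demo_log, dump_after_s=60):
    sum(range(1000))

# Intended use, with the same arguments as in the MWE:
# with hang_watchdog("bathymetry_hang.log"):
#     expt.setup_bathymetry(...)
```

Comparing such a log between the old and current versions would show whether the time goes into the regridding call itself or into a step before it.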
OK. A performance issue :)
I've tried with mpirun and that breaks too, despite being given ample resources (96 cores, 250 GB of memory). This points to an issue with the hgrid and raw bathymetry files, as these are what is fed into the mpirun command. Or with xESMF itself somehow? I'll keep looking into it, but it might take me a while.
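To make the mpirun failure reproducible, the call could be launched from Python with its output captured to a file. This is a minimal sketch: the command string is copied from the warning message printed by the package, while the `-np` flag, the function names, and the log file name are assumptions about the local setup (the `-np` option depends on the MPI launcher in use).

```python
import shlex
import subprocess

# Command exactly as printed in the regridding warning message.
ESMF_CMD = (
    "mpirun ESMF_Regrid -s bathymetry_original.nc -d bathymetry_unfinished.nc "
    "-m bilinear --src_var elevation --dst_var elevation "
    "--netcdf4 --src_regional --dst_regional"
)

def esmf_regrid_argv(n_procs=None):
    """Split the command into an argument list, optionally adding '-np N'."""
    argv = shlex.split(ESMF_CMD)
    if n_procs is not None:
        # Assumption: the MPI launcher accepts '-np' for the process count.
        argv[1:1] = ["-np", str(n_procs)]
    return argv

def run_esmf_regrid(input_dir, n_procs=None, log_name="esmf_regrid.log"):
    """Run the regrid in `input_dir`, sending stdout and stderr to a log file.

    Needs MPI and ESMF on PATH; raises CalledProcessError on a non-zero exit.
    """
    with open(f"{input_dir}/{log_name}", "w") as log:
        subprocess.run(
            esmf_regrid_argv(n_procs),
            cwd=input_dir,
            stdout=log,
            stderr=subprocess.STDOUT,
            check=True,
        )

argv = esmf_regrid_argv(n_procs=96)
print(shlex.join(argv))  # the command line that would be executed
```

Calling `run_esmf_regrid(input_dir, n_procs=96)` would then leave the full ESMF output in the log, which can be attached to the issue together with the dimensions of the two input files.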
I've been trying to reproduce the figure for the paper, and have therefore been re-making the bathymetry. Strangely, some tasks that used to be really simple and fast (e.g., my region of study at 1/12 degree used to run on one node in ~2 min) now hang.
On some further testing, it's now failing because it can't allocate stupid amounts of memory. Somewhere along the line we've messed up this function. I'm not sure how it's still passing the GitHub Actions tests! There's nothing really special about my domain.
I'll keep troubleshooting