COSIMA / regional-mom6

Automatic generation of regional configurations of the Modular Ocean Model 6 (MOM6) in Python
https://regional-mom6.readthedocs.io/en/latest
MIT License

`setup_bathymetry` hangs on simple domain #168

Open ashjbarnes opened 5 months ago

ashjbarnes commented 5 months ago

I've been trying to reproduce the figure for the paper, and have therefore been re-making bathymetry. Strangely, some tasks that used to be really simple and fast (e.g. my region of study at 1/12 degree used to run on one node in ~2 min) now hang.

On some further testing, it's now failing because it can't allocate stupid amounts of memory. Somewhere along the line we've messed up this function. I'm not sure how it's still passing the GitHub Actions tests! There's nothing really special about my domain.

I'll keep troubleshooting

navidcy commented 5 months ago

OK, good to raise this, but as it reads now it's only a note to self?

If you can't figure it out, please add an MWE here that reproduces the error you get, so that it's documented. E.g., what is a "simple domain"?

navidcy commented 5 months ago

(Why does the figure for #100 need bathymetry?)

ashjbarnes commented 5 months ago

Because I only have the bathy regridded for the high res domain. To expand the domain and have a border of low res I need more bathy (which also gives the land mask)

Simple domain:

yextent = [-56, -26]
xextent = [142, 180]
resolution = 0.05

Everything else set to defaults. This exact code worked fine on 24 cores on a previous version but I'm yet to figure out which changes caused the issue

navidcy commented 5 months ago

Thanks!

The kwargs are now called longitude_extent and latitude_extent; not sure if that's your mistake. If you post the actual code you ran plus the error, I might be able to help.

But don't worry about figuring out the historical thread of things -- just try to make this work on current version

navidcy commented 5 months ago

Still I don't understand why you need bathymetry for the figure proposed in #100 but that seems like something we should discuss in #100.

ashjbarnes commented 5 months ago

No, it wasn't to do with wrong variable names. Everything is input with the updated argument names, but something breaks ("numpy can't allocate 2 EiB of data") or hangs forever. It used to take 2 min. More investigation needed.
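As a rough sanity check on that allocation error, the sketch below compares the size of one field on the target grid with 2 EiB. It assumes the quoted figure means 2 EiB (2 × 2^60 bytes), counts plain grid cells on a regular lat-lon grid, and ignores any finer supergrid or source-data arrays the library may build internally; `grid_shape` is a hypothetical helper, not part of regional_mom6.

```python
def grid_shape(longitude_extent, latitude_extent, resolution):
    """Return (ny, nx) cell counts for a regular lat-lon grid (hypothetical helper)."""
    nx = round((longitude_extent[1] - longitude_extent[0]) / resolution)
    ny = round((latitude_extent[1] - latitude_extent[0]) / resolution)
    return ny, nx


# Domain and resolution quoted in this thread.
ny, nx = grid_shape((142, 180), (-56, -26), 0.05)

field_bytes = ny * nx * 8   # one float64 value per target cell
two_eib = 2 * 2**60         # assumption: "2Eib" means 2 EiB, in bytes

print(f"target grid: {ny} x {nx} = {ny * nx} cells")
print(f"one float64 field: {field_bytes / 1e6:.2f} MB")
print(f"2 EiB is roughly {two_eib // field_bytes:,} times larger")
```

If these assumptions hold, a request of that size cannot come from the target grid alone, which hints at a badly sized intermediate array rather than a genuinely large domain.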

navidcy commented 5 months ago

Could you provide a code snippet that gives the error when I copy-paste it into Python or a Jupyter notebook?

navidcy commented 5 months ago

I made an MWE.

import regional_mom6 as rmom6

import os
import xarray as xr
from pathlib import Path
from dask.distributed import Client

scratch = "/scratch/v45/nc3020"
gdata = "/g/data/v45/nc3020"
home = "/home/552/nc3020"

expt_name = "bathymetry_mwe"

input_dir = f"{scratch}/regional_mom6_configs/{expt_name}/"
run_dir = f"{home}/mom6_rundirs/{expt_name}/"
toolpath_dir = "/home/157/ahg157/repos/mom5/src/tools/"
tmp_dir = f"{gdata}/{expt_name}"

for path in (run_dir, tmp_dir, input_dir):
    os.makedirs(str(path), exist_ok=True)

expt = rmom6.experiment(
    longitude_extent = (142, 180),
    latitude_extent = (-56, -26),
    resolution = 1/20,
    date_range = ["2003-01-01 00:00:00", "2003-01-05 00:00:00"],
    number_vertical_layers = 75,
    layer_thickness_ratio = 10,
    depth = 4500,
    mom_run_dir = run_dir,
    mom_input_dir = input_dir,
    toolpath_dir = toolpath_dir
)

expt.setup_bathymetry(
    bathymetry_path='/g/data/ik11/inputs/GEBCO_2022/GEBCO_2022.nc',
    longitude_coordinate_name='lon',
    latitude_coordinate_name='lat',
    vertical_coordinate_name='elevation',
    minimum_layers=1
    )

expt.bathymetry.depth.plot()

navidcy commented 5 months ago

The above gives

Begin regridding bathymetry...

If this process hangs it means that the chosen domain might be too big to handle this way. After ensuring access to appropriate computational resources, try calling ESMF directly from a terminal in the input directory via

mpirun ESMF_Regrid -s bathymetry_original.nc -d bathymetry_unfinished.nc -m bilinear --src_var elevation --dst_var elevation --netcdf4 --src_regional --dst_regional

For details see https://xesmf.readthedocs.io/en/latest/large_problems_on_HPC.html

Aftewards, we run 'tidy_bathymetry' method to skip the expensive interpolation step, and finishing metadata, encoding and cleanup.
Regridding in parallel: True

and hangs there for at least 10-15 min, after which I lost patience and killed the kernel.

However, if I change to

    longitude_extent = (142, 144),
    latitude_extent = (-56, -52),
    resolution = 1/4,

I get this plot after a few seconds...

[Image: plot of expt.bathymetry.depth for the reduced domain]

I don't see the claimed bug!

On the contrary, the code warns the user that "If this process hangs it means that the chosen domain might be too big to handle this way", so not only is there no bug, but the code also tells users how to avoid a long wait.
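To put the two test cases side by side, here is a minimal sketch that counts target cells and source points for each domain. It assumes the source is read at the native GEBCO_2022 spacing of 15 arc-seconds (1/240 degree) over the same extent, which may not match what `setup_bathymetry` actually does; `n_points` is a hypothetical helper.

```python
def n_points(longitude_extent, latitude_extent, spacing):
    """Number of points on a regular lat-lon grid with the given spacing in degrees."""
    nx = round((longitude_extent[1] - longitude_extent[0]) / spacing)
    ny = round((latitude_extent[1] - latitude_extent[0]) / spacing)
    return nx * ny


GEBCO_SPACING = 1 / 240  # 15 arc-seconds, the GEBCO_2022 grid interval

domains = {
    "large": ((142, 180), (-56, -26), 1 / 20),
    "small": ((142, 144), (-56, -52), 1 / 4),
}

counts = {}
for name, (lon, lat, resolution) in domains.items():
    target = n_points(lon, lat, resolution)     # cells on the model grid
    source = n_points(lon, lat, GEBCO_SPACING)  # GEBCO points over the same extent
    counts[name] = (target, source)
    print(f"{name}: {target} target cells, {source} source points")

print("target ratio:", counts["large"][0] / counts["small"][0])
```

Under these assumptions the large domain has a few thousand times more target cells than the small one, so the small case finishing in seconds says little about how the original domain should perform.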

ashjbarnes commented 5 months ago

Thanks. The point, though, is that the code used to work with the same sized example and the same sized compute, just in the Jupyter notebook. So something has still messed up the code's efficiency.

navidcy commented 5 months ago

OK. A performance issue :)

ashjbarnes commented 5 months ago

I've tried with mpirun and that breaks too, despite being given ample resources (96 cores, 250 GB memory). This points to an issue with the hgrid and raw bathymetry files, as these are what get fed into the mpirun command. Or with xESMF itself somehow? I'll keep looking into it, but it might take me a while.
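For reproducibility, the mpirun call can be scripted. The sketch below only rebuilds the command quoted in the warning message above; the `-np` option is an assumption about the local MPI launcher, and `esmf_regrid_command` and `run_regrid` are hypothetical helpers, not part of regional_mom6.

```python
import shlex
import subprocess


def esmf_regrid_command(src="bathymetry_original.nc",
                        dst="bathymetry_unfinished.nc",
                        var="elevation",
                        nprocs=None):
    """Build the ESMF_Regrid command quoted in the warning message as an argument list."""
    cmd = ["mpirun"]
    if nprocs is not None:
        # Assumption: the MPI launcher accepts -np for the process count.
        cmd += ["-np", str(nprocs)]
    cmd += [
        "ESMF_Regrid",
        "-s", src,
        "-d", dst,
        "-m", "bilinear",
        "--src_var", var,
        "--dst_var", var,
        "--netcdf4",
        "--src_regional",
        "--dst_regional",
    ]
    return cmd


def run_regrid(input_dir, nprocs=None):
    """Run the regridding inside input_dir; needs ESMF and MPI available on PATH."""
    return subprocess.run(esmf_regrid_command(nprocs=nprocs),
                          cwd=input_dir, check=True)


# Print the command so it can be pasted into the issue alongside the error.
print(shlex.join(esmf_regrid_command()))
```

Posting the printed command together with the full error output would make it easier to tell whether the failure comes from the input files or from ESMF itself.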