SorooshMani-NOAA / OCSMeshTutorial

This repository stores files and documentation used for the OCSMesh tutorial sessions
Creative Commons Zero v1.0 Universal

Memory requirements for running the tutorial #4

Closed SorooshMani-NOAA closed 1 year ago

SorooshMani-NOAA commented 1 year ago

Follow up on (emphasis mine):

I have downloaded WSL and tried it out on my laptop. This does fix an issue relating to saving a mesh (I still experience a permission denied error when attempting to save a mesh outside the main home folder, but that's not a big issue), and the remaining issues I have with OCSMesh involve the amount of memory available on my laptop. The main memory-intensive parts of the tutorial provided seem to be refining the size function with NWM river shapes and the geom.get_multipolygon command applied to the gebco file. How much memory needs to be allocated in order for the notebook to successfully run these parts of the code? Also, in Frontera/TACC, do you know how I would be able to effectively use multiple compute nodes for mesh generation? Using a single compute node results in the notebook running out of memory when refining the size function with NWM river shapes. Meanwhile, running out of memory using geom.get_multipolygon is only an issue when using WSL on my laptop. For reference, my laptop has 16GB of memory and the WSL has 8GB allocated to it.

More specifically the main questions to address are:

How much memory needs to be allocated in order for the notebook to successfully run

and

how I would be able to effectively use multiple compute nodes

SorooshMani-NOAA commented 1 year ago

More context:

To give more detail about the issues I am experiencing, for the part of the tutorial that involves refining the size function with NWM river shapes, when I ran that part in Frontera, the code was running for about an hour and a half, and after it finished, it did not load the river data correctly. Meanwhile, for the tidal run examples, when running the second example on Frontera, the portion of the code generating the mesh gave a memory allocation error. To reduce the amount of memory required for this portion of the code, I adjusted the size of the circle on which the mesh is generated. The tutorial generates a circle with a radius of 170km, and I was able to generate the mesh after changing the radius of the circle to 75km. I only tested radii of 75km and 100km, so the maximum value for which I am able to set the radius and get the tutorial to run properly is somewhere between those two values. I have attached two pictures below: one of which shows the mesh generated for the smaller circle and the other of which shows the output I obtain from trying to run the river refinement part of the code. In both pictures, the left figure is what I obtained from running the code while the right figure is the output shown in the tutorial.

image image

SorooshMani-NOAA commented 1 year ago

I'm still profiling to get the memory requirements, but I think I know what the issue with NWM is. When developing the tutorial, I was using the NWM version 2.0 dataset, whose first layer held the river reaches. If you now follow the link in the instructions to get the NWM dataset, you'll get v2.1. The order of layers in this version is different, so the statement gdf_nwm_rivers = gpd.read_file(data/'NWM_v2.0_channel_hydrofabric/nwm_v2_0_hydrofabric.gdb', layer=0) in example 4 no longer works: it reads the wrong layer.

To fix the issue, please use the following lines instead. I'll update the Jupyter notebook a bit later.

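import fiona
import geopandas as gpd
# Look up the river-reach layer by name, since its position differs between NWM versions.
nwm_gdb = data/'NWM_<YOUR_VERSION>_channel_hydrofabric/nwm_<YOUR_VERSION>_hydrofabric.gdb'
conus_rivers_idx = fiona.listlayers(nwm_gdb).index('nwm_reaches_conus')
gdf_nwm_rivers = gpd.read_file(nwm_gdb, layer=conus_rivers_idx)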
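
If the expected layer name is missing from a given NWM release, the bare .index() call above fails with a terse ValueError. A minimal sketch of a more descriptive lookup follows. The helper name layer_index is hypothetical (it is not part of fiona, geopandas, or OCSMesh), and it works on any list of names, so in the notebook it would presumably be fed fiona.listlayers(nwm_gdb).

```python
def layer_index(layer_names, wanted="nwm_reaches_conus"):
    """Return the position of `wanted` in `layer_names`.

    Hypothetical helper: raises a KeyError that lists the available
    layers, which makes a renamed or reordered layer easier to diagnose.
    """
    names = list(layer_names)
    try:
        return names.index(wanted)
    except ValueError:
        raise KeyError(
            f"layer {wanted!r} not found; available layers: {names}"
        ) from None


# Assumed usage in the notebook (fiona and nwm_gdb as in the lines above):
#   conus_rivers_idx = layer_index(fiona.listlayers(nwm_gdb))
```

Looking the layer up by name rather than by position keeps the notebook working regardless of how a particular dataset version orders its layers.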

SorooshMani-NOAA commented 1 year ago

The following screen captures show the memory need for some of the operations within the tutorial:

image

image

This last one is a little strange: the memory profiler reports ~14GB, but the system monitor showed ~36GB of memory usage during the operation!

image

In any case, all of the above operations were run sequentially on a host with a total of 96GB RAM.
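
One possible contributor to a gap like the one above is that a profiler attached to the main process may not count memory held by worker processes. This is an assumption, not something confirmed in this thread. The sketch below uses only the Python standard library (the Unix-only resource module) to report peak resident memory for the current process and, separately, for its finished child processes. The function name peak_rss_gib is hypothetical.

```python
import resource
import sys


def peak_rss_gib():
    """Return peak resident set size in GiB for this process and its children.

    "children" covers only child processes that have already terminated
    and been waited for, and it reports the largest single child, not
    the sum over all children.
    """
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024

    def to_gib(value):
        return value * bytes_per_unit / 1024**3

    own = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    kids = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {"self": to_gib(own), "children": to_gib(kids)}


if __name__ == "__main__":
    usage = peak_rss_gib()
    print(f"peak RSS: self={usage['self']:.2f} GiB, "
          f"largest finished child={usage['children']:.2f} GiB")
```

Calling this after a memory-heavy notebook cell gives a second data point to compare against the profiler and the system monitor.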

SorooshMani-NOAA commented 1 year ago

Follow-up email communication:

After implementing the changes you specified, the code is able to load the river features properly (the plot given before the mesh refinement step is now fixed), but I experience the "Cannot allocate memory" error when attempting to run

hfun_obj_ex4.add_feature(gpd.GeoSeries(rivers_ex4, crs=4326).to_crs(hfun_msh_t_7.crs).unary_union, target_size=50, expansion_rate=0.005)

BrendanGramp commented 1 year ago

OCSMeshTutorialMemoryError

Here is a picture of the error I'm experiencing.

SorooshMani-NOAA commented 1 year ago

@BrendanGramp just to make sure we're on the same page, are you using the notebook as is (only updating paths)? Or have you made any changes to the segments of the code before it?

For example, are you passing only the river segments that intersect the domain (rivers_ex4 = geom_poly_5.intersection(gdf_nwm_rivers.unary_union)), or do you pass all the NWM rivers to add_feature?

Solution

In any case, one solution for this issue is to pass the nprocs argument to add_feature with a small value. For example, pass 8 or 4 or even 1 and see whether the issue is resolved. This reduces the number of times static data needs to be replicated for the multiprocess segment of the code. The number of processors on the compute node may be so high that the default value (i.e. the number of available processors) results in too many duplications for the memory available.

Please let me know if this helps with the issue. Thanks.
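
The reasoning above can be turned into a rough back-of-the-envelope estimate. The sketch below assumes a simple linear model in which each worker process holds its own copy of the static data. Both that model and the numbers in the usage example are assumptions for illustration, not measurements of OCSMesh, and pick_nprocs is a hypothetical helper rather than part of the library.

```python
import os


def pick_nprocs(available_gib, per_worker_gib, base_gib=0.0, max_procs=None):
    """Largest worker count whose estimated footprint fits in memory.

    Assumed model: total = base_gib + nprocs * per_worker_gib.
    Always returns at least 1, even if one worker is estimated not to fit.
    """
    if per_worker_gib <= 0:
        raise ValueError("per_worker_gib must be positive")
    if max_procs is None:
        max_procs = os.cpu_count() or 1
    budget = available_gib - base_gib
    fits = int(budget // per_worker_gib)
    return max(1, min(max_procs, fits))


# Illustrative numbers only: 8 GiB available, 1 GiB baseline,
# 3 GiB replicated per worker, on a node with 56 cores.
print(pick_nprocs(8, 3, base_gib=1, max_procs=56))
```

The result would then be passed as the nprocs argument mentioned above. On a many-core node the default of one worker per core can exceed the memory budget even when a handful of workers would fit.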

BrendanGramp commented 1 year ago

@SorooshMani-NOAA I have been using the same code as what was given in the tutorial outside of changing the names of the nwm and gebco files to match the files that were downloaded. Also, for the second example generating a larger mesh around NY Harbor, for the command

base_circ_ex2 = geometry.Point([0, 0]).buffer(170e3)

I adjusted the buffer size by making it smaller until I was able to generate a mesh. I was able to do this when using 80e3 as my buffer value. I generally run the entire Jupyter notebook up to the river refinement step when testing that part of the code, while for the part of the code corresponding to the second example of a mesh in the NY Harbor, I only run the first four code blocks of the Jupyter notebook before jumping to that part. Also, in case it is relevant, I've noticed that other parts of the code can result in a "Cannot allocate memory" error if the error is experienced earlier and I have not restarted the kernel.

I will look into the solution you mentioned.
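
For a sense of scale on the buffer change described above, the area of a circular domain grows with the square of its radius, so shrinking the radius from 170e3 to 80e3 meters reduces the covered area by roughly a factor of 4.5. The sketch below is plain arithmetic on ideal circles. Note that buffer() produces a polygon approximation of a circle, and that memory use tracking area is an assumption, since it also depends on the size function.

```python
import math


def circle_area_km2(radius_m):
    """Area in square kilometers of an ideal circle with radius in meters."""
    return math.pi * (radius_m / 1000.0) ** 2


tutorial_area = circle_area_km2(170e3)  # radius used in the tutorial
reduced_area = circle_area_km2(80e3)    # radius that fit in memory above
ratio = tutorial_area / reduced_area

print(f"tutorial domain: {tutorial_area:,.0f} km^2")
print(f"reduced domain:  {reduced_area:,.0f} km^2")
print(f"area ratio:      {ratio:.2f}")
```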

BrendanGramp commented 1 year ago

@SorooshMani-NOAA If I understand this correctly, this would help with the river refinement step, but not with the portion of the tutorial generating a larger mesh for the New York Bay Area, right? Because that is what I experienced. I set nprocs = 1 in the add_feature part of the code and I was able to generate the refined mesh. I am able to pass nprocs = 1 in the .get_multipolygon() step, but I still experience the memory allocation error at the mesh generation step hfun_jig_ex2 = hfun_ex2.msh_t().

So in summary, I can now run the part of the tutorial refining the size function with NWM river shapes without issue, but I am still experiencing issues with the part of the tutorial where a mesh is generated for a larger region around NY harbor.

SorooshMani-NOAA commented 1 year ago

For non-"collector" size functions, you can pass nprocs to the add_feature function, but for collector types, such as in

hfun_ex2 = ocsmesh.Hfun(
    hfun_rasters_ex2,
    base_shape=base_gs_ex2.unary_union,
    base_shape_crs=base_gs_ex2.crs,
    hmin=200, hmax=8000,
    method='fast')

you can pass it to the constructor, i.e.

hfun_ex2 = ocsmesh.Hfun(
    hfun_rasters_ex2,
    base_shape=base_gs_ex2.unary_union,
    base_shape_crs=base_gs_ex2.crs,
    hmin=200, hmax=8000,
    method='fast',
    nprocs=1
)

If adding nprocs=1 there doesn't help, you can also try changing method='fast' to method='exact'.
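
The "try one setting, then fall back to another" approach suggested above can be sketched as a small generic helper. first_that_fits is a hypothetical name, not an OCSMesh function, and treating MemoryError and OSError with errno ENOMEM ("Cannot allocate memory") as the failures to retry on is an assumption about how the problem surfaces. As noted earlier in the thread, a kernel that has already hit a memory error may need a restart, so retrying in the same process may not always be reliable.

```python
import errno


def first_that_fits(build, attempts):
    """Call build(**kwargs) for each kwargs dict until one succeeds.

    Returns (result, kwargs) for the first attempt that does not run out
    of memory. Errors other than out-of-memory are re-raised immediately.
    """
    last_error = None
    for kwargs in attempts:
        try:
            return build(**kwargs), kwargs
        except MemoryError as exc:
            last_error = exc
        except OSError as exc:
            if exc.errno != errno.ENOMEM:
                raise
            last_error = exc
    raise RuntimeError("all attempts ran out of memory") from last_error


# Assumed usage with the constructor call shown above, where
# build = functools.partial(ocsmesh.Hfun, hfun_rasters_ex2, ...):
#   attempts = [
#       {"method": "fast", "nprocs": 1},
#       {"method": "exact", "nprocs": 1},
#   ]
#   hfun_ex2, used = first_that_fits(build, attempts)
```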

BrendanGramp commented 1 year ago

@SorooshMani-NOAA Adding nprocs = 1 fixed my issue. I'm now able to run each component of the tutorial Jupyter notebook. Thank you so much!

SorooshMani-NOAA commented 1 year ago

Thanks for your feedback! I'll close this ticket then. Please feel free to reach out (and/or open new tickets) if any issues arise.