coecms / mppnccombine-fast

A fast version of mppnccombine
https://mppnccombine-fast.readthedocs.io
Apache License 2.0

Memory scaling #22

Open aidanheerdegen opened 5 years ago

aidanheerdegen commented 5 years ago

This command:

cd /short/v45/aph502/mom/archive/gfdl_troppac_full/restart000
mpirun -n 2 /home/502/aph502/code/c/mppnccombine-fast/mppnccombine-fast -v --debug --force -o ocean_density.res.nc ocean_density.res.nc.*

Fails with a memory error when run on a login node with a 2GB max memory limit:

RSS exceeded.user=aph502, pid=17518, cmd=mppnccombine-fa, rss=2214264, rlim=2097152

The data structure is:

/short/v45/aph502/mom/archive/gfdl_troppac_full/restart000/ocean_density.res.nc
Time :  1.0 time
rho               :: (1, 50, 1080, 1440) :: rho
rho_salinity      :: (1, 50, 1080, 1440) :: rho_salinity
pressure_at_depth :: (1, 50, 1080, 1440) :: pressure_at_depth
denominator_r     :: (1, 50, 1080, 1440) :: denominator_r
drhodT            :: (1, 50, 1080, 1440) :: drhodT
drhodS            :: (1, 50, 1080, 1440) :: drhodS
drhodz_zt         :: (1, 50, 1080, 1440) :: drhodz_zt
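
For scale (my arithmetic, assuming the fields are double precision): a single 1 x 50 x 1080 x 1440 field is 50 * 1080 * 1440 * 8 bytes ≈ 593 MiB uncompressed, and there are seven of them here (roughly 4.1 GiB in total), so a rank only has to hold a fraction of the decompressed chunk data at once to blow through the 2GB limit.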

I can collate much larger tenth-degree data files, and other restart files from the same directory.

For example:

cd /short/v45/aph502/mom/archive/gfdl_troppac_full/restart000
mpirun -n 2 /home/502/aph502/code/c/mppnccombine-fast/mppnccombine-fast -v --debug --force -o ocean_velocity.res.nc ocean_velocity.res.nc.*
Total compressed size 1.18 GiB | Time 25.45s | 47.65 MiB / sec
10.77user 60.17system 0:50.20elapsed 141%CPU (0avgtext+0avgdata 1101820maxresident)k
0inputs+4222856outputs (5major+724442minor)pagefaults 0swaps

Data structure:

/short/v45/aph502/mom/archive/gfdl_troppac_full/restart000/ocean_velocity.res.nc
Time :  1.0 time
u :: (1, 50, 1080, 1440) :: u
v :: (1, 50, 1080, 1440) :: v

and a much smaller output (to look at memory scaling):

cd /short/v45/aph502/mom/archive/gfdl_troppac_full/restart000
mpirun -n 2 /home/502/aph502/code/c/mppnccombine-fast/mppnccombine-fast -v --debug --force -o ocean_barotropic.res.nc ocean_barotropic.res.nc.*
Total compressed size 0.15 GiB | Time 21.19s | 7.44 MiB / sec
7.93user 40.70system 0:41.04elapsed 118%CPU (0avgtext+0avgdata 386520maxresident)k
8inputs+570096outputs (5major+993245minor)pagefaults 0swaps

Data structure (all 2D fields):

/short/v45/aph502/mom/archive/gfdl_troppac_full/restart000/ocean_barotropic.res.nc
Time :  1.0 time
eta_t         :: (1, 1080, 1440) :: eta_t
anompb        :: (1, 1080, 1440) :: anompb
conv_rho_ud_t :: (1, 1080, 1440) :: conv_rho_ud_t
eta_t_bar     :: (1, 1080, 1440) :: eta_t_bar
anompb_bar    :: (1, 1080, 1440) :: anompb_bar
eta_u         :: (1, 1080, 1440) :: eta_u
pbot_u        :: (1, 1080, 1440) :: pbot_u
patm_t        :: (1, 1080, 1440) :: patm_t
udrho         :: (1, 1080, 1440) :: udrho
vdrho         :: (1, 1080, 1440) :: vdrho
eta_nonbouss  :: (1, 1080, 1440) :: eta_nonbouss
forcing_u_bt  :: (1, 1080, 1440) :: forcing_u_bt
forcing_v_bt  :: (1, 1080, 1440) :: forcing_v_bt
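
Rough uncompressed sizes for comparison (my numbers, again assuming double precision): ocean_velocity is 2 * 593 MiB ≈ 1.2 GiB and peaked at about 1.05 GiB resident, ocean_barotropic is 13 * 11.9 MiB ≈ 0.15 GiB and peaked at about 0.37 GiB resident, and ocean_density would be about 4.1 GiB, so peak RSS looks like it tracks the uncompressed data volume plus a few hundred MiB of baseline.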

Is that scaling what you would expect @ScottWales?

aidanheerdegen commented 5 years ago

Possibly related:

https://forum.hdfgroup.org/t/memory-leak-with-reading-chunked-data-part-2-sec-unclassified/3251

https://forum.hdfgroup.org/t/valgrind-shows-mmemory-leak-during-h5dread/3150

ScottWales commented 5 years ago

Sample collation completes successfully using 4 MPI ranks.

The issue is within nc_get_var. It doesn't seem to be the chunk cache, which is only 4 MB.
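
For reference, one way to confirm the per-variable cache size (a standalone sketch against the netCDF-C API, not code from mppnccombine-fast; pass the file and a variable name such as "rho"):

#include <stdio.h>
#include <netcdf.h>

int main(int argc, char **argv) {
    int ncid, varid;
    size_t size, nelems;
    float preemption;

    if (argc != 3) return 1;
    if (nc_open(argv[1], NC_NOWRITE, &ncid) != NC_NOERR) return 1;
    if (nc_inq_varid(ncid, argv[2], &varid) != NC_NOERR) return 1;

    /* Per-variable chunk cache as configured in this netCDF build */
    if (nc_get_var_chunk_cache(ncid, varid, &size, &nelems, &preemption) == NC_NOERR)
        printf("chunk cache: %zu bytes, %zu slots, preemption %.2f\n",
               size, nelems, preemption);

    nc_close(ncid);
    return 0;
}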

Trace from valgrind+massif:

99.83% (1,028,279,903B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->95.10% (979,566,312B) 0x54D053D: H5FL_blk_malloc (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| ->94.03% (968,546,008B) 0x544E576: H5D__chunk_lock (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| | ->94.03% (968,546,008B) 0x544F85B: H5D__chunk_read (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| |   ->94.03% (968,546,008B) 0x546DE40: H5D__read (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| |     ->94.03% (968,546,008B) 0x546E659: H5Dread (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| |       ->94.03% (968,546,008B) 0x4EBB7F8: nc4_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| |         ->94.03% (968,546,008B) 0x4EAEC8D: NC4_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| |           ->94.03% (968,546,008B) 0x4E5BE39: NC_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| |             ->94.03% (968,546,008B) 0x4E5D06C: nc_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| |               ->94.03% (968,546,008B) 0x4056DC: copy_netcdf_variable_chunks (read_chunked.c:224)
| |                 ->94.03% (968,546,008B) 0x405ED1: copy_chunked (read_chunked.c:379)
| |                   ->94.03% (968,546,008B) 0x402A13: main (mppnccombine-fast.c:396)
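
The H5FL_blk_malloc frame suggests the decompressed chunk buffers are being retained on HDF5's internal free lists rather than handed back to the OS, which looks consistent with the forum threads linked above. A quick way to test that theory (my own sketch, not something mppnccombine-fast does at the moment) would be to cap the block free lists, or force a garbage collect after each variable is copied:

#include <hdf5.h>

/* Cap HDF5's internal free lists so freed chunk buffers go back to the
 * system instead of being cached. Arguments are (regular global, regular
 * per-list, array global, array per-list, block global, block per-list)
 * limits in bytes; -1 means no limit. */
static void cap_hdf5_free_lists(void) {
    H5set_free_list_limits(-1, -1, -1, -1,
                           1 * 1024 * 1024,   /* block free lists: 1 MiB overall */
                           1 * 1024 * 1024);  /* 1 MiB per list */
}

/* Or release whatever is currently cached, e.g. between variables. */
static void flush_hdf5_free_lists(void) {
    H5garbage_collect();
}
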
aidanheerdegen commented 5 years ago

Using 4 ranks spreads the memory usage out, so it doesn't breach the 2GB limit, but I guess the overall usage is still the same?

aidanheerdegen commented 5 years ago

This appears to be a netCDF/HDF library issue, which is beyond our control. We just have to be aware of it when advising users about memory usage.
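
For advising users, a rule of thumb that seems to fit the numbers above is to size memory requests against the uncompressed data volume of the file being collated. A minimal sketch (my own, standard netCDF-C inquiry calls only) that sums the uncompressed size of every variable:

#include <stdio.h>
#include <netcdf.h>

int main(int argc, char **argv) {
    int ncid, nvars;
    double total = 0.0;

    if (argc != 2 || nc_open(argv[1], NC_NOWRITE, &ncid) != NC_NOERR) return 1;
    nc_inq_nvars(ncid, &nvars);

    for (int v = 0; v < nvars; v++) {
        int ndims, dimids[NC_MAX_VAR_DIMS];
        nc_type type;
        size_t typesize, len, elems = 1;

        nc_inq_var(ncid, v, NULL, &type, &ndims, dimids, NULL);
        nc_inq_type(ncid, type, NULL, &typesize);
        for (int d = 0; d < ndims; d++) {
            nc_inq_dimlen(ncid, dimids[d], &len);
            elems *= len;
        }
        total += (double)elems * typesize;
    }

    printf("uncompressed data: %.2f GiB\n", total / (1024.0 * 1024.0 * 1024.0));
    nc_close(ncid);
    return 0;
}

Summed over the input tiles (or run on an already collated file), this gives roughly the volume the reader ranks have to decompress between them (about 4.1 GiB for ocean_density.res.nc above), which can be divided by the rank count when deciding how much memory to ask for.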