aidanheerdegen opened this issue 5 years ago
Sample collation completes successfully using 4 MPI ranks.
The issue is within `nc_get_var`. It doesn't seem to be the chunk cache, which is only 4 MB.
Trace from valgrind+massif:
```
99.83% (1,028,279,903B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->95.10% (979,566,312B) 0x54D053D: H5FL_blk_malloc (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| ->94.03% (968,546,008B) 0x544E576: H5D__chunk_lock (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| | ->94.03% (968,546,008B) 0x544F85B: H5D__chunk_read (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| | ->94.03% (968,546,008B) 0x546DE40: H5D__read (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| | ->94.03% (968,546,008B) 0x546E659: H5Dread (in /local/swales/conda/mppnccombine-fast-build/lib/libhdf5.so.101.1.0)
| | ->94.03% (968,546,008B) 0x4EBB7F8: nc4_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| | ->94.03% (968,546,008B) 0x4EAEC8D: NC4_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| | ->94.03% (968,546,008B) 0x4E5BE39: NC_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| | ->94.03% (968,546,008B) 0x4E5D06C: nc_get_vara (in /local/swales/conda/mppnccombine-fast-build/lib/libnetcdf.so.13)
| | ->94.03% (968,546,008B) 0x4056DC: copy_netcdf_variable_chunks (read_chunked.c:224)
| | ->94.03% (968,546,008B) 0x405ED1: copy_chunked (read_chunked.c:379)
| | ->94.03% (968,546,008B) 0x402A13: main (mppnccombine-fast.c:396)
```
Using 4 ranks spreads the memory usage across processes, so no single rank breaches the 2GB limit, but the total allocation is presumably the same overall?
This appears to be a netCDF/HDF5 library issue that is beyond our control; we just have to be aware of it when advising users about memory usage.
This command:
Fails with a memory error when run on a login node with a 2GB max memory limit:
The data structure is
I can collate much larger tenth-degree data files, and other restart files from the same directory.
For example:
Data structure:
and a much smaller output (to look at memory scaling)
Data structure (all 2D fields):
Is that scaling what you would expect @ScottWales?