Parallel-NetCDF / PnetCDF

Source code repository of PnetCDF library and utilities
https://parallel-netcdf.github.io

Performance issues on login vs. compute nodes #100

Closed: kserradell closed this issue 8 months ago

kserradell commented 1 year ago

Hello,

We are deploying a new HPC system and are facing a performance issue: PnetCDF-based applications perform worse when executed on the compute nodes than on the login nodes. We have run many tests and tried many changes but are still clueless about the cause. Both node types have identical hardware (including the NUMA layout); the main difference is the OS (RedHat 8.6 vs. Rocky Linux 8.6), which should behave very similarly.

[process@mmlogin01 /gxfs/scratch/process/kserradell/benchmarks/pnetcdf-1.12.2_iompi/benchmarks/FLASH-IO ]$ mpirun -np 128 --bind-to core ./flash_benchmark_io
 number of guards      :             4
 number of blocks      :            80
 number of variables   :            24
 checkpoint time       :            10.26  sec
        max header     :             1.65  sec
        max unknown    :             8.61  sec
        max close      :             1.78  sec
        I/O amount     :         62203.62  MiB
 plot no corner        :             1.23  sec
        max header     :             0.27  sec
        max unknown    :             0.95  sec
        max close      :             0.29  sec
        I/O amount     :          5184.65  MiB
 plot    corner        :             1.20  sec
        max header     :             0.18  sec
        max unknown    :             1.01  sec
        max close      :             0.30  sec
        I/O amount     :          5685.95  MiB
 -------------------------------------------------------
 File base name        : flash_io_test_
 Total I/O amount      :         73074.22  MiB
 -------------------------------------------------------
 nproc    array size      exec (sec)   bandwidth (MiB/s)
  128    32 x  32 x  32     12.69     5758.62

 MPI File Info: nkeys =          10
MPI File Info: [ 0] key =         nc_var_align_size, value =1
MPI File Info: [ 1] key =            romio_cb_write, value =enable
MPI File Info: [ 2] key =            romio_ds_write, value =disable
MPI File Info: [ 3] key =      nc_header_align_size, value =262144
MPI File Info: [ 4] key =      nc_record_align_size, value =512
MPI File Info: [ 5] key = nc_header_read_chunk_size, value =262144
MPI File Info: [ 6] key =          nc_in_place_swap, value =auto
MPI File Info: [ 7] key =              nc_ibuf_size, value =16777216
MPI File Info: [ 8] key =         pnetcdf_subfiling, value =disable
MPI File Info: [ 9] key =           nc_num_subfiles, value =0
[process@mmnode001 /gxfs/scratch/process/kserradell/benchmarks/pnetcdf-1.12.2_iompi/benchmarks/FLASH-IO ]$ mpirun -np 128 --bind-to core ./flash_benchmark_io
 number of guards      :             4
 number of blocks      :            80
 number of variables   :            24
 checkpoint time       :           148.84  sec
        max header     :             1.80  sec
        max unknown    :           147.02  sec
        max close      :            49.12  sec
        I/O amount     :         62203.62  MiB
 plot no corner        :            16.35  sec
        max header     :             0.37  sec
        max unknown    :            15.96  sec
        max close      :            15.30  sec
        I/O amount     :          5184.65  MiB
 plot    corner        :            16.76  sec
        max header     :             0.09  sec
        max unknown    :            16.66  sec
        max close      :            15.94  sec
        I/O amount     :          5685.95  MiB
 -------------------------------------------------------
 File base name        : flash_io_test_
 Total I/O amount      :         73074.22  MiB
 -------------------------------------------------------
 nproc    array size      exec (sec)   bandwidth (MiB/s)
  128    32 x  32 x  32    181.95      401.61

 MPI File Info: nkeys =          10
MPI File Info: [ 0] key =         nc_var_align_size, value =1
MPI File Info: [ 1] key =            romio_cb_write, value =enable
MPI File Info: [ 2] key =            romio_ds_write, value =disable
MPI File Info: [ 3] key =      nc_header_align_size, value =262144
MPI File Info: [ 4] key =      nc_record_align_size, value =512
MPI File Info: [ 5] key = nc_header_read_chunk_size, value =262144
MPI File Info: [ 6] key =          nc_in_place_swap, value =auto
MPI File Info: [ 7] key =              nc_ibuf_size, value =16777216
MPI File Info: [ 8] key =         pnetcdf_subfiling, value =disable
MPI File Info: [ 9] key =           nc_num_subfiles, value =0
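
For reference, these hints can be adjusted at run time without recompiling, which makes it easy to test variations on both node types. A minimal sketch, assuming a PnetCDF build that honors the PNETCDF_HINTS environment variable (the values below simply repeat the defaults reported above, they are not a tuned configuration):

# Override PnetCDF / MPI-IO hints via PNETCDF_HINTS (semicolon-separated key=value pairs)
export PNETCDF_HINTS="romio_cb_write=enable;romio_ds_write=disable;nc_header_align_size=262144"
mpirun -np 128 --bind-to core ./flash_benchmark_io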

Using IOR with the NCMPI driver, we do not observe these differences; they show up only in our main application (WRF) and in the FLASH-IO benchmark.
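
For anyone who wants to reproduce that comparison, a sketch of the kind of IOR run we mean, using IOR's NCMPI backend (the block and transfer sizes here are arbitrary illustrations, not the exact values we used):

# IOR write test through the PnetCDF (NCMPI) backend; sizes are illustrative only
mpirun -np 128 --bind-to core ./ior -a NCMPI -w -b 16m -t 1m -o /gxfs/scratch/process/kserradell/ior_testfile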

Any help or pointers on where to look would be highly appreciated. I can provide more details if needed.

Thanks,

KiM

wkliao commented 1 year ago

One big factor in parallel I/O performance is the file system. One possibility is that /gxfs/scratch/process/kserradell is a local file system on mmlogin01 but a remotely mounted file system on mmnode001. The mount and df commands can reveal that information.

You can ask the system administrator which file system on your HPC machine should be used for parallel I/O.
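
For example, using the path from your output (these are just standard Linux commands, nothing PnetCDF-specific); run them on both mmlogin01 and mmnode001 and compare the file system type and mount source:

# Report the file system type and device backing the benchmark directory
df -hT /gxfs/scratch/process/kserradell
# Show the matching mount entry (device, type, mount options)
mount | grep gxfs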

kserradell commented 1 year ago

Hello @wkliao

Thanks for your answer. No, in both cases we are writing to GPFS.

The main difference was that the login node has Hyper-Threading enabled while the compute nodes did not. We noticed that if the benchmark's MPI tasks fill the entire node, there is not enough "room" left for other OS tasks, which hurts the overall performance of the benchmark. Enabling HT and running 128 tasks on a node with 256 hardware threads solved the issue.
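
In case it helps anyone else, a quick way to check the SMT/Hyper-Threading difference between node types (standard Linux commands; the sysfs SMT entry may not be present on older kernels):

# Compare the CPU topology the OS sees on each node type
lscpu | grep -E 'Socket|Core|Thread'
# If the kernel exposes SMT control: 1 = SMT/HT active, 0 = off
cat /sys/devices/system/cpu/smt/active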

wkliao commented 8 months ago

It seems the problem is resolved. This issue can be re-opened if needed.