NOAA-EMC / NCEPLIBS-ncio

This is a NOAA library used by NCEP GSI system to read the GFS forecast files for use in data assimilation.
Other
1 stars 6 forks source link

Add MPI test #47

Closed BrianCurtis-NOAA closed 2 years ago

BrianCurtis-NOAA commented 3 years ago

Adds an MPI version of tst_ncio

Closes #44

BrianCurtis-NOAA commented 3 years ago

Tried testing on Orion, and I keep getting a permission denied with the mpiexec.orion running ctest. I'm missing something, not sure what yet.

edwardhartnett commented 3 years ago

Is this PR going to be developed further?

BrianCurtis-NOAA commented 3 years ago

I'm getting a permissions denied for the call dset = create_dataset('dynf000.nc',dsetin)

 *** Test creation of new dataset from template...
          13 Permission denied
 dynf000.n
BrianCurtis-NOAA commented 3 years ago

nf90_open() and nf90_create() do not have an option for NF90_MPIIO for mode, i don't know why we're using it in the nf90_open() call. The documentation says that a call with variables comm and info set will enable parallel io.

BrianCurtis-NOAA commented 3 years ago
2/2 Testing: tst_ncio_mpi
2/2 Test: tst_ncio_mpi
Command: "/usr/lib64/mpich/bin/mpiexec" "-n" "4" "/home/bcurtis/git/BrianCurtis-NOAA/NCEPLIBS-ncio/build/tests/tst_nci
o_mpi"
Directory: /home/bcurtis/git/BrianCurtis-NOAA/NCEPLIBS-ncio/build/tests
"tst_ncio_mpi" start time: May 17 12:06 EDT
Output:
----------------------------------------------------------
 *** Testing NCEPLIBS-ncio with MPI.
 *** Test creation of new dataset from template...
         -36 NetCDF: Invalid argument
         -36 NetCDF: Invalid argument
STOP 99
         -36 NetCDF: Invalid argument
STOP 99
         -36 NetCDF: Invalid argument
STOP 99
STOP 99
<end of output>
Test time =   0.05 sec
----------------------------------------------------------
Test Failed.
"tst_ncio_mpi" end time: May 17 12:06 EDT
"tst_ncio_mpi" time elapsed: 00:00:00
----------------------------------------------------------

End testing: May 17 12:06 EDT
BrianCurtis-NOAA commented 3 years ago

I've added NC90_MPIIO i've taken it away and coded it as shown the documentation, but the create_dataset() call in NCEPLIBS-ncio doesn't seem to like the extra argument. @edwardhartnett maybe you can dig into it a bit. I will as well.

edwardhartnett commented 3 years ago

@jswhit @CoryMartin-NOAA seems like parallel I/O with NCEPLIBS-ncio does not work. @BrianCurtis-NOAA added a test for it, and it's failing.

Can you help get this code working?

CoryMartin-NOAA commented 3 years ago

It definitely 'works' as we have been using it in GSI in operations for a couple months now. @BrianCurtis-NOAA can you elaborate on your comment about nf90_create not having MPI... create_dataset and open_dataset have an optional argument, paropen, that when paropen=.true., will enable parallel IO

BrianCurtis-NOAA commented 3 years ago

Hi Cory, It's most likely me. I had been running it on my desktop PC, but when I ran it on Hera, it made it past the create_dataset call and gave me segfaults at the writing of variable data. So it may be something on my personal PC. At least I have some information that will help me.

CoryMartin-NOAA commented 3 years ago

@BrianCurtis-NOAA this is a good point, there's a decent chance it won't work correctly on certain machines that don't have parallel IO support. Not sure how to handle that though, @edwardhartnett any suggestions?

edwardhartnett commented 3 years ago

We need to check in CMakeLists.txt to see whether netcdf-c provides parallel I/O and then set a pre-processor flag. Here's how the netcdf-fortran CMake build finds parallel I/O:

# Determine C library setting for NC_HAS_PARALLEL4
try_compile(NF_HAS_PARALLEL4 ${CMAKE_CURRENT_BINARY_DIR}
  "${CMAKE_CURRENT_SOURCE_DIR}/CMakeExtras/check_parallel4.c"
  LINK_LIBRARIES netCDF::netcdf
  )

IF(NF_HAS_PARALLEL4)
  ADD_DEFINITIONS(-DNF_HAS_PARALLEL4)
ENDIF()

Where the file check_parallel4.c is:

#include <netcdf_meta.h>
#if !defined(NC_HAS_PARALLEL4) || NC_HAS_PARALLEL4 == 0
      choke me
#endif
int main() {return 0;}

I would prefer this to not involve a separate file, but instead to happen completely within the CMakeLists.txt file. @BrianCurtis-NOAA are you enough of a cmake expert to do this?

@aerorahul is there a better way? What we want is to compile a program with the netcdf_meta.h file, and determine if NC_HAS_PARALLEL4 is set in that file. This is a general problem - there are lots of settings in the netcdf_meta.h file - perhaps all this should be done in FindNetCDF.cmake?

@kgerheiser this is a problem we will have to solve in many NCEPLIBS cmake builds...

aerorahul commented 3 years ago

@edwardhartnett @BrianCurtis-NOAA FindNetCDF.cmake sets a variable NetCDF_PARALLEL to ON or OFF if it finds (or not) parallel netcdf enabled in the NetCDF library. You can use that.

aerorahul commented 3 years ago

https://github.com/BrianCurtis-NOAA/NCEPLIBS-ncio/blob/50168164b899b8115b89260de711c9728cadf2f6/src/CMakeLists.txt#L36

I believe if you replace the above line with:

target_link_libraries(${lib_name} PUBLIC NetCDF::NetCDF_Fortran)
target_link_libraries(${lib_name} PUBLIC MPI::MPI_Fortran)

with solve your MPI issues.

aerorahul commented 3 years ago

I spent some time looking at this and have several suggestions regarding changes to the build that will make it robust. I can either do that before this PR, in this PR or after this PR, depending on the time it will take to finish work on this PR. Please let me know.

BrianCurtis-NOAA commented 3 years ago

I'm happy to merge in your work, Rahul.

BrianCurtis-NOAA commented 3 years ago

Good news! The MPI test passes on Orion. It is just me.

edwardhartnett commented 3 years ago

@aerorahul no doubt your build suggestions will have wide applicability in NCEPLIBS as we add testing in other places, so I look forward to seeing them and learning from them...

BrianCurtis-NOAA commented 3 years ago

I get the following error with my local machine 4 times:

-114 NetCDF: Parallel operation on file opened for non-parallel access
 dynf000_template.nc.indynf000.ncpressfcvgrdtmp_spreadugrdfooError allocating %lu bytesbarworldhellogrid_xtgrid_ytpfulltime*** idate not correctminutes***time units not correct...*** Test reading of slice...*** Test has_var function...*** Test has_att function...max_abs_compression_errorakSUCCESS!missing_value������������G�����?��G�B�����BGP}Dc�C�A�����/home/bcurtis/git/BrianCurtis-NOAA/NCEPLIBS-ncio/src/module_ncio.f90(' since
STOP 99
aerorahul commented 3 years ago

add line in cmake/PackageConfig.cmake.in

find_dependency(MPI REQUIRED)
BrianCurtis-NOAA commented 3 years ago

add line in cmake/PackageConfig.cmake.in

find_dependency(MPI REQUIRED)

Should we make NetCDF REQUIRED and not specific to Fortran? or should the MPI Cal be specific to Fotran?

aerorahul commented 3 years ago

add line in cmake/PackageConfig.cmake.in

find_dependency(MPI REQUIRED)

Should we make NetCDF REQUIRED and not specific to Fortran? or should the MPI Cal be specific to Fotran?

REQUIRED = Yes. If you don't specify Fortran, you won't get the Fortran NetCDF API. MPI find package works differently and you don't need COMPONENT Fortran for MPI.

aerorahul commented 3 years ago

The MPI test hangs on my macOS with gfortran-10 and mpich.

2: Test command: /opt/modules/clang-12.0.5/mpich/3.3.2/bin/mpiexec "-n" "4" "/Users/rmahajan/scratch/nceplibsWork/nceplibs-ncio/build-2/tests/tst_ncio_mpi"
2: Test timeout computed to be: 60
2:  *** Testing NCEPLIBS-ncio with MPI.
2:  *** Test creation of new dataset from template...
2:  *** Test that number of variables,dimensions,attributes is read...
2:  *** Test read of variable data...
2:  *** Test write of variable data...
2/2 Test #2: tst_ncio_mpi .....................***Timeout  60.15 sec
BrianCurtis-NOAA commented 3 years ago

@CoryMartin-NOAA if i build hpc-stack netcdf on my local machine I can get the MPI tests to pass. It does not pass the CI, wondering what may be the cause, but running out of ideas.

aerorahul commented 3 years ago

@BrianCurtis-NOAA The NetCDF library in the github CI does not have parallel enabled:

-- Found NetCDF: /usr/include (found version "4.7.3") found components: Fortran 
-- FindNetCDF defines targets:
--   - NetCDF_VERSION [4.7.3]
**--   - NetCDF_PARALLEL [FALSE]**
--   - NetCDF_C_CONFIG_EXECUTABLE [/bin/nc-config]
--   - NetCDF::NetCDF_C [SHARED] [Root: /usr] Lib: /usr/lib/x86_64-linux-gnu/libnetcdf.so 
--   - NetCDF_Fortran_CONFIG_EXECUTABLE [/bin/nf-config]
--   - NetCDF::NetCDF_Fortran [SHARED] [Root: /usr] Lib: /usr/lib/x86_64-linux-gnu/libnetcdff.so 

I am not sure how to do that with apt-get and brew.

BrianCurtis-NOAA commented 3 years ago

@edwardhartnett Do you know how to get the netcdf library in the github CI to use parallel?

kgerheiser commented 3 years ago

@edwardhartnett is on holiday until early August.

NetCDF/HDF5 needs to be built manually, instead of using apt.

We build it manually in the UFS_UTILS CI here: https://github.com/NOAA-EMC/UFS_UTILS/blob/develop/.github/workflows/netcdf-versions.yml

You could copy and paste that here and remove the NetCDF matrix.

BrianCurtis-NOAA commented 3 years ago

I am happy to do that, but i haven't messed with the github CI before. Let me know if you would rather do it.

kgerheiser commented 3 years ago

I can do it real fast

BrianCurtis-NOAA commented 3 years ago

Fails on MacOS. I don't have MacOS to try on. @aerorahul you did mention it times out with MPI on your own machine. Any ideas on what we could try?

aerorahul commented 3 years ago

Fails on MacOS. I don't have MacOS to try on. @aerorahul you did mention it times out with MPI on your own machine. Any ideas on what we could try?

The MPI test also times out in the macOS action, not just my machine.

kgerheiser commented 3 years ago

Interestingly, it times out for me on macOS using Intel/OpenMPI too. Not just GNU/MPICH.

aerorahul commented 3 years ago

Interestingly, it times out for me on macOS using Intel/OpenMPI too. Not just GNU/MPICH.

It is something in write_vardata and specifically in nf90_put_var. That's where I last stopped in the debugger.

CoryMartin-NOAA commented 3 years ago

My guess is there is a process doing nothing, the writes are generally MPI barriers so it will hang if at least one of the PEs has not written to the file.

edwardhartnett commented 3 years ago

Try simplfiyng the test program by commenting out everything it does and see if it passes macos.

Then, step by step, uncomment the code, so that at first you are just creating/closing the file. Then, one step at a time, start uncommenting out the lines that do things to the netCDF file, until you find the exact line of code that is causing the problem.

As Cory notes, all processing elements must be calling the write routine, because this is a collective action. @BrianCurtis-NOAA you need to figure out which pe is not calling the write code.

Another way to proceed is to look in the ncio code and just copy the netCDF calls directly into your test code, to cut ncio out of the loop. Once you can get the netCDF code to work, you can then see what is different between the netCDF code and the ncio code.

BrianCurtis-NOAA commented 3 years ago

Is there a way for me to test this on MacOS other than making code changes in the repo and letting the github actions test it?

edwardhartnett commented 3 years ago

@BrianCurtis-NOAA checking in changes and seeing what GitHub actions does is how I work on MacOS. It's not perfect but it gets the job done.

BrianCurtis-NOAA commented 3 years ago

@CoryMartin-NOAA It looks as if the write_vardata works the first call, but when that returns finished and moved onto another write_vardata it doesn't like something. You mentioned a process doing nothing, would this still hold true here? I'm still confused why this is specific to the MacOS and not GNU.

BrianCurtis-NOAA commented 3 years ago

@aerorahul Did you have a fork/repo for your cmake changes you gave me a while back? I don't wish to duplicate work you've already done

BrianCurtis-NOAA commented 2 years ago

@edwardhartnett what does error 9 Bad file descriptor mean?

2: Test timeout computed to be: 60
2:  *** Testing NCEPLIBS-ncio with MPI.
2:  *** Testing funtion open_dataset with paropen=.true.
2:            9 Bad file descriptor

Do I need to write my own file to read from? I was using a copy of the one used for the tst_ncio.F90

EDIT: This is only a part of the output from the failed test..

BrianCurtis-NOAA commented 2 years ago
  integer, parameter :: YT = 128, XT = 256
  integer, parameter :: HALF_YT = YT/2, HALF_XT = XT/2

...

! float pressfc(time, grid_yt, grid_xt)
  ! dimensions:
        ! grid_xt = 256 ;
        ! grid_yt = 128 ;
        ! pfull = 64 ;
        ! phalf = 65 ;
        ! time = UNLIMITED ; // (1 currently)
        ! member = 2 ;
  count = (/ 1, HALF_YT, HALF_XT /)
  if (my_rank .eq. 0) then
     start = (/ 1, 1, 1 /)
  else if (my_rank .eq. 1) then
     start = (/ 1, HALF_YT + 1, 1 /)
  else if (my_rank .eq. 2) then
     start = (/ 1, 1, HALF_XT + 1 /)
  else if (my_rank .eq. 3) then
     start = (/ 1, HALF_YT + 1, HALF_XT + 1 /)
  endif

  call read_vardata(dsetin, 'pressfc', values_3d, ncstart=start, nccount=count, errcode=errcode)
  call check(errcode)

Running test

2/2 Test #2: tst_ncio_mpi .....................***Failed    2.25 sec
 *** Testing NCEPLIBS-ncio with MPI.
 *** Testing function open_dataset with paropen=.true.
         -57 NetCDF: Start+count exceeds dimension bound
STOP 99
         -40 NetCDF: Index exceeds dimension bound
STOP 99
         -40 NetCDF: Index exceeds dimension bound
STOP 99
 Error: NetCDF: Index exceeds dimension bound
 *** Test that number of variables,dimensions,attributes is read...
 *** Test read of variable data...
 Error: NetCDF: Start+count exceeds dimension bound
 Error: NetCDF: Index exceeds dimension bound
 Error: NetCDF: Start+count exceeds dimension bound
         -57 NetCDF: Start+count exceeds dimension bound
STOP 99

@edwardhartnett not sure where my count is off here. I can push my changes to the test if you want to look at the whole file.

BrianCurtis-NOAA commented 2 years ago

@edwardhartnett I think i've found out what was backwards:

Is ncdump a bad tool, or is read_vardata doing it backwards?

CoryMartin-NOAA commented 2 years ago

@BrianCurtis-NOAA ncdump is C-style, Fortran is the opposite (column/row ordering)

BrianCurtis-NOAA commented 2 years ago

@BrianCurtis-NOAA ncdump is C-style, Fortran is the opposite (column/row ordering)

Thanks! I had no idea that ncdump was C-based.

edwardhartnett commented 2 years ago

Yes, ncdump is C based, Full documentation can be found here https://www.unidata.ucar.edu/software/netcdf/documentation/NUG/netcdf_utilities_guide.html#ncdump_guide

I always forget the C/Fortran ordering issue...

As the error message says, index plus count is too big. Put all your start/counts in fortran order.