Closed: BrianCurtis-NOAA closed this 2 years ago.
Tried testing on Orion, and I keep getting a permission-denied error from mpiexec.orion when running ctest. I'm missing something; not sure what yet.
Is this PR going to be developed further?
I'm getting a permission-denied error for the call dset = create_dataset('dynf000.nc', dsetin):
*** Test creation of new dataset from template...
13 Permission denied
dynf000.n
nf90_open() and nf90_create() do not have an NF90_MPIIO option for mode; I don't know why we're using it in the nf90_open() call. The documentation says that a call with the comm and info arguments set will enable parallel I/O.
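For reference, a minimal sketch of a parallel open using plain netcdf-fortran calls, relying only on the optional comm and info arguments (the file name dynf000.nc is taken from the test above; error handling is minimal):

```fortran
program par_open
  use mpi
  use netcdf
  implicit none
  integer :: ierr, ncid, ret

  call MPI_Init(ierr)
  ! No NF90_MPIIO flag needed: supplying comm and info requests parallel I/O.
  ret = nf90_open('dynf000.nc', NF90_NOWRITE, ncid, &
                  comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
  if (ret /= nf90_noerr) then
    print *, trim(nf90_strerror(ret))
    stop 99
  endif
  ret = nf90_close(ncid)
  call MPI_Finalize(ierr)
end program par_open
```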
2/2 Testing: tst_ncio_mpi
2/2 Test: tst_ncio_mpi
Command: "/usr/lib64/mpich/bin/mpiexec" "-n" "4" "/home/bcurtis/git/BrianCurtis-NOAA/NCEPLIBS-ncio/build/tests/tst_ncio_mpi"
Directory: /home/bcurtis/git/BrianCurtis-NOAA/NCEPLIBS-ncio/build/tests
"tst_ncio_mpi" start time: May 17 12:06 EDT
Output:
----------------------------------------------------------
*** Testing NCEPLIBS-ncio with MPI.
*** Test creation of new dataset from template...
-36 NetCDF: Invalid argument
-36 NetCDF: Invalid argument
STOP 99
-36 NetCDF: Invalid argument
STOP 99
-36 NetCDF: Invalid argument
STOP 99
STOP 99
<end of output>
Test time = 0.05 sec
----------------------------------------------------------
Test Failed.
"tst_ncio_mpi" end time: May 17 12:06 EDT
"tst_ncio_mpi" time elapsed: 00:00:00
----------------------------------------------------------
End testing: May 17 12:06 EDT
I've added NF90_MPIIO, I've taken it away, and I've coded it as shown in the documentation, but the create_dataset() call in NCEPLIBS-ncio doesn't seem to like the extra argument. @edwardhartnett maybe you can dig into it a bit. I will as well.
@jswhit @CoryMartin-NOAA seems like parallel I/O with NCEPLIBS-ncio does not work. @BrianCurtis-NOAA added a test for it, and it's failing.
Can you help get this code working?
It definitely 'works', as we have been using it in GSI in operations for a couple of months now. @BrianCurtis-NOAA can you elaborate on your comment about nf90_create not having MPI...? create_dataset and open_dataset have an optional argument, paropen, which, when paropen=.true., enables parallel I/O.
Hi Cory, it's most likely me. I had been running it on my desktop PC, but when I ran it on Hera, it made it past the create_dataset call and gave me segfaults when writing variable data. So it may be something on my personal PC. At least I have some information that will help me.
@BrianCurtis-NOAA this is a good point, there's a decent chance it won't work correctly on certain machines that don't have parallel IO support. Not sure how to handle that though, @edwardhartnett any suggestions?
We need to check in CMakeLists.txt to see whether netcdf-c provides parallel I/O and then set a pre-processor flag. Here's how the netcdf-fortran CMake build finds parallel I/O:
# Determine C library setting for NC_HAS_PARALLEL4
try_compile(NF_HAS_PARALLEL4 ${CMAKE_CURRENT_BINARY_DIR}
  "${CMAKE_CURRENT_SOURCE_DIR}/CMakeExtras/check_parallel4.c"
  LINK_LIBRARIES netCDF::netcdf
)
IF(NF_HAS_PARALLEL4)
  ADD_DEFINITIONS(-DNF_HAS_PARALLEL4)
ENDIF()
Where the file check_parallel4.c is:
#include <netcdf_meta.h>
#if !defined(NC_HAS_PARALLEL4) || NC_HAS_PARALLEL4 == 0
choke me
#endif
int main() {return 0;}
I would prefer this to not involve a separate file, but instead to happen completely within the CMakeLists.txt file. @BrianCurtis-NOAA are you enough of a cmake expert to do this?
@aerorahul is there a better way? What we want is to compile a program with the netcdf_meta.h file, and determine if NC_HAS_PARALLEL4 is set in that file. This is a general problem - there are lots of settings in the netcdf_meta.h file - perhaps all this should be done in FindNetCDF.cmake?
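One way to keep the whole check inside CMakeLists.txt is CMake's standard CheckCSourceCompiles module; a sketch, assuming the netCDF::netcdf target propagates its include path when listed in CMAKE_REQUIRED_LIBRARIES (CMake 3.12+):

```cmake
# Probe netcdf_meta.h inline, with no separate check_parallel4.c file.
include(CheckCSourceCompiles)
set(CMAKE_REQUIRED_LIBRARIES netCDF::netcdf)
check_c_source_compiles("
#include <netcdf_meta.h>
#if !defined(NC_HAS_PARALLEL4) || NC_HAS_PARALLEL4 == 0
#error netcdf-c was built without parallel I/O
#endif
int main(void) {return 0;}
" NF_HAS_PARALLEL4)
unset(CMAKE_REQUIRED_LIBRARIES)
if(NF_HAS_PARALLEL4)
  add_definitions(-DNF_HAS_PARALLEL4)
endif()
```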
@kgerheiser this is a problem we will have to solve in many NCEPLIBS cmake builds...
@edwardhartnett @BrianCurtis-NOAA
FindNetCDF.cmake sets a variable NetCDF_PARALLEL to ON or OFF depending on whether it finds parallel netCDF enabled in the NetCDF library. You can use that.
I believe replacing the above line with:
target_link_libraries(${lib_name} PUBLIC NetCDF::NetCDF_Fortran)
target_link_libraries(${lib_name} PUBLIC MPI::MPI_Fortran)
will solve your MPI issues.
I spent some time looking at this and have several suggestions regarding changes to the build that will make it robust. I can either do that before this PR, in this PR or after this PR, depending on the time it will take to finish work on this PR. Please let me know.
I'm happy to merge in your work, Rahul.
Good news! The MPI test passes on Orion. It was just me.
@aerorahul no doubt your build suggestions will have wide applicability in NCEPLIBS as we add testing in other places, so I look forward to seeing them and learning from them...
I get the following error on my local machine, 4 times:
-114 NetCDF: Parallel operation on file opened for non-parallel access
[garbled binary dump from the netCDF file omitted; it ends with a reference to /home/bcurtis/git/BrianCurtis-NOAA/NCEPLIBS-ncio/src/module_ncio.f90]
STOP 99
Add a line in cmake/PackageConfig.cmake.in:
find_dependency(MPI REQUIRED)
Should we make NetCDF REQUIRED and not specific to Fortran? Or should the MPI call be specific to Fortran?
REQUIRED = Yes.
If you don't specify Fortran, you won't get the Fortran NetCDF API.
MPI's find_package works differently, and you don't need COMPONENTS Fortran for MPI.
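Putting that together, the relevant find_package calls would look something like this (a sketch; the NetCDF target names are those defined by the FindNetCDF.cmake discussed above):

```cmake
find_package(MPI REQUIRED)                        # no COMPONENTS Fortran needed
find_package(NetCDF REQUIRED COMPONENTS Fortran)  # required for NetCDF::NetCDF_Fortran
```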
The MPI test hangs on my macOS with gfortran-10 and mpich.
2: Test command: /opt/modules/clang-12.0.5/mpich/3.3.2/bin/mpiexec "-n" "4" "/Users/rmahajan/scratch/nceplibsWork/nceplibs-ncio/build-2/tests/tst_ncio_mpi"
2: Test timeout computed to be: 60
2: *** Testing NCEPLIBS-ncio with MPI.
2: *** Test creation of new dataset from template...
2: *** Test that number of variables,dimensions,attributes is read...
2: *** Test read of variable data...
2: *** Test write of variable data...
2/2 Test #2: tst_ncio_mpi .....................***Timeout 60.15 sec
@CoryMartin-NOAA if I build the hpc-stack netcdf on my local machine, I can get the MPI tests to pass. It does not pass CI; I'm wondering what the cause may be, but I'm running out of ideas.
@BrianCurtis-NOAA The NetCDF library in the github CI does not have parallel enabled:
-- Found NetCDF: /usr/include (found version "4.7.3") found components: Fortran
-- FindNetCDF defines targets:
-- - NetCDF_VERSION [4.7.3]
-- - NetCDF_PARALLEL [FALSE]
-- - NetCDF_C_CONFIG_EXECUTABLE [/bin/nc-config]
-- - NetCDF::NetCDF_C [SHARED] [Root: /usr] Lib: /usr/lib/x86_64-linux-gnu/libnetcdf.so
-- - NetCDF_Fortran_CONFIG_EXECUTABLE [/bin/nf-config]
-- - NetCDF::NetCDF_Fortran [SHARED] [Root: /usr] Lib: /usr/lib/x86_64-linux-gnu/libnetcdff.so
I am not sure how to do that with apt-get and brew.
@edwardhartnett Do you know how to get the netCDF library in the GitHub CI to use parallel?
@edwardhartnett is on holiday until early August.
NetCDF/HDF5 needs to be built manually, instead of using apt.
We build it manually in the UFS_UTILS CI here: https://github.com/NOAA-EMC/UFS_UTILS/blob/develop/.github/workflows/netcdf-versions.yml
You could copy and paste that here and remove the NetCDF matrix.
I am happy to do that, but I haven't messed with the GitHub CI before. Let me know if you would rather do it.
I can do it real fast
Fails on macOS. I don't have macOS to try it on. @aerorahul you did mention it times out with MPI on your own machine. Any ideas on what we could try?
The MPI test also times out in the macOS action, not just my machine.
Interestingly, it times out for me on macOS using Intel/OpenMPI too. Not just GNU/MPICH.
It is something in write_vardata and specifically in nf90_put_var. That's where I last stopped in the debugger.
My guess is there is a process doing nothing, the writes are generally MPI barriers so it will hang if at least one of the PEs has not written to the file.
Try simplifying the test program by commenting out everything it does and see if it passes on macOS.
Then, step by step, uncomment the code, so that at first you are just creating/closing the file. Then, one step at a time, start uncommenting the lines that do things to the netCDF file, until you find the exact line of code that is causing the problem.
As Cory notes, all processing elements must call the write routine, because this is a collective action. @BrianCurtis-NOAA you need to figure out which PE is not calling the write code.
Another way to proceed is to look in the ncio code and just copy the netCDF calls directly into your test code, to cut ncio out of the loop. Once you can get the netCDF code to work, you can then see what is different between the netCDF code and the ncio code.
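As a reference point for that comparison, here is a minimal self-contained sketch of a collective parallel write with bare netCDF calls (the file name demo.nc and the one-value-per-rank decomposition are made up for illustration):

```fortran
program collective_write
  use mpi
  use netcdf
  implicit none
  integer :: ierr, rank, nprocs, ncid, dimid, varid
  real :: val(1)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Parallel create: passing comm/info selects the parallel I/O path.
  call check(nf90_create('demo.nc', NF90_NETCDF4, ncid, &
                         comm=MPI_COMM_WORLD, info=MPI_INFO_NULL))
  call check(nf90_def_dim(ncid, 'x', nprocs, dimid))
  call check(nf90_def_var(ncid, 'v', NF90_REAL, (/dimid/), varid))
  call check(nf90_enddef(ncid))

  ! Collective access: EVERY rank must reach this nf90_put_var,
  ! or the ranks that did call it hang waiting for the missing one.
  call check(nf90_var_par_access(ncid, varid, nf90_collective))
  val(1) = real(rank)
  call check(nf90_put_var(ncid, varid, val, start=(/rank+1/), count=(/1/)))
  call check(nf90_close(ncid))
  call MPI_Finalize(ierr)
contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      stop 99
    endif
  end subroutine check
end program collective_write
```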
Is there a way for me to test this on macOS other than making code changes in the repo and letting the GitHub Actions test it?
@BrianCurtis-NOAA checking in changes and seeing what GitHub Actions does is how I work on macOS. It's not perfect, but it gets the job done.
@CoryMartin-NOAA It looks as if write_vardata works on the first call, but when that returns and it moves on to another write_vardata, it doesn't like something. You mentioned a process doing nothing; would that still hold true here? I'm still confused why this is specific to macOS and not GNU.
@aerorahul Did you have a fork/repo for the cmake changes you gave me a while back? I don't wish to duplicate work you've already done.
@edwardhartnett what does error 9 Bad file descriptor mean?
2: Test timeout computed to be: 60
2: *** Testing NCEPLIBS-ncio with MPI.
2: *** Testing funtion open_dataset with paropen=.true.
2: 9 Bad file descriptor
Do I need to write my own file to read from? I was using a copy of the one used for the tst_ncio.F90
EDIT: This is only part of the output from the failed test.
integer, parameter :: YT = 128, XT = 256
integer, parameter :: HALF_YT = YT/2, HALF_XT = XT/2
...
! float pressfc(time, grid_yt, grid_xt)
! dimensions:
! grid_xt = 256 ;
! grid_yt = 128 ;
! pfull = 64 ;
! phalf = 65 ;
! time = UNLIMITED ; // (1 currently)
! member = 2 ;
count = (/ 1, HALF_YT, HALF_XT /)
if (my_rank .eq. 0) then
start = (/ 1, 1, 1 /)
else if (my_rank .eq. 1) then
start = (/ 1, HALF_YT + 1, 1 /)
else if (my_rank .eq. 2) then
start = (/ 1, 1, HALF_XT + 1 /)
else if (my_rank .eq. 3) then
start = (/ 1, HALF_YT + 1, HALF_XT + 1 /)
endif
call read_vardata(dsetin, 'pressfc', values_3d, ncstart=start, nccount=count, errcode=errcode)
call check(errcode)
Running test
2/2 Test #2: tst_ncio_mpi .....................***Failed 2.25 sec
*** Testing NCEPLIBS-ncio with MPI.
*** Testing function open_dataset with paropen=.true.
-57 NetCDF: Start+count exceeds dimension bound
STOP 99
-40 NetCDF: Index exceeds dimension bound
STOP 99
-40 NetCDF: Index exceeds dimension bound
STOP 99
Error: NetCDF: Index exceeds dimension bound
*** Test that number of variables,dimensions,attributes is read...
*** Test read of variable data...
Error: NetCDF: Start+count exceeds dimension bound
Error: NetCDF: Index exceeds dimension bound
Error: NetCDF: Start+count exceeds dimension bound
-57 NetCDF: Start+count exceeds dimension bound
STOP 99
@edwardhartnett not sure where my count is off here. I can push my changes to the test if you want to look at the whole file.
@edwardhartnett I think I've found what was backwards:
Is ncdump a bad tool, or is read_vardata doing it backwards?
@BrianCurtis-NOAA ncdump is C-style, Fortran is the opposite (column/row ordering)
Thanks! I had no idea that ncdump was C-based.
Yes, ncdump is C-based. Full documentation can be found here: https://www.unidata.ucar.edu/software/netcdf/documentation/NUG/netcdf_utilities_guide.html#ncdump_guide
I always forget the C/Fortran ordering issue...
As the error message says, index plus count is too big. Put all your starts/counts in Fortran order.
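Concretely, for the pressfc decomposition shown earlier, reversing the ncdump order would look like this (a sketch using the test's HALF_XT/HALF_YT constants):

```fortran
! ncdump: float pressfc(time, grid_yt, grid_xt)   <- C order, slowest-varying first.
! Fortran order is the reverse: (grid_xt, grid_yt, time).
count = (/ HALF_XT, HALF_YT, 1 /)            ! not (/ 1, HALF_YT, HALF_XT /)
if (my_rank .eq. 0) then
  start = (/ 1, 1, 1 /)
else if (my_rank .eq. 1) then
  start = (/ 1, HALF_YT + 1, 1 /)            ! offset along grid_yt
else if (my_rank .eq. 2) then
  start = (/ HALF_XT + 1, 1, 1 /)            ! offset along grid_xt
else if (my_rank .eq. 3) then
  start = (/ HALF_XT + 1, HALF_YT + 1, 1 /)
endif
```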
Adds an MPI version of tst_ncio
Closes #44