NCAR / ParallelIO

A high-level Parallel I/O Library for structured grid applications
Apache License 2.0

Clean import of code for GDAL in ESMF. Compiles/runs in ESMF #1981

Open msl3v opened 7 months ago

msl3v commented 7 months ago

Need to diagnose parallelization issues.

jedwards4b commented 7 months ago

Also, you don't need to create a new PR with additional changes; just push them back to the same branch. How is the GDAL library available? Can it be obtained with apt-get? I'm just trying to figure out how to add it to the GitHub testing workflow.

msl3v commented 7 months ago

GDAL is available in several forms. It is available in my Fedora distro via dnf as an RPM. The source is also available, and is not complex to compile. We have it on the NCCS machines here at NASA.

msl3v commented 7 months ago

@jedwards4b, the ESMF test results show that for a file with 4 points, all 4 points are being read correctly on io_root. The data being read are [1 2 3 4]. If I run on 4 procs, the local buffers are [1] [1] [1] [1]. If I run on 2 procs, the local buffers are [1 2] [1 2]. So it looks like MPI is sending/receiving, but the indexing isn't correct.
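
(For context, a minimal sketch of the decomposition that would be expected here, assuming a plain 1-D layout with one element per task and the 1-based map convention of PIOc_InitDecomp(); the function and variable names below are placeholders, not code from the ESMF test.)

```c
/* Sketch only: a 1-D decomposition of 4 global elements over 4 tasks,
 * one element per task. If the map is built correctly, a read of
 * [1 2 3 4] should land as [1] [2] [3] [4] on the four tasks. */
#include <pio.h>

static int init_one_per_task_decomp(int iosysid, int my_rank, int *ioid)
{
    int gdimlen[1] = {4};      /* global length of the variable */
    PIO_Offset compmap[1];     /* this task owns exactly one element */

    /* PIOc_InitDecomp() takes a 1-based map (0 marks an unused entry),
     * so task 0 claims global element 1, task 1 claims element 2, etc.
     * If every task's compmap[0] were 1, each local buffer would come
     * back as [1], matching the symptom described above. */
    compmap[0] = my_rank + 1;

    return PIOc_InitDecomp(iosysid, PIO_INT, 1, gdimlen, 1, compmap,
                           ioid, NULL, NULL, NULL);
}
```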

jedwards4b commented 7 months ago

If you could provide your test program, I'll give it a try. Please remove the references to ESMF; let's take that library out of the equation for now.

msl3v commented 7 months ago

ESMF does all the calling into PIO, and it would be a complex task (for me) to uncouple them. ESMF starts in Fortran, then calls into C bindings, sets up arrays, etc., and then calls PIO. I'll see if I can isolate the PIO-specific pieces, but we may have to iterate.

msl3v commented 7 months ago

Is there an existing test in PIO that I could modify for this?

jedwards4b commented 7 months ago

Yes, look in the test directories, specifically tests/cunit, for a number of possible tests. test_darray.c might be a good one to try.
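
(For anyone following along, a stripped-down outline of the write/read darray round trip that test_darray.c exercises, using the serial netCDF iotype. The file, dimension, and variable names are invented for illustration, the decomposition is a guess at a simple one-element-per-task layout, and error checking is omitted.)

```c
/* Sketch of the basic darray round trip exercised by
 * tests/cunit/test_darray.c, trimmed down for illustration. */
#include <mpi.h>
#include <pio.h>

int main(int argc, char **argv)
{
    int my_rank, ntasks, iosysid, ioid, ncid, dimid, varid;
    int iotype = PIO_IOTYPE_NETCDF;   /* serial netCDF flavor */
    int gdimlen[1];
    PIO_Offset compmap[1];
    int wbuf[1], rbuf[1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* One I/O task, box rearranger. */
    PIOc_Init_Intracomm(MPI_COMM_WORLD, 1, 1, 0, PIO_REARR_BOX, &iosysid);

    /* Each task owns one global element (1-based map). */
    gdimlen[0] = ntasks;
    compmap[0] = my_rank + 1;
    PIOc_InitDecomp(iosysid, PIO_INT, 1, gdimlen, 1, compmap,
                    &ioid, NULL, NULL, NULL);

    /* Write a known pattern, then read it back through the rearranger. */
    PIOc_createfile(iosysid, &ncid, &iotype, "sketch_darray.nc", PIO_CLOBBER);
    PIOc_def_dim(ncid, "x", (PIO_Offset)ntasks, &dimid);
    PIOc_def_var(ncid, "data", PIO_INT, 1, &dimid, &varid);
    PIOc_enddef(ncid);

    wbuf[0] = my_rank + 1;            /* file should end up as [1 2 3 4] */
    PIOc_write_darray(ncid, varid, ioid, 1, wbuf, NULL);
    PIOc_closefile(ncid);

    PIOc_openfile(iosysid, &ncid, &iotype, "sketch_darray.nc", PIO_NOWRITE);
    PIOc_inq_varid(ncid, "data", &varid);
    PIOc_read_darray(ncid, varid, ioid, 1, rbuf);
    /* Expect rbuf[0] == my_rank + 1 on every task. */
    PIOc_closefile(ncid);

    PIOc_freedecomp(iosysid, ioid);
    PIOc_finalize(iosysid);
    MPI_Finalize();
    return 0;
}
```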

msl3v commented 7 months ago

I've added test_gdal.c in the most recent commit to github.com/msl3v/ParallelIO:msl3v/GDAL_clean. It segfaults in pio_swapm(), and I'm not able to diagnose why. If you have any insight, it would be appreciated. Thank you.

jedwards4b commented 7 months ago

Great, thanks. I have a day full of meetings today, but I'll try to get to this as soon as I can.

jedwards4b commented 7 months ago

What version of GDAL is required? My system has gdal/3.7.1 built with gcc; can I test with that?

msl3v commented 7 months ago

3.7.1 works. Here's my cmake command:

env CC=mpicc FC=mpifort CFLAGS="-g -fPIC" cmake \
    -DCMAKE_INSTALL_PREFIX=../Install -Wno-dev \
    -DPIO_ENABLE_TIMING=OFF -DPIO_ENABLE_EXAMPLES=OFF \
    -DPIO_ENABLE_FORTRAN=OFF -DPIO_ENABLE_LOGGING=ON \
    -DCMAKE_BUILD_TYPE=DEBUG \
    -DNetCDF_C_INCLUDE_DIR=/usr/lib64/gfortran/modules \
    -DNetCDF_C_LIBRARY=/usr/lib64/libnetcdf.so \
    -DWITH_PNETCDF=off \
    -DGDAL_INCLUDE_DIR=/include -DGDAL_LIBRARY=/libgdal.so \
    ../ParallelIO

jedwards4b commented 7 months ago

The first problem is that your test doesn't run from the source directory, so it can't find the file data/cb_2018_us_region_20m.shp in the relative path. I'll look into how to do this in the build, but for now you can create a link from tests/cunit/data to build/tests/cunit/data.

msl3v commented 7 months ago

Interesting. It opens, queries, and reads data from the file in my local version. I'm curious what's different.

msl3v commented 7 months ago

Is it maybe that I'm running the test directly from tests/cunit? I'll create the link and push the change.

I don't see a build dir after cmake, make install, make tests. Is 'build' in this case simply the path to src?

jedwards4b commented 7 months ago

Looks like you found it - that's great!

msl3v commented 7 months ago

OK. The issue with the segfault was that the recvbuf was NULL. It was fixed with a simple line in test_gdal.c. I'll keep banging on it to see what the recvbuf data look like.
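
(A hypothetical illustration of the kind of one-line fix described here, assuming the NULL pointer in question was the compute-side buffer passed to PIOc_read_darray(); arraylen and databuf are placeholder names, not the test's own.)

```c
/* The buffer handed to PIOc_read_darray() has to be allocated on every
 * compute task before the call; if it is NULL, the rearranger can end
 * up receiving into a NULL pointer inside pio_swapm() and segfault. */
int *databuf = calloc(arraylen, sizeof(int));   /* was NULL before the fix */
if (!databuf)
    return PIO_ENOMEM;
PIOc_read_darray(ncid, varid, ioid, arraylen, databuf);
```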

msl3v commented 7 months ago

@jedwards4b In my pio_log output, for 4 data points on 4 MPI tasks, I get the following output:

0 sendcounts[0] = 1 sdispls[0] = 0 sendtypes[0] = 27742288 recvcounts[0] = 1 rdispls[0] = 0 recvtypes[0] = 27745072
0 sendcounts[1] = 1 sdispls[1] = 0 sendtypes[1] = 27743024 recvcounts[1] = 0 rdispls[1] = 0 recvtypes[1] = 2085777792
0 sendcounts[2] = 1 sdispls[2] = 0 sendtypes[2] = 27743696 recvcounts[2] = 0 rdispls[2] = 0 recvtypes[2] = 2085777792
0 sendcounts[3] = 1 sdispls[3] = 0 sendtypes[3] = 27744368 recvcounts[3] = 0 rdispls[3] = 0 recvtypes[3] = 2085777792

Is this telling me that only one data point is being received (line 1)?

(addendum: this is associated with the call to rearrange_io2comp())

(addendum 2: the data vector at the end of the test is [1 0 0 0]. The data vector read from file is [1 2 3 4])
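
(For reference, a generic sketch of how MPI_Alltoallw() interprets arrays like the ones logged above, assuming plain MPI_INT types rather than PIO's derived types: sendcounts[j] is how many items this rank sends to rank j, recvcounts[j] how many it receives from rank j, and the displacements are in bytes. The pattern logged on rank 0, one item sent to each rank and one item received, from rank 0 itself, matches an I/O root scattering data; what each compute task actually ends up with then hinges on the send types/displacements and on the other ranks' recvcounts. This is not PIO's code, just an illustration of the call's semantics.)

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* The serially "read" vector; only rank 0 (the I/O root) fills it. */
    int *filedata = malloc(ntasks * sizeof(int));
    for (int i = 0; i < ntasks; i++)
        filedata[i] = (rank == 0) ? i + 1 : 0;    /* [1 2 3 4] on rank 0 */

    int *sendcounts = calloc(ntasks, sizeof(int));
    int *recvcounts = calloc(ntasks, sizeof(int));
    int *sdispls = calloc(ntasks, sizeof(int));
    int *rdispls = calloc(ntasks, sizeof(int));
    MPI_Datatype *sendtypes = malloc(ntasks * sizeof(MPI_Datatype));
    MPI_Datatype *recvtypes = malloc(ntasks * sizeof(MPI_Datatype));

    for (int j = 0; j < ntasks; j++)
    {
        sendtypes[j] = MPI_INT;
        recvtypes[j] = MPI_INT;
        if (rank == 0)
        {
            sendcounts[j] = 1;                  /* root sends one int to rank j */
            sdispls[j] = j * (int)sizeof(int);  /* displacements are in bytes   */
        }
    }
    /* Every rank receives exactly one int, and it comes from rank 0. */
    recvcounts[0] = 1;

    int recvbuf = -1;
    MPI_Alltoallw(filedata, sendcounts, sdispls, sendtypes,
                  &recvbuf, recvcounts, rdispls, recvtypes, MPI_COMM_WORLD);

    printf("rank %d received %d\n", rank, recvbuf);  /* expect rank + 1 */

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(sendtypes); free(recvtypes); free(filedata);
    MPI_Finalize();
    return 0;
}
```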

msl3v commented 7 months ago

@jedwards4b I'm at a loss. I can keep trying, but it's just not clear what the problem might be. Is it the case that the data from the serially-read vector are broadcast via the MPI_Alltoallw() call in rearrange_io2comp()->pio_swapm()?

jedwards4b commented 7 months ago

Yes, that's correct. I can try to have another look at this tomorrow.

msl3v commented 7 months ago

Is it the case, then, that the first line in the output above ("0 sendcounts[0] = 1 sdispls[0] = 0 sendtypes[0] = 27742288 recvcounts[0] = 1 rdispls[0] = 0 recvtypes[0] = 27745072") is saying that only task 0 is set to receive any data, since all the rest of the receive params are zero?

jedwards4b commented 7 months ago

Yes, I will try to look at this again after lunch.

msl3v commented 7 months ago

No hurry. I'm just banging away at the code & posting what comes up.

msl3v commented 7 months ago

@jedwards4b Just an update: still no progress. It looks like the iodesc->map might not be getting set up correctly, so MPI doesn't know which element to send to which proc... maybe. But I still haven't worked out what is actually happening here. I'll keep trying.

jedwards4b commented 7 months ago

I think you should walk through the netCDF serial test step by step; it should be exactly the same as this one.

msl3v commented 7 months ago

@jedwards4b It's not clear to me which test is the netCDF serial test. I've modeled the GDAL test on test_darray.c. Is the serial test test_simple?

Also, in a totally vanilla version of ParallelIO, all of the darray tests and test_simple fail on my laptop, so maybe that is itself part of the problem. :-/

jedwards4b commented 7 months ago

That does seem like a problem. On my system all of the tests pass except the GDAL test. In test_darray.c there is a loop over num_flavors at line 110; one of those flavors is serial netCDF (PIO_IOTYPE_NETCDF). Your test should behave exactly the same as test_darray.c does with this flavor.
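
(A small sketch of what pinning that loop to the serial netCDF flavor might look like while debugging; the variable names mirror the loop described above but are illustrative, not copied from test_darray.c.)

```c
/* Restrict the flavor loop to PIO_IOTYPE_NETCDF only, so every run
 * exercises the same serial-netCDF path the GDAL test should match. */
int flavor[1] = {PIO_IOTYPE_NETCDF};
int num_flavors = 1;

for (int fmt = 0; fmt < num_flavors; fmt++)
{
    int iotype = flavor[fmt];
    printf("testing iotype %d\n", iotype);
    /* create/open the file with this iotype, then run the same
     * decomposition and PIOc_read_darray() checks as the GDAL test */
}
```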

msl3v commented 7 months ago

Switched to another system. All tests pass (from remotes/origin/main). It seems to be a local MPI problem on my laptop. Thanks for the pointer re test_darray.c. Onward.

msl3v commented 6 months ago

Yeah. It may simply be that my laptop's Open MPI was upgraded (to v4.1.9) and no longer works with ParallelIO. I switched to MPICH and things seem to be working, including the GDAL read. :-/