Open tpg2114 opened 2 years ago
Oh, the other useful bit of info -- this is on version 21.11, because later versions broke my build system (they were unable to find HDF5). It looks like the HDF5 routines have been heavily modified since 21.11, so maybe I need to figure out how to unbreak my build with later versions...
Thank you for reporting this. @houjun, do you have any ideas?
One thing that might help is to try to trigger this using amrex/Tests/HDF5Benchmark. If you change the distribution map in that test so that 1 process has no boxes, can you trigger the error there?
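For example (an untested sketch, assuming the benchmark's boxes are in a BoxArray called ba), you could build the DistributionMapping from an explicit processor map that skips rank 0:

// Sketch only: force rank 0 to own no boxes by assigning every box to
// ranks 1..nprocs-1 and building the DistributionMapping from that map.
// Assumes the test is run on more than one rank.
amrex::Vector<int> pmap(ba.size());
const int nprocs = amrex::ParallelDescriptor::NProcs();
for (int i = 0, n = static_cast<int>(ba.size()); i < n; ++i) {
    pmap[i] = 1 + (i % (nprocs - 1));   // never assign a box to rank 0
}
amrex::DistributionMapping dm(pmap);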
Thanks for the tip @atmyers -- let me see if I can figure out how to do that and get back to you.
In the meantime, I changed the data I was writing to level_number*10000 + rank*100 + comp so I could see what, exactly, was overwriting what. I ran with 36 processors. The first box that should have been written was on rank 3 (ranks 0, 1, and 2 did not have boxes), but what ended up in that position was data from rank 35.
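Roughly what I did to tag the data, as a sketch (not my exact code):

// Fill each component of a MultiFab with a provenance tag so misplaced
// data can be traced back to its level, owning rank, and component.
#include <AMReX_MultiFab.H>
#include <AMReX_ParallelDescriptor.H>

void tag_for_debug (amrex::MultiFab& mf, int level)
{
    const int rank = amrex::ParallelDescriptor::MyProc();
    for (int comp = 0; comp < mf.nComp(); ++comp) {
        mf.setVal(static_cast<amrex::Real>(level*10000 + rank*100 + comp), comp, 1);
    }
}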
I started playing with printing out the offsets, write sizes, and sorted boxes inside AMReX but haven't gotten too far into that yet.
Hi @tpg2114, it seems like there is a bug with the HDF5 dataspace selection for the write. Do you have reproducer code that I can run and debug?
@houjun I don't at the moment. I'm going to see if I can get the amrex/Tests/HDF5Benchmark to do it. Rebuilding things with testing turned on now to try it out; I'll let you know if I can replicate it there.
Okay, @houjun and @atmyers, I think I have an example that breaks. I hacked apart the HDF5Benchmark case to make this happen; here are the main.cpp I modified and the input file, both converted to text files. I ran with 36 processors to make sure that not every processor had boxes. In VisIt, I can see:
Level 1 (screenshot)
Level 2 (screenshot)
where it is clear the value in the upper-back corner is not being written correctly -- it should be 2, not 0.
Here is the Boxlib3D plot file written at the same time; it is correct:
And here are the possibly-relevant configuration options:
-- The C compiler identification is IntelLLVM 2021.4.0
-- The CXX compiler identification is IntelLLVM 2021.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/intel/oneapi/compiler/2021.4.0/linux/bin/icx - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2021.4.0/linux/bin/icpx - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- CMake version: 3.21.3
-- AMReX installation directory: <redacted>
-- Building AMReX with AMReX_SPACEDIM = 3
-- Configuring AMReX with the following options enabled:
-- AMReX_FORTRAN
-- AMReX_PRECISION = DOUBLE
-- AMReX_MPI
-- AMReX_AMRLEVEL
-- AMReX_FORTRAN_INTERFACES
-- AMReX_LINEAR_SOLVERS
-- AMReX_PARTICLES
-- AMReX_PARTICLES_PRECISION = DOUBLE
-- AMReX_HDF5
-- AMReX_PROBINIT
-- The Fortran compiler identification is Intel 2021.4.0.20210910
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /opt/intel/oneapi/compiler/2021.4.0/linux/bin/intel64/ifort - skipped
-- Checking whether /opt/intel/oneapi/compiler/2021.4.0/linux/bin/intel64/ifort supports Fortran 90
-- Checking whether /opt/intel/oneapi/compiler/2021.4.0/linux/bin/intel64/ifort supports Fortran 90 - yes
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found MPI_C: /opt/mpi/OpenMPI4.0.3_intel2021.4/lib64/libmpi.so (found version "3.1")
-- Found MPI_CXX: /opt/mpi/OpenMPI4.0.3_intel2021.4/lib64/libmpi.so (found version "3.1")
-- Found MPI_Fortran: /opt/mpi/OpenMPI4.0.3_intel2021.4/lib64/libmpi_usempi_ignore_tkr.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1") found components: C CXX Fortran
-- Detecting Fortran/C Interface
-- Detecting Fortran/C Interface - Found GLOBAL and MODULE mangling
-- Fortran name mangling scheme: UNDERSCORE (lower case, append underscore)
@tpg2114 I'm able to write the data and generate the same plot with your modified code, and will start debugging.
@houjun Great -- I think I see what is happening, but I don't understand enough about the code to know why yet. I went back to my original simulation and printed out the rank, the number of boxes on that rank, ch_offset, and hs_procsize for the level that was broken. Without tallying it all up, here is what happens:
Level Rank Nboxes ch_offset hs_procsize
AM: 1 0 0 0 0
AM: 1 1 0 3584 0
AM: 1 2 0 7168 0
AM: 1 3 1 10752 3584
AM: 1 4 1 14336 3584
AM: 1 5 1 17920 3584
AM: 1 6 1 21504 3584
AM: 1 7 1 28672 3584
AM: 1 8 1 35840 3584
AM: 1 9 1 43008 7168
AM: 1 10 2 50176 7168
AM: 1 11 1 57344 7168
AM: 1 12 1 64512 7168
...
AM: 1 32 1 193536 3584
AM: 1 33 1 0 7168
AM: 1 34 1 0 7168
AM: 1 35 1 0 7168
So when it calculates the offsets for each processor, it isn't accounting for ranks that have no boxes: those ranks should add nothing to the running total, so the offsets assigned to later processors are incorrect. I think the value of 0 for the offset in ranks 33-35 shows up due to a buffer overrun, and that might explain why some of those other core counts had errors or hangs -- they probably got something non-zero-but-garbage in that value.
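In other words, each rank's write offset should just be an exclusive prefix sum of the per-rank sizes, so a rank with no boxes adds nothing to the tally. Illustrative sketch (not the actual AMReX code):

#include <cstdint>
#include <vector>

std::vector<std::uint64_t>
compute_offsets (const std::vector<std::uint64_t>& procsize)
{
    std::vector<std::uint64_t> offsets(procsize.size(), 0);
    std::uint64_t running = 0;
    for (std::size_t rank = 0; rank < procsize.size(); ++rank) {
        offsets[rank] = running;   // a rank with procsize == 0 leaves the tally unchanged
        running += procsize[rank];
    }
    return offsets;
}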
@tpg2114 I believe I found the problem: it is due to the wrong offset/size-to-rank assignment when there are ranks with no boxes. Here is a patch with the fix; could you please apply it to amrex and run your application code? It works for the HDF5Benchmark code with 36 ranks. It should also fix the hanging issue you mentioned earlier. patch
@houjun Some good news and some bad news -- the patch does fix the writing issue and I can now get the expected images from my dataset! It does not, however, fix the hanging with 31 processors for my code. It hangs in the call to:
ret = H5Dwrite(dataset, H5T_NATIVE_DOUBLE, memdataspace, dataspace, dxpl_col, a_buffer.dataPtr());
This happens on a level where all processors have at least one box, so it might have a different root cause. It is reaching the call with all processors, but hangs inside it -- so it doesn't look like an issue of a processor not making it to the call.
@tpg2114 hmm... Any chance you can reproduce it with the HDF5Benchmark code? Here are two things to try:
I'll try to reproduce it -- I did add a std::cout right before the call, and it shows all 31 processors hitting that call and none leaving it, for level 2 (second from the finest level). But I'll see if I can get the HDF5Benchmark to reproduce it. It will likely be tomorrow before I can work on it more, though.
Switching to dxpl_ind with 31 processors in my code does make it through that call, and the file looks to be written correctly.
I am not well-versed in MPI-IO or parallel HDF5 (this is my first adventure in both) -- is independent vs. collective write just a performance difference, or is there a data-integrity implication or something else?
If everything else is the same, independent and collective write should only be a performance difference. Since it works fine with independent but not collective, there could be an issue either in my code or in HDF5. I see you are using HDF5 1.12.1; do you have any other HDF5 versions (e.g. 1.10.x) installed that you can test with?
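For reference, dxpl_col and dxpl_ind differ only in the MPI-IO transfer mode; they are set up roughly like this (a sketch of the relevant HDF5 calls, not the exact writer code):

// Fragment only; requires a parallel HDF5 build.
#include <hdf5.h>

hid_t dxpl_col = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl_col, H5FD_MPIO_COLLECTIVE);    // all ranks must enter H5Dwrite together
hid_t dxpl_ind = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl_ind, H5FD_MPIO_INDEPENDENT);   // each rank writes on its own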
I don't at the moment -- our code uses a CMake superbuild to assemble all the dependencies and the main code, and finding HDF5 is always a horrible pain, so we're using the CMake-ified HDF5 to make it easier. Unfortunately, that means I have to patch AMReX's CMake to change the HDF5 version it looks for (at least with AMReX 21.11), and newer versions of AMReX changed how they find HDF5 in CMake, which in turn broke my build chain.
Anyway, long story short, I don't have an easy build environment with HDF5 1.10.* set up. But since there is no data-integrity issue with the independent write, I'll get my code into a working state so I can burn up some hours on the HPC systems over the holiday.
Once I get that done, I will try to set up a cleaner environment for testing and debugging than I have now. That will take me a little time, but maybe Wednesday or Thursday I can try it with a different HDF5 version.
It looks like the CMake HDF5 for versions newer than 1.10.3 might work correctly; I'll try 1.10.8 when I get a chance tomorrow.
Sounds good, please let me know what you find with 1.10.8.
Hi @houjun -- Collective IO with HDF5 1.10.8 also hangs on the 31 processor test with my code. Independent IO has no problem and writes the file correctly then moves into the solver iterations.
I can't seem to reproduce it with the HDF5Benchmark test, though.
@tpg2114 could you print the following values before the hanging H5Dwrite: myProc, level, ch_offset[0], hs_procsize[0], hs_allprocsize[0]
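Something like this right before the call should do it (assuming those variable names are in scope at that point in the writer; needs <iostream>):

std::cout << "H5DWrite: " << myProc << " " << level << " "
          << ch_offset[0] << " " << hs_procsize[0] << " "
          << hs_allprocsize[0] << std::endl;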
@houjun Sure, here it is:
H5DWrite: 0 2 0 46080 1428480
H5DWrite: 11 2 493056 27648 1428480
H5DWrite: 12 2 520704 27648 1428480
H5DWrite: 20 2 921600 36864 1428480
H5DWrite: 8 2 368640 41472 1428480
H5DWrite: 9 2 410112 41472 1428480
H5DWrite: 10 2 451584 41472 1428480
H5DWrite: 16 2 728064 41472 1428480
H5DWrite: 19 2 880128 41472 1428480
H5DWrite: 26 2 1198080 46080 1428480
H5DWrite: 30 2 1382400 46080 1428480
H5DWrite: 28 2 1290240 46080 1428480
H5DWrite: 4 2 184320 46080 1428480
H5DWrite: 2 2 92160 46080 1428480
H5DWrite: 6 2 276480 46080 1428480
H5DWrite: 18 2 824832 55296 1428480
H5DWrite: 22 2 1013760 46080 1428480
H5DWrite: 24 2 1105920 46080 1428480
H5DWrite: 29 2 1336320 46080 1428480
H5DWrite: 1 2 46080 46080 1428480
H5DWrite: 3 2 138240 46080 1428480
H5DWrite: 5 2 230400 46080 1428480
H5DWrite: 7 2 322560 46080 1428480
H5DWrite: 13 2 548352 46080 1428480
H5DWrite: 17 2 769536 55296 1428480
H5DWrite: 23 2 1059840 46080 1428480
H5DWrite: 25 2 1152000 46080 1428480
H5DWrite: 27 2 1244160 46080 1428480
H5DWrite: 14 2 594432 69120 1428480
H5DWrite: 21 2 958464 55296 1428480
H5DWrite: 15 2 663552 64512 1428480
If I were to call amrex_print(vars(level)%ba), and likewise for the distromap, and gave those to you, could they be read into the benchmark test to maybe reproduce the issue there?
@tpg2114 Thanks! These should be enough to reproduce the issue since everything seems to be fine until H5Dwrite. I have created a simple HDF5 program that uses the offset and size you provided. I ran it with 31 ranks and it works with both dxpl_col and dxpl_ind, so I'm still not sure why you got the hang. Could you try running this code on the machine where you ran your application? code
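For reference, the reproducer is essentially of this shape -- a trimmed sketch with placeholder offsets and sizes; the attached code uses the actual ch_offset/hs_procsize values you printed:

#include <hdf5.h>
#include <mpi.h>
#include <vector>

int main (int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Placeholder 1-D decomposition; the real test plugs in the offsets
    // and sizes from the application output above.
    hsize_t my_size   = 1000;
    hsize_t my_offset = static_cast<hsize_t>(rank) * my_size;
    hsize_t total     = static_cast<hsize_t>(nprocs) * my_size;

    // Open the file for parallel access.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("repro.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    // One shared 1-D dataset; each rank selects its own hyperslab.
    hid_t filespace = H5Screate_simple(1, &total, nullptr);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t memspace = H5Screate_simple(1, &my_size, nullptr);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &my_offset, nullptr,
                        &my_size, nullptr);

    // Collective transfer -- the mode that hangs in your runs.
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    std::vector<double> buf(my_size, static_cast<double>(rank));
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf.data());

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}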
@houjun This MWE does in fact hang on my machine with 31 processors using both HDF5 1.12.1 and HDF5 1.10.8.
I will try this on a few different HPC machines and see if I can identify whether it is compiler-related or MPI-related. I am using the new-style LLVM-based Intel compilers (2021.4) with OpenMPI 4.0.3, with GNU 11.2.1 providing Intel with the standard libraries. I think I have access to enough other combinations of things on HPC that I can isolate each of those.
I appreciate the help!
Okay -- so I've run it on several machines now and it appears the problem is in OpenMPI.
I've built and run with MPICH on my machine with the same compilers as the case that hangs, and it works just fine.
I guess I'm going to replace OpenMPI in our builds when MPI isn't already installed. I'm happy to help debug more if there are possible environment variable settings or mpirun options for OpenMPI that might shed some light on it.
@tpg2114 Hearing that OpenMPI is the cause of the hanging issue reminds me that I had some issues with OpenMPI using the ROMIO driver. Could you try running the program with the ompio driver and see if it still hangs: mpirun --mca io ompio -np 31
@houjun It does hang with mpirun --mca io ompio -np 31, and looking at the documentation, the ompio driver is the default for my build. So I tried it with mpirun --mca io romio321 -np 31, which is the only other IO MCA component my build has. That does not hang.
So the problem seems to be in the OMPIO driver specifically, whereas the ROMIO driver works correctly.
Further testing reveals that it is the fcoll MCA component that is the problem. With the OMPIO driver, it defaults to the vulcan fcoll component. When I run the OMPIO driver with --mca fcoll dynamic_gen2, --mca fcoll individual, or --mca fcoll two_phase, it works fine and doesn't hang. If I run with --mca fcoll vulcan, it hangs.
And with those search terms in hand, it looks like this is a known bug in OpenMPI:
https://www.mail-archive.com/users@lists.open-mpi.org/msg33511.html
It was marked as fixed in 4.0.3, but it doesn't seem to be fixed for me.
OK, I probably remembered it wrong and it was fine to use romio but not ompio. I'll open an issue in the HDF5 GitHub repo and see if they know anything.
I am having an odd issue that I can't seem to track down. I have incorrect data being written to HDF5 plot files. It happens once I exceed certain (case-specific) core counts. It looks like it happens when not all processors have boxes on a given level -- in other words, if any rank does not have a box on a given level, that level is written incorrectly to the HDF5 file.
While debugging, I changed my code to write both the Boxlib3D and HDF5 plot files so I could compare the two formats back to back. Attached are the two images I get with VisIt:
Boxlib3D Plot file
HDF5 Plot file
Doing some more digging, the problem in the HDF5 data writes seems to only be on Level 1 of this file; Levels 0, 2, and 3 are all okay. Looking in the file, the first box in the level_1/boxes field is (48, 0, 0) (55, 7, 7), which gives a block size of 8 x 8 x 8 = 512 cells. With 7 components, that gives a write size of 512 x 7 = 3584, which matches the offsets in level_1/data:offsets=0. However, if I look at my data in level_1/data:datatype=0 between 0 and 3583, it looks like it is writing out 1024 entries for each of my components and not 512. There are other boxes with a block size of 1024, however, so it seems like it is mixing up which box it is writing somehow. The multifabs themselves seem to be fine; otherwise the Boxlib3D format of the plot file would not be correct.
While playing with this some more while writing this issue, it seems like there may be other issues at play. For this problem, if I run with 30 processors, it writes the HDF5 file incorrectly but successfully. If I run with 31 processors, the call to WriteMultiLevelPlotfileHDF5 hangs during an MPI call. If I run with 32 processors, I get HDF5 errors on a single processor, but if I run with 33 processors, the file writes incorrectly but successfully.
Any advice on how to debug this would be helpful!