AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex

Incorrect plot files with HDF5 #2491

Open tpg2114 opened 2 years ago

tpg2114 commented 2 years ago

I am having an odd issue that I can't seem to track down. I have incorrect data being written to HDF5 plot files. It happens once I exceed certain (case-specific) core counts. It looks like it happens when not all processors have boxes on a given level -- in other words, if any rank does not have a box on a given level, that level is written incorrectly to the HDF5 file.

While debugging, I changed my code to do:

    amrex::WriteMultiLevelPlotfileHDF5(name, nlevs, mfarr, varnamearr, geomarr,
                                       time, lsarr, rrarr);
    amrex::WriteMultiLevelPlotfile(name, nlevs, mfarr, varnamearr, geomarr,
                                   time, lsarr, rrarr);

so I could compare the two types of plot files back to back. Attached are the two images I get with VisIt:

Boxlib3D plot file (image attachment)

HDF5 plot file (image attachment)

After some more digging, the problem with the HDF5 writes seems to be confined to Level 1 of this file; Levels 0, 2, and 3 are all okay. Looking in the file, the first box in the level_1/boxes field is (48, 0, 0) (55, 7, 7), which gives a block size of 512, and there are 7 components, which gives a write size of 3584, matching the offsets in level_1/data:offsets=0. However, if I look at my data in level_1/data:datatype=0 between 0 and 3583, it looks like it is writing out 1024 entries for each of my components instead of 512.
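For reference, the expected write size can be checked directly from the box; here is a minimal sketch using AMReX's Box class (ncomp = 7 is taken from the plot file described above, everything else is just the quoted numbers):

    // Sanity check of the expected write size for the first box on level 1.
    #include <AMReX.H>
    #include <AMReX_Box.H>
    #include <iostream>

    int main (int argc, char* argv[])
    {
        amrex::Initialize(argc, argv);
        {
            amrex::Box b(amrex::IntVect(48,0,0), amrex::IntVect(55,7,7));
            const long ncells = static_cast<long>(b.numPts()); // 8*8*8 = 512
            const int  ncomp  = 7;
            std::cout << "cells = " << ncells
                      << ", write size = " << ncells*ncomp << '\n'; // 3584
        }
        amrex::Finalize();
    }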

There are other boxes with a block size of 1024, however, so it seems like it is mixing up which box it is writing somehow. The MultiFabs themselves seem to be fine; otherwise the Boxlib3D format of the plot file would not be correct.


While experimenting with this some more as I wrote up this issue, it seems there may be other issues at play. For this problem, if I run with 30 processors, it writes the HDF5 file incorrectly but successfully. If I run with 31 processors, the call to WriteMultiLevelPlotfileHDF5 hangs during an MPI call. If I run with 32 processors, I get HDF5 errors on a single processor that seem to be caused by:

HDF5-DIAG: Error detected in HDF5 (1.12.1) MPI-process 30:
  #000: <redacted>/hdf5-1.12.1/src/H5Dio.c line 291 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: <redacted>/hdf5-1.12.1/src/H5VLcallback.c line 2113 in H5VL_dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #002: <redacted>/hdf5-1.12.1/src/H5VLcallback.c line 2080 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: <redacted>/H5VLnative_dataset.c line 200 in H5VL__native_dataset_write(): could not get a validated dataspace from file_space_id
    major: Invalid arguments to routine
    minor: Bad value
  #004: <redacted>/hdf5-1.12.1/src/H5S.c line 266 in H5S_get_validated_dataspace(): selection + offset not within extent
    major: Dataspace
    minor: Out of range

but if I run with 33 processors, the file writes incorrectly but successfully.

Any advice on how to debug this would be helpful!

tpg2114 commented 2 years ago

Oh, the other useful bit of info: this is on version 21.11, because later versions broke my build system (they were unable to find HDF5). It looks like the HDF5 routines were heavily modified since 21.11, so maybe I need to figure out how to unbreak my build with later versions...

atmyers commented 2 years ago

Thank you for reporting this. @houjun, do you have any ideas?

One thing that might help is to try to trigger this using amrex/Tests/HDF5Benchmark. If you change the distribution map in that test so that 1 process has no boxes, can you trigger the error there?
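One way to force an empty rank in that test (a sketch only; ba and the surrounding setup are assumed from the benchmark, not quoted from it) is to pass an explicit processor map that skips rank 0:

    // Build a DistributionMapping that leaves rank 0 with no boxes by
    // assigning every box to ranks 1..nprocs-1. "ba" is the test's BoxArray.
    // Needs AMReX_DistributionMapping.H and AMReX_ParallelDescriptor.H.
    const int nprocs = amrex::ParallelDescriptor::NProcs();
    const int nboxes = static_cast<int>(ba.size());
    amrex::Vector<int> pmap(nboxes);
    for (int i = 0; i < nboxes; ++i) {
        pmap[i] = 1 + (i % (nprocs - 1));   // requires nprocs > 1
    }
    amrex::DistributionMapping dm(pmap);    // rank 0 owns nothing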

tpg2114 commented 2 years ago

Thanks for the tip @atmyers -- let me see if I can figure out how to do that and get back to you.

In the meantime, I changed the data I was writing to level_number*10000 + rank*100 + comp so I could see what, exactly, was overwriting what. I ran with 36 processors. The first box that should have been written was on rank 3 (ranks 0, 1, and 2 did not have boxes), but what ended up in that position was data from rank 35.
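(For anyone following along, that diagnostic fill looks roughly like the sketch below; mf and lev are placeholder names for the level's MultiFab and level index, not the actual variables in my code.)

    // Overwrite the data with level*10000 + rank*100 + comp so the source of
    // each value in the plot file can be identified. Sketch only; needs
    // AMReX_MultiFab.H and AMReX_ParallelDescriptor.H.
    const int rank = amrex::ParallelDescriptor::MyProc();
    for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto arr = mf.array(mfi);
        amrex::ParallelFor(bx, mf.nComp(),
            [=] AMREX_GPU_DEVICE (int i, int j, int k, int n)
        {
            arr(i,j,k,n) = lev*10000.0 + rank*100.0 + n;
        });
    }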

I started playing with printing out the offsets, write sizes, and sorted boxes inside AMReX but haven't gotten too far into that yet.

houjun commented 2 years ago

Hi @tpg2114, it seems like there is a bug with the HDF5 dataspace selection for the write. Do you have a reproducer I can run and debug?

tpg2114 commented 2 years ago

@houjun I don't at the moment. I'm going to see if I can get the amrex/Tests/HDF5Benchmark test to do it. I'm rebuilding things with testing turned on now to try it out; I'll let you know if I can replicate it there.

tpg2114 commented 2 years ago

Okay, @houjun and @atmyers, I think I have an example that breaks. I hacked apart the HDF5Benchmark case to make this happen; here is the main.cpp I modified, converted to a text file, and the input file, also converted to a text file. I ran with 36 processors just to make sure it didn't have boxes for every processor. In VisIt, I can see:

Level 1 visit0001

Level 2 visit0000

where it is clear the value in the upper-back corner is not being written correctly -- it should be 2, not 0.

Here is the Boxlib3D plot file written at the same time; it is correct: visit0002

tpg2114 commented 2 years ago

And here are the possibly-relevant configuration options:

-- The C compiler identification is IntelLLVM 2021.4.0                                                                                                                                                                                                                                    
-- The CXX compiler identification is IntelLLVM 2021.4.0                                                                                                                                                                                                                                  
-- Detecting C compiler ABI info                                                                                                                                                                                                                                                          
-- Detecting C compiler ABI info - done                                                                                                                                                                                                                                                   
-- Check for working C compiler: /opt/intel/oneapi/compiler/2021.4.0/linux/bin/icx - skipped                                                                                                                                                                                              
-- Detecting C compile features                                                                                                                                                                                                                                                           
-- Detecting C compile features - done                                                                                                                                                                                                                                                    
-- Detecting CXX compiler ABI info                                                                                                                                                                                                                                                        
-- Detecting CXX compiler ABI info - done                                                                                                                                                                                                                                                 
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2021.4.0/linux/bin/icpx - skipped                                                                                                                                                                                           
-- Detecting CXX compile features                                                                                                                                                                                                                                                         
-- Detecting CXX compile features - done                                                                                                                                                                                                                                                  
-- CMake version: 3.21.3                                                                                                                                                                                                                                                                  
-- AMReX installation directory: <redacted>                                                                                                                                                                                                                
-- Building AMReX with AMReX_SPACEDIM = 3                                                                                                                                                                                                                                                 
-- Configuring AMReX with the following options enabled:                                                                                                                                                                                                                                  
--    AMReX_FORTRAN                                                                                                                                                                                                                                                                       
--    AMReX_PRECISION = DOUBLE                                                                                                                                                                                                                                                            
--    AMReX_MPI                                                                                                                                                                                                                                                                           
--    AMReX_AMRLEVEL                                                                                                                                                                                                                                                                      
--    AMReX_FORTRAN_INTERFACES                                                                                                                                                                                                                                                            
--    AMReX_LINEAR_SOLVERS                                                                                                                                                                                                                                                                
--    AMReX_PARTICLES                                                                                                                                                                                                                                                                     
--    AMReX_PARTICLES_PRECISION = DOUBLE                                                                                                                                                                                                                                                  
--    AMReX_HDF5                                                                                                                                                                                                                                                                          
--    AMReX_PROBINIT                                                                                                                                                                                                                                                                      
-- The Fortran compiler identification is Intel 2021.4.0.20210910                                                                                                                                                                                                                         
-- Detecting Fortran compiler ABI info                                                                                                                                                                                                                                                    
-- Detecting Fortran compiler ABI info - done                                                                                                                                                                                                                                             
-- Check for working Fortran compiler: /opt/intel/oneapi/compiler/2021.4.0/linux/bin/intel64/ifort - skipped                                                                                                                                                                              
-- Checking whether /opt/intel/oneapi/compiler/2021.4.0/linux/bin/intel64/ifort supports Fortran 90                                                                                                                                                                                       
-- Checking whether /opt/intel/oneapi/compiler/2021.4.0/linux/bin/intel64/ifort supports Fortran 90 - yes                                                                                                                                                                                 
-- Looking for pthread.h                                                                                                                                                                                                                                                                  
-- Looking for pthread.h - found                                                                                                                                                                                                                                                          
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD                                                                                                                                                                                                                                                
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success                                                                                                                                                                                                                                      
-- Found Threads: TRUE                                                                                                                                                                                                                                                                    
-- Found MPI_C: /opt/mpi/OpenMPI4.0.3_intel2021.4/lib64/libmpi.so (found version "3.1")                                                                                                                                                                                                   
-- Found MPI_CXX: /opt/mpi/OpenMPI4.0.3_intel2021.4/lib64/libmpi.so (found version "3.1")                                                                                                                                                                                                 
-- Found MPI_Fortran: /opt/mpi/OpenMPI4.0.3_intel2021.4/lib64/libmpi_usempi_ignore_tkr.so (found version "3.1")                                                                                                                                                                           
-- Found MPI: TRUE (found version "3.1") found components: C CXX Fortran                                                                                                                                                                                                                  
-- Detecting Fortran/C Interface                                                                                                                                                                                                                                                          
-- Detecting Fortran/C Interface - Found GLOBAL and MODULE mangling                                                                                                                                                                                                                       
-- Fortran name mangling scheme: UNDERSCORE (lower case, append underscore)

houjun commented 2 years ago

@tpg2114 I'm able to write the data and generate the same plot with your modified code; I will start debugging.

tpg2114 commented 2 years ago

@houjun Great -- I think I see what is happening, but I don't quite understand enough about the code to know why yet. I went back to my original simulation and I printed out the rank, number of boxes on that rank, ch_offset, and hs_procsize for the level that was broken. Without tallying it all up, here is what happens:

    Level Rank Nboxes ch_offset hs_procsize
AM: 1     0    0      0         0
AM: 1     1    0      3584      0
AM: 1     2    0      7168      0
AM: 1     3    1      10752     3584
AM: 1     4    1      14336     3584
AM: 1     5    1      17920     3584
AM: 1     6    1      21504     3584
AM: 1     7    1      28672     3584
AM: 1     8    1      35840     3584
AM: 1     9    1      43008     7168
AM: 1     10   2      50176     7168
AM: 1     11   1      57344     7168
AM: 1     12   1      64512     7168
...
AM: 1     32   1      193536    3584
AM: 1     33   1      0         7168
AM: 1     34   1      0         7168
AM: 1     35   1      0         7168

So when it calculates the offsets for each processor, it isn't accounting for the fact that if there are no boxes, there is no offset. That means the running tally for later processors is incorrect. I think the value of 0 for the offset in ranks 33-35 shows up due to a buffer overrun and that might explain why some of those other core counts had errors or hangs -- they probably got something non-zero-but-garbage in that value.
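For reference, one way to keep that running tally correct when some ranks have no boxes is an exclusive prefix sum over the per-rank write sizes, since an empty rank then contributes zero. This is a sketch only (the actual fix in AMReX may differ), with procsize and nranks as illustrative names:

    // Exclusive prefix sum: offset[r] = sum of write sizes of ranks 0..r-1.
    // A rank with no boxes has procsize[r] == 0 and simply doesn't advance
    // the tally, so later ranks' offsets stay correct. Needs <vector>.
    std::vector<unsigned long long> offset(nranks, 0);
    for (int r = 1; r < nranks; ++r) {
        offset[r] = offset[r-1] + procsize[r-1];
    }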

houjun commented 2 years ago

@tpg2114 I believe I found the problem: it is due to the wrong offset/size-to-rank assignment when there are ranks with no box. Here is a patch with the fix; could you please apply it to AMReX and run your application code? It works for the HDF5Benchmark code with 36 ranks, and it should also fix the hanging issue you mentioned earlier. patch

tpg2114 commented 2 years ago

@houjun Some good news and some bad news -- the patch does fix the writing issue and I can now get the expected images from my dataset! It does not, however, fix the hanging with 31 processors for my code. It hangs in the call to:

ret = H5Dwrite(dataset, H5T_NATIVE_DOUBLE, memdataspace, dataspace, dxpl_col, a_buffer.dataPtr());

This happens on a level where all processors have at least one box, so it might have a different root cause. All processors reach the call but hang inside it, so it doesn't look like an issue of a processor not making it to the call.

houjun commented 2 years ago

@tpg2114 hmm... Any chance you can reproduce it with the HDF5Benchmark code? Here are two things to try:

  1. Add a printf message (you may need an fflush call to avoid buffering) before that H5Dwrite and check whether all ranks reach it; I suspect some ranks may have gotten stuck somewhere else.
  2. Change dxpl_col to dxpl_ind, which makes H5Dwrite use an MPI independent write instead of a collective one (see the sketch after this list).
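For readers unfamiliar with those two property lists, they are the standard HDF5 MPI-IO transfer modes. A minimal sketch of how they are typically created (the variable names here are illustrative, not necessarily identical to the AMReX writer's):

    // Collective transfer property list (what dxpl_col refers to):
    hid_t dxpl_col = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl_col, H5FD_MPIO_COLLECTIVE);

    // Independent transfer property list (what dxpl_ind refers to):
    hid_t dxpl_ind = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl_ind, H5FD_MPIO_INDEPENDENT);

    // The write then picks one of the two:
    // ret = H5Dwrite(dataset, H5T_NATIVE_DOUBLE, memdataspace, dataspace,
    //                dxpl_ind, a_buffer.dataPtr());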
tpg2114 commented 2 years ago

I'll try to reproduce it -- I did add a std::cout right before the call, and it shows all 31 processors hitting that call and none leaving it, for level 2 (second from the finest level).

But I'll see if I can get the HDF5Benchmark to reproduce it. It will likely be tomorrow before I can work on it more, though.

tpg2114 commented 2 years ago

Switching to dxpl_ind with 31 processors in my code does make it through that call and the file looks to be written correctly.

I am not well-versed in MPI-IO or parallel HDF5 (this is my first adventure in both) -- is the independent write vs. collective write just a performance difference, or is there a data integrity implication or something else?

houjun commented 2 years ago

If everything else is the same, independent and collective writes should differ only in performance. Since it works fine with independent but not collective, there could be an issue either with my code or with HDF5. I see you are using HDF5 1.12.1; do you have any other HDF5 versions (e.g., 1.10.x) installed that you can test with?

tpg2114 commented 2 years ago

I don't at the moment -- our code uses a CMake superbuild to assemble all the dependencies and the main code, and finding HDF5 is always a horrible pain, so we're using the CMake-ified HDF5 to help make it easier. Unfortunately, that means I have to patch AMReX's CMake to change the HDF5 version it looks for (at least with AMReX 21.11), while newer versions of AMReX changed how they find HDF5 in CMake and that, in turn, broke my build chain.

Anyway, long story short, I don't have an easy build environment with HDF5 1.10.* set up. But since there is no data-integrity issue with the independent write, I'll get my code into a working state so I can burn up some hours on the HPC systems over the holiday.

Once I get that done, I will try to set up a cleaner environment for testing and debugging than I have now. It'll take me a little time, but maybe Wednesday or Thursday I can try it with a different HDF5 version.

tpg2114 commented 2 years ago

It looks like the CMake HDF5 for versions newer than 1.10.3 might work correctly; I'll try 1.10.8 when I get a chance tomorrow.

houjun commented 2 years ago

Sounds good, please let me know what you find with 1.10.8.

tpg2114 commented 2 years ago

Hi @houjun -- collective IO with HDF5 1.10.8 also hangs on the 31-processor test with my code. Independent IO has no problem and writes the file correctly, then moves into the solver iterations.

I can't seem to reproduce it with the HDF5Benchmark test, though.

houjun commented 2 years ago

@tpg2114 Could you print the following values just before the hanging H5Dwrite: myProc, level, ch_offset[0], hs_procsize[0], and hs_allprocsize[0]?

tpg2114 commented 2 years ago

@houjun Sure, here it is:

H5DWrite: 0   2  0        46080  1428480
H5DWrite: 11  2  493056   27648  1428480
H5DWrite: 12  2  520704   27648  1428480
H5DWrite: 20  2  921600   36864  1428480
H5DWrite: 8   2  368640   41472  1428480
H5DWrite: 9   2  410112   41472  1428480
H5DWrite: 10  2  451584   41472  1428480
H5DWrite: 16  2  728064   41472  1428480
H5DWrite: 19  2  880128   41472  1428480
H5DWrite: 26  2  1198080  46080  1428480
H5DWrite: 30  2  1382400  46080  1428480
H5DWrite: 28  2  1290240  46080  1428480
H5DWrite: 4   2  184320   46080  1428480
H5DWrite: 2   2  92160    46080  1428480
H5DWrite: 6   2  276480   46080  1428480
H5DWrite: 18  2  824832   55296  1428480
H5DWrite: 22  2  1013760  46080  1428480
H5DWrite: 24  2  1105920  46080  1428480
H5DWrite: 29  2  1336320  46080  1428480
H5DWrite: 1   2  46080    46080  1428480
H5DWrite: 3   2  138240   46080  1428480
H5DWrite: 5   2  230400   46080  1428480
H5DWrite: 7   2  322560   46080  1428480
H5DWrite: 13  2  548352   46080  1428480
H5DWrite: 17  2  769536   55296  1428480
H5DWrite: 23  2  1059840  46080  1428480
H5DWrite: 25  2  1152000  46080  1428480
H5DWrite: 27  2  1244160  46080  1428480
H5DWrite: 14  2  594432   69120  1428480
H5DWrite: 21  2  958464   55296  1428480
H5DWrite: 15  2  663552   64512  1428480

If I were to call amrex_print(vars(level)%ba) and likewise for the distromap, and gave those to you, could they be read into the benchmark test to try to reproduce the issue there?

houjun commented 2 years ago

@tpg2114 Thanks! These should be enough to reproduce the issue, since everything seems to be fine until H5Dwrite. I have created a simple HDF5 program that uses the offsets and sizes you provided. I ran it with 31 ranks and it works with both dxpl_col and dxpl_ind, so I'm still not sure why you got the hang. Could you try running this code on the machine where you ran your application? code
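(houjun's attached program is linked above and not reproduced here; for context, a standalone collective-write test of this kind typically looks like the sketch below, with the illustrative offset/count replaced by the per-rank ch_offset and hs_procsize values from the table.)

    // Sketch of a standalone collective-write test (NOT the attached code):
    // each rank writes "count" doubles at "offset" into one shared 1-D dataset.
    #include <hdf5.h>
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Illustrative values: 31 equal chunks of 46080 fill a dataset of
        // 1428480. A real test would use the per-rank offsets/sizes above.
        hsize_t total = 1428480, count = 46080, offset = rank * count;

        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        hid_t filespace = H5Screate_simple(1, &total, NULL);
        hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hid_t memspace = H5Screate_simple(1, &count, NULL);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); // or H5FD_MPIO_INDEPENDENT

        std::vector<double> buf(count, (double)rank);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf.data());

        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
        MPI_Finalize();
        return 0;
    }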

tpg2114 commented 2 years ago

@houjun This MWE does in fact hang on my machine with 31 processors using both HDF5 1.12.1 and HDF5 1.10.8.

I will try this on a few different HPC machines and see if I can identify whether it is compiler-related or MPI-related. I am using the new-style LLVM-based Intel compilers (2021.4) with OpenMPI 4.0.3, with GNU 11.2.1 providing Intel with the standard libraries. I think I have access to enough other combinations of things on HPC that I can isolate each of those.

I appreciate the help!

tpg2114 commented 2 years ago

Okay -- so I've run it on several machines now and it appears the problem is in OpenMPI.

I've built and run with MPICH on my machine with the same compilers as the case that hangs, and it works just fine.

I guess I'm going to replace OpenMPI in our builds when MPI isn't already installed. I'm happy to help debug more if there are possible environment variable settings or mpirun options for OpenMPI that might shed some light on it.

houjun commented 2 years ago

@tpg2114 Hearing that OpenMPI is the cause of the hanging issue reminds me that I had some issues with OpenMPI using the ROMIO driver. Could you try running the program with the ompio driver and see if it still hangs: mpirun --mca io ompio -np 31

tpg2114 commented 2 years ago

@houjun It does hang with mpirun --mca io ompio -np 31, and looking at the documentation, the io ompio option is the default driver for my build. So I tried it with mpirun --mca io romio321 -np 31, which is the only other IO MCA my build has.

That does not hang.

So the problem seems to be in the OMPIO driver specifically, whereas the ROMIO driver works correctly.


Further testing reveals it is the fcoll MCA that is the problem. With the OMPIO driver, it defaults to the vulcan fcoll parameter. When I run with the OMPIO driver with --mca fcoll dynamic_gen2, --mca fcoll individual, or --mca fcoll two_phase, it works fine and doesn't hang. If I run with --mca fcoll vulcan, it hangs.

tpg2114 commented 2 years ago

And with those search terms in hand, it looks like this is a known bug in OpenMPI:

https://www.mail-archive.com/users@lists.open-mpi.org/msg33511.html

It was marked as fixed in 4.0.3 here, but it doesn't seem to be fixed for me.

houjun commented 2 years ago

OK, I probably remembered it wrong, and it was fine to use ROMIO but not OMPIO. I'll open an issue in the HDF5 GitHub repo and see if they know anything.