JCSDA-internal / ioda-converters

Various converters for getting obs data in and out of IODA

Fix crash with bufr2nc_fortran.x on macOS #1399

Closed by srherbener 10 months ago

srherbener commented 10 months ago

Description

This PR fixes crashing issues (found with ctests) of the Fortran executable bufr2nc_fortran.x on macOS. There were two issues:

Issue(s) addressed

Partially addresses #1378

Dependencies

List the other PRs that this PR is dependent on: None

Impact

Expected impact on downstream repositories: None, no functional changes.

Checklist

PatNichols commented 10 months ago

@srherbener On the mac intel version this is fixed using memory model flags when clang is being used with gfortran. It's worth investigating whether the mismatch between mac intel and mac arm might be problematic for multiple things. Can you check your compilers to see if the -mcmodel flag is being passed? (Running ccmake ../ and then toggling to see the compiler flags is the easiest way.)

srherbener commented 10 months ago

@PatNichols thanks for the suggestion about the memory model option (mcmodel). As you mention, I think it could be useful for some of the issues we see with macOS.

The apple-clang compiler accepts the -mcmodel=large option, so I gave that a try and unfortunately still got the Fortran executable crash. The issue in this case is that macOS allocates a shared library region within the executable's memory footprint, and if the program's static data allocation goes over 2GB it encroaches on that region. The bufr2nc_fortran.x code allocates ~3GB of static arrays that end up going past that 2GB limit. The reports I found searching the web seem to agree that the fix is to reduce static array allocations. (A minimal sketch of the mechanism is below.)
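
To make the failure mode concrete, here is a minimal sketch of the mechanism described above. It is entirely hypothetical (not taken from bufr2nc_fortran.x): a single static array pushes the program's static data past the 2GB mark, which on arm64 macOS is enough to run into the reserved shared library region, while the same storage requested through an allocatable array would come from the heap and avoid the problem.

    program big_static_demo
      ! Hypothetical reproducer: ~2.4 GB of static storage
      ! (300,000,000 elements * 8 bytes). The save attribute keeps the
      ! array in the executable's static data section.
      implicit none
      real(8), save :: big_static(300000000)

      ! Touch the array so the compiler cannot discard it.
      big_static(1) = 1.0d0
      print *, 'touched static array:', big_static(1)

      ! The same amount of memory declared as
      !   real(8), allocatable :: big_heap(:)
      !   allocate(big_heap(300000000))
      ! would come from the heap and stay out of the static footprint.
    end program big_static_demo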

However, I can see that using a different memory model should help as well, so perhaps I'm doing something wrong. I built from scratch, adding -mcmodel=large to the CMAKE_CXX_FLAGS and CMAKE_C_FLAGS defines on the ecbuild command line. Perhaps the issue is that this is Fortran-only code, so there is no way for the apple-clang options to affect the executable's memory model. Or perhaps, because of the macOS scheme with the shared library region, the mcmodel option simply does not help.

Another aspect of this is that the bufr2nc_fortran.x source code allocates ~3GB of static arrays whether or not they are ever used. I checked the code, and the only routine that uses those arrays is the read_HSD subroutine in the hsd_mod.f90 source file. So, it seemed appropriate to make those arrays local to read_HSD and make them allocatable (i.e., use the heap). If the intent is to eventually make those arrays accessible from the caller, then using giant global static arrays seems like a questionable design, especially since there are local arrays in other routines that match the names of some of the arrays in hsd_mod.f90 (e.g., longitude, latitude). It might be best to keep these arrays allocatable and local to the read_HSD routine (see the sketch below). What do you think?
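
For illustration, here is a minimal sketch of the kind of change described above. The module and subroutine names follow the description in this thread, but the array shapes and the body of read_HSD are invented; the point is just that fixed-size module arrays live in the executable's static data section, whereas local allocatables in read_HSD are taken from the heap only while the routine runs and are released when it returns.

    ! Sketch only: shapes and contents are hypothetical, not the real hsd_mod.f90.
    ! Before, the module declared large fixed-size arrays at module scope,
    ! e.g. real(8) :: latitude(nxy, nxy), longitude(nxy, nxy), which occupy
    ! static storage whether or not read_HSD is ever called.
    module hsd_mod
      implicit none
    contains
      subroutine read_HSD(filename)
        character(len=*), intent(in) :: filename
        integer, parameter :: nxy = 11000            ! hypothetical grid size
        real(8), allocatable :: latitude(:,:), longitude(:,:)
        integer :: ierr

        allocate(latitude(nxy, nxy), longitude(nxy, nxy), stat=ierr)
        if (ierr /= 0) then
          print *, 'read_HSD: allocation failed while reading ', trim(filename)
          return
        end if

        ! ... read the HSD file and fill the arrays here ...

        ! Explicit deallocation is optional: non-saved local allocatables are
        ! freed automatically when the subroutine returns.
        deallocate(latitude, longitude)
      end subroutine read_HSD
    end module hsd_mod

If the arrays do eventually need to be visible to callers, they could instead be passed out as allocatable dummy arguments or allocated-on-demand module variables, which would still keep the ~3GB out of the static data section.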

PatNichols commented 10 months ago

@srherbener the mcmodel flag allows any stack size, though medium and large handle this differently, so allocating the arrays should not be needed at all. The clang compiler on an intel mac uses mcmodel=large by default; one has to explicitly tell it not to do this. Since this code works fine on an intel mac but crashes on your mac arm chip, my view is that allocating the arrays is just putting a bandaid over what may be a much deeper problem (you should not have to do this). My first guess is that we messed up the instructions for setting up jedi on the M1/M2 chips. Given the weirdness with the float tolerance, it seems likely. This is creating a "very technical" debt problem that we should fix.

srherbener commented 10 months ago

@PatNichols I think you are voicing a valid concern about masking a deeper problem, and I agree we should have a solution that doesn't do this.

I'm not sure I understand your argument about unlimited stack size and allocation of arrays not being needed. You are talking about allocating arrays on the heap not being needed, correct? But this issue is not related to the stack; it's related to the data footprint size of the executable after it gets loaded into memory.

A couple of other points:

Because of these points, I think moving the arrays to the heap helps with some aspects of the technical debt.

However, I realize this doesn't address your concern about hiding a deeper problem.

I'll add @climbfuji as a reviewer to see if he can shed some light on this topic.

BenjaminRuston commented 10 months ago

@ashley314 can you try this with an ingest suite? It should be a lateral move.

climbfuji commented 10 months ago

@srherbener the mcmodel flag allows any stack size, though medium and large handle this differently, so allocating the arrays should not be needed at all. The clang compiler on an intel mac uses mcmodel=large by default; one has to explicitly tell it not to do this. Since this code works fine on an intel mac but crashes on your mac arm chip, my view is that allocating the arrays is just putting a bandaid over what may be a much deeper problem (you should not have to do this). My first guess is that we messed up the instructions for setting up jedi on the M1/M2 chips. Given the weirdness with the float tolerance, it seems likely. This is creating a "very technical" debt problem that we should fix.

I don't understand this argument about setting up JEDI on the M1/M2 chips. Can you explain what you mean?

I think @srherbener's suggestion to remove the BAD global module variables and make them allocatable (i.e., on the heap) in the code that uses them (and only there) is the right thing to do.

ashley314 commented 10 months ago

@ashley314 can you try this with an ingest suite? It should be a lateral move.

@BenjaminRuston Yup, I can do that this week!

ashley314 commented 10 months ago

Here is the list of ctests that failed on my M2 mac using i386. I'll run the converter on some ingest data too and let you know how it goes! Thanks @srherbener

The following tests FAILED:
     16 - test_util_signal_trap (Failed)
    351 - saber_test_error_covariance_training_bump_hdiag-nicas_2_1-1 (Failed)
    446 - saber_test_error_covariance_training_bump_hdiag-nicas_2_2-1 (Failed)
    1132 - ufo_test_tier1_test_ufo_qc_variableassignment (Failed)
    1260 - ufo_test_tier1_test_ufo_opr_gnssrorefmetoffice (Failed)
    1264 - ufo_test_tier1_test_ufo_opr_gnssrobendmetoffice_nopseudo (Failed)
    1272 - ufo_test_tier1_test_ufo_opr_groundgnssmetoffice (Failed)
    1327 - ufo_test_tier1_test_ufo_opr_scatwind_neutral_metoffice (Failed)
    1561 - test_iodaconv_bufr_ncep_1bamua2ioda (Failed)
    1562 - test_iodaconv_bufr_ncep_1bamua2ioda_n15 (Failed)
    1828 - fv3jedi_test_tier1_errorcovariance (Failed)
    1874 - fv3jedi_test_tier1_errorcovariance_bump (Failed)
    1905 - fv3jedi_test_tier1_gen_hybrid_linear_model_coeffs (Failed)
    1938 - test_soca_errorcovariance (Failed)

BenjaminRuston commented 10 months ago

@ashley314 yes the test of the converter in your ingest-observations.yaml Skylab experiment would be useful

@srherbener can likely tell you which of those are expected. That looks like a full jedi-bundle set of tests, correct? It looks like just 1561 and 1562 are iodaconv tests, and those are the ones that use the NCAR obs2ioda ;0)

PatNichols commented 10 months ago

@climbfuji the only place we see this crash is on the apple arm chips, M1 and M2. There was a previous PR dealing with another issue that only affected the apple arm chips. Unlike the intel mac builds, there are many ways to mess up your spack stack and your environment in general. I am not sure that @srherbener is understanding what I am saying, but the crash should not happen when the mcmodel flag is present. Possible explanations:

  1. there is a bug in gfortran, or on this particular chip gfortran does not play well with clang.
  2. there is a mixup in which ABI is being run. For example, if you invoke clang from zsh it will run arm64 if that is the default for that shell; in other words, being able to run both an emulated intel and a pure arm version can trip you up.
  3. perhaps something else.

I am not against merging this, but I would like a better understanding of why it was needed in the first place.

ashley314 commented 10 months ago

FYI - I had no issues running with these changes when converting prepbufr and MHS data

climbfuji commented 10 months ago

@climbfuji the only place we see this crash is on the apple arm chips, M1 and M2. There was a previous PR dealing with another issue that only affected the apple arm chips. Unlike the intel mac builds, there are many ways to mess up your spack stack and your environment in general. I am not sure that @srherbener is understanding what I am saying, but the crash should not happen when the mcmodel flag is present. Possible explanations:

  1. there is a bug in gfortran, or on this particular chip gfortran does not play well with clang.
  2. there is a mixup in which ABI is being run. For example, if you invoke clang from zsh it will run arm64 if that is the default for that shell; in other words, being able to run both an emulated intel and a pure arm version can trip you up.
  3. perhaps something else. I am not against merging this, but I would like a better understanding of why it was needed in the first place.

I don't disagree with you (is this another way of saying that I agree with you?), but I think the changes proposed by @srherbener make sense regardless and should be merged. But if you think that it's better to keep the current code for a bit longer because you are planning to investigate why the mcmodel flag doesn't work as expected, then that's ok with me. I don't think INFRA has time to look into this at the moment, though.

srherbener commented 10 months ago

Here is a discussion on the Apple developer site about the issue with using large static arrays. There is quite a reasonable explanation of why you see the crash on an arm64 mac but not on an intel mac.

https://developer.apple.com/forums/thread/676684

I'm not familiar enough with what mcmodel does, but perhaps you could convince the arm64 OS to load the executable image at a lower address. In that case, though, you would be circumventing the security protections gained by loading at 0x100000000, and I think that would be the wrong thing to do.

climbfuji commented 10 months ago

Thanks for finding this @srherbener ! This makes a lot of sense to me. So in a sense Apple's arm64 helps identify bad code and forces the developers to fix it :-) I think this PR should get merged based on those findings.

PatNichols commented 10 months ago

@srherbener @climbfuji I did some Stack Overflow browsing. Apparently the llvm/clang/clang++ system has changed from intel mac to M1 mac: the M1 mac no longer uses mcmodel=large unless you are compiling a shared library (and even there I am not sure). So it's basically a compiler change introduced by some of the security problems with the M1 chip kernel. The compiler should fail to compile or link the code. For example, on GNU with Linux, where there is the same problem with the 2GB limit, the code will not link without the mcmodel flag. You get errors like:

    /home/ubuntu/ioda_b_mcmodel/iodaconv/src/ncar-bufr2nc-fortran/main.f90:60:(.text+0x1d): relocation truncated to fit: R_X86_64_PC32 against symbol `__prepbufr_mod_MOD_do_tv_to_ts' defined in .bss section in CMakeFiles/bufr2nc_fortran.x.dir/prepbufr_mod.f90.o
    /home/ubuntu/ioda_b_mcmodel/iodaconv/src/ncar-bufr2nc-fortran/main.f90:304:(.text+0x387): relocation truncated to fit: R_X86_64_PC32 against symbol `__prepbufr_mod_MOD_do_tv_to_ts' defined in .bss section in CMakeFiles/bufr2nc_fortran.x.dir/prepbufr_mod.f90.o

Thus, in a perfect world, the linker should throw an error rather than link code that will crash. Given that, we might as well merge this.