ecmwf / fckit

A Fortran toolkit for interoperating Fortran with C/C++
https://confluence.ecmwf.int/display/fckit
Apache License 2.0

Issues with auto-detected HAVE_FINAL with latest Intel compilers 2021.9.0 #28

Open climbfuji opened 11 months ago

climbfuji commented 11 months ago

What happened?

We are seeing problems with the latest Intel classic compilers (icc@2021.9.0, icpc@2021.9.0; part of Intel's oneAPI@2023.1.0) on a new machine called Hercules (Mississippi State University / NOAA; Rocky Linux release 9.1 (Blue Onyx)).

Here is an example traceback for one of our ctests, fv3jedi_test_tier1_hyb-fgat_fv3lm (continued below).

1465: Info     : --- Configuration parameters
1465: Info     :        General:{"color log":false,"testing":false,"default seed":true,"reproducibility operators":true,"reproducibility threshold":9.9999999999999998e-13,"universe length-scale":20015806.220738243,"sampling method":"potential"}
1465: Info     :        I/O:{"data directory":".","files prefix":"Data/bump/fv3jedi_bumpparameters_nicas_geos","new files":true,"parallel netcdf":true,"io tasks":20,"alias":[{"in code":"eastward_wind","in file":"fixed_2500km_0.3"},{"in code":"northward_wind","in file":"fixed_25
00km_0.3"},{"in code":"air_temperature","in file":"fixed_2500km_0.3"},{"in code":"specific_humidity","in file":"fixed_2500km_0.3"},{"in code":"cloud_liquid_ice","in file":"fixed_2500km_0.3"},{"in code":"cloud_liquid_water","in file":"fixed_2500km_0.3"},{"in code":"mole_fraction
_of_ozone_in_air","in file":"fixed_2500km_0.3"},{"in code":"surface_pressure","in file":"fixed_2500km"}],"overriding sampling file":"","overriding vertical covariance file":[],"overriding vertical balance file":"","overriding moments file":[],"overriding lowres moments file":[]
,"overriding universe radius file":"","overriding nicas file":"","overriding psichitouv file":"","gsi data file":"","gsi namelist":""}
1465: Info     :        Drivers:{"compute covariance":false,"compute lowres covariance":false,"compute correlation":false,"compute lowres correlation":false,"compute localization":false,"compute lowres localization":false,"compute hybrid weights":false,"hybrid source":"","multi
variate strategy":"univariate","compute normality":false,"read local sampling":false,"read global sampling":false,"write local sampling":false,"write global sampling":false,"write sampling grids":false,"compute vertical covariance":false,"read vertical covariance":false,"write
vertical covariance":false,"compute vertical balance":false,"read vertical balance":false,"write vertical balance":false,"compute variance":false,"compute moments":false,"read moments":false,"write moments":false,"write diagnostics":false,"write diagnostics detail":false,"read
universe radius":false,"write universe radius":false,"compute nicas":false,"read local nicas":true,"read global nicas":false,"write local nicas":false,"write global nicas":false,"write nicas grids":false,"write nicas steps":false,"compute psichitouv":false,"read local psichitou
v":false,"write local psichitouv":false,"vertical balance inverse test":false,"adjoints test":false,"normalization test":0,"internal dirac test":false,"randomization test":false,"internal consistency test":false,"localization optimality test":false,"interpolate from gsi data":f
alse}
1465: Info     :        Model:{"variables":["eastward_wind","northward_wind","air_temperature","specific_humidity","cloud_liquid_ice","cloud_liquid_water","mole_fraction_of_ozone_in_air"],"2d variables":[],"do not cross mask boundaries":false,"level for 2d variables":"first","n
l0":72,"lev2d":"first"}
1465: Info     :        Ensemble sizes:{"total ensemble size":0,"sub-ensembles":1,"total lowres ensemble size":0,"lowres sub-ensembles":1}
1465: Info     :        Sampling:{"computation grid size":0,"diagnostic grid size":0,"distance classes":0,"angular sectors":1,"distance class width":0,"reduced levels":0,"local diagnostic":false,"averaging length-scale":0,"averaging latitude width":0,"grid type":"random","max n
umber of draws":10000,"interpolation type":"si","masks":[],"contiguous levels threshold":0}
1465: Info     :        diagnostics:[hercules-login-2:1617504:0:1617504] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
1465: {"target ensemble size":0,"target lowres ensemble size":0,"gaussian approximation":false,"generalized kurtosis threshold":1.7976931348623157e+308,"histogram bins":0,"diagnosed lengths scaling":1}
1465: Info     :        Vertical balance:{"vbal":[],"pseudo inverse":false,"dominant mode":0,"variance threshold":0,"identity blocks":false}
1465: [hercules-login-2:1617503:0:1617503] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
1465: [hercules-login-2:1617507:0:1617507] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
1465: [hercules-login-2:1617508:0:1617508] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
1465: [hercules-login-2:1617506:0:1617506] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
1465: [hercules-login-2:1617505:0:1617505] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
1465: ==== backtrace (tid:1617503) ====
1465:  0 0x0000000000054d90 __GI___sigaction()  :0
1465:  1 0x0000000000038027 fckit_shared_ptr_module_mp_owners_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-dev-20230717/cache/build_stage/spack-stage-fckit-0.10.1-vnk44k7j7qo22dgxy7smdoiud35jk23z/spack-src/src/fckit/module/fckit_shared_ptr.F90:263
1465:  2 0x0000000000037c8d fckit_shared_ptr_module_mp_fckit_shared_ptr__final_auto_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-dev-20230717/cache/build_stage/spack-stage-fckit-0.10.1-vnk44k7j7qo22dgxy7smdoiud35jk23z/spack-src/src/fckit/module/fckit_shared_ptr.F90:138
1465:  3 0x000000000047a40f for_finalize()  ???:0
1465:  4 0x000000000047a5e6 for_finalize()  ???:0
1465:  5 0x000000000047a5e6 for_finalize()  ???:0
1465:  6 0x000000000003e299 fckit_configuration_module_mp_deallocate_fckit_configuration_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-dev-20230717/cache/build_stage/spack-stage-fckit-0.10.1-vnk44k7j7qo22dgxy7smdoiud35jk23z/spack-src/src/fckit/module/fckit_configuration.F90:340
1465:  7 0x000000000003d998 fckit_configuration_module_mp_get_config_list_()  /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-dev-20230717/cache/build_stage/spack-stage-fckit-0.10.1-vnk44k7j7qo22dgxy7smdoiud35jk23z/spack-src/src/fckit/module/fckit_configuration.F90:599
1465:  8 0x0000000000d0cf53 type_nam_mp_nam_from_conf_()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/build-release/saber/src/saber/bump/lib//work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/bump/lib/type_nam.fypp:640
1465:  9 0x0000000000b1cb3c type_bump_mp_bump_create_()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/build-release/saber/src/saber/bump/lib//work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/bump/lib/type_bump.fypp:129
1465: 10 0x0000000000755ae5 bump_create_f90()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/bump/lib/type_bump_interface.F90:76
1465: 11 0x000000000074777b bump_lib::BUMP::BUMP()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/bump/lib/BUMP.cc:180
1465: 12 0x00000000007c1fe5 saber::bump::NICAS::NICAS()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/bump/NICAS.cc:68
1465: 13 0x000000000089e800 std::make_unique<saber::bump::NICAS, oops::GeometryData const&, oops::Variables const&, eckit::Configuration const&, saber::bump::NICASParameters const&, atlas::FieldSet const&, atlas::FieldSet const&, util::DateTime const&, unsigned long const&>()  /usr/include/c++/11/bits/unique_ptr.h:962
1465: 14 0x000000000089e800 saber::SaberCentralBlockMaker<saber::bump::NICAS>::make()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/../saber/oops/SaberCentralBlockBase.h:197
1465: 15 0x000000000094de6f saber::SaberCentralBlockFactory::create()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/oops/SaberCentralBlockBase.cc:68
1465: 16 0x000000000094d154 saber::SaberBlockChain::centralBlockInit()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/oops/SaberBlockChain.cc:37
1465: 17 0x00000000006e7da4 saber::buildCentralBlock<fv3jedi::Traits>()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/../saber/oops/Utilities.h:699
1465: 18 0x00000000006f5140 saber::ErrorCovariance<fv3jedi::Traits>::ErrorCovariance()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/saber/src/saber/../saber/oops/ErrorCovariance.h:299
1465: 19 0x00000000006fcf4d oops::CovarMaker<fv3jedi::Traits, saber::ErrorCovariance<fv3jedi::Traits> >::make()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/ModelSpaceCovarianceBase.h:233
1465: 20 0x0000000000586131 oops::CovarianceFactory<fv3jedi::Traits>::create()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/ModelSpaceCovarianceBase.h:281
1465: 21 0x0000000000585e24 oops::CovarianceFactory<fv3jedi::Traits>::create()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/ModelSpaceCovarianceBase.h:295
1465: 22 0x00000000005855f3 oops::HybridCovariance<fv3jedi::Traits>::HybridCovariance()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/HybridCovariance.h:77
1465: 23 0x00000000005855f3 std::__uniq_ptr_data<oops::ModelSpaceCovarianceBase<fv3jedi::Traits>, std::default_delete<oops::ModelSpaceCovarianceBase<fv3jedi::Traits> >, true, true>::__uniq_ptr_impl()  /usr/include/c++/11/bits/unique_ptr.h:210
1465: 24 0x00000000005855f3 std::unique_ptr<oops::ModelSpaceCovarianceBase<fv3jedi::Traits>, std::default_delete<oops::ModelSpaceCovarianceBase<fv3jedi::Traits> > >::unique_ptr<std::default_delete<oops::ModelSpaceCovarianceBase<fv3jedi::Traits> >, void>()  /usr/include/c++/11/bits/unique_ptr.h:281
1465: 25 0x00000000005855f3 oops::HybridCovariance<fv3jedi::Traits>::HybridCovariance()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/HybridCovariance.h:77
1465: 26 0x0000000000585364 oops::CovarMaker<fv3jedi::Traits, oops::HybridCovariance<fv3jedi::Traits> >::make()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/ModelSpaceCovarianceBase.h:233
1465: 27 0x0000000000586131 oops::CovarianceFactory<fv3jedi::Traits>::create()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/ModelSpaceCovarianceBase.h:281
1465: 28 0x0000000000585e24 oops::CovarianceFactory<fv3jedi::Traits>::create()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/base/ModelSpaceCovarianceBase.h:295
1465: 29 0x0000000000591a0e oops::CostJb3D<fv3jedi::Traits>::linearize()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/assimilation/CostJb3D.h:126
1465: 30 0x0000000000560b8d oops::CostJbTotal<fv3jedi::Traits, ufo::ObsTraits>::computeCostTraj()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/assimilation/CostJbTotal.h:233
1465: 31 0x000000000055c6a3 oops::CostFunction<fv3jedi::Traits, ufo::ObsTraits>::evaluate()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/assimilation/CostFunction.h:258
1465: 32 0x000000000055c6a3 oops::CostFunction<fv3jedi::Traits, ufo::ObsTraits>::evaluate()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/assimilation/CostFunction.h:259
1465: 33 0x0000000000557289 oops::IncrementalAssimilation<fv3jedi::Traits, ufo::ObsTraits>()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/assimilation/IncrementalAssimilation.h:67
1465: 34 0x0000000000557289 oops::IncrementalAssimilation<fv3jedi::Traits, ufo::ObsTraits>()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/assimilation/IncrementalAssimilation.h:67
1465: 35 0x0000000000555887 oops::Variational<fv3jedi::Traits, ufo::ObsTraits>::execute()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/runs/Variational.h:79
1465: 36 0x00000000000fa31d oops::Run::execute()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/oops/src/oops/runs/Run.cc:182
1465: 37 0x00000000004acc71 main()  /work2/noaa/da/dheinzel-new/skylab-test-20230717-hercules/jedi-bundle/fv3-jedi/src/mains/fv3jediVar.cc:25
1465: 38 0x000000000003feb0 __libc_start_call_main()  ???:0
1465: 39 0x000000000003ff60 __libc_start_main_alias_2()  :0
1465: 40 0x00000000004acb05 _start()  ???:0

The automatic detection of HAVE_FINAL in fckit's cmake config enables the feature. If I turn off the FINAL feature manually in the cmake build, the errors above disappear.

As part of my investigation I spent some time looking through the fckit code and cmake files, and I found several places that hint at problems with the FINAL feature in fckit itself. For example, there are build flags such as FCKIT_FINAL_BROKEN_FOR_ALLOCATABLE_ARRAY and FCKIT_FINAL_BROKEN_FOR_AUTOMATIC_ARRAY that simply turn off the related tests.

I am not entirely sure what the reasoning is behind auto-detecting the FINAL feature by default. From what I have seen, and without knowing what you know, I would expect the default for HAVE_FINAL to be OFF, with any user who sets it to ON greeted by big, bold warnings.

What are the steps to reproduce the bug?

The only system on which I currently have access to this particular version of oneAPI is MSU Hercules, therefore I cannot say if it is just the compiler or a combination of the compiler PLUS the machine. A slightly earlier version of Intel (2021.8.0, part of oneAPI@2023.0.0) did not have this problem with the finalization of the shared pointers (but had other problems that were indeed bugs in the compiler that forced us to move on to the latest version).

Version

0.10.1

Platform (OS and architecture)

Linux hercules-login-4.hpc.msstate.edu 5.14.0-162.12.1.el9_1.0.2.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jan 30 22:14:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux / Rocky Linux release 9.1 (Blue Onyx)

Relevant log output

see above

Accompanying data

n/a

Organisation

JCSDA

wdeconinck commented 10 months ago

Hi @climbfuji, I have also come to realise that the FINAL feature should probably not be enabled by default anymore. Different compilers tend to behave differently, and it is hard to predict everything that could go wrong.

In my recent experience it has also caused some issues. I should add more extensive tests to at least detect errors of the sort you described with ENABLE_FINAL=ON. I am OK with disabling this feature by default.

I can make this change of default, including a warning when enabling, for the next release.

It should be noted, though, that memory leaks will occur if the "final" method is then not called manually.
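To illustrate why a missing manual "final" call leaks once automatic finalization is off, here is a toy Python model of a reference-counted handle. It is a sketch under stated assumptions, not fckit's implementation: the class `SharedHandle`, the counter `live_objects`, and the helper functions are all hypothetical names chosen for this example.

```python
class SharedHandle:
    """Toy stand-in for a reference-counted handle (hypothetical, not fckit's API)."""

    # Number of modelled underlying objects that have not been released yet.
    live_objects = 0

    def __init__(self):
        # One-element list so that copies share the same owner counter.
        self._owners = [1]
        self._attached = True
        SharedHandle.live_objects += 1

    def copy(self):
        """Return a second handle to the same underlying object."""
        other = SharedHandle.__new__(SharedHandle)
        other._owners = self._owners
        other._attached = True
        self._owners[0] += 1
        return other

    def owners(self):
        """Current number of handles attached to the underlying object."""
        return self._owners[0]

    def final(self):
        """Detach this handle; release the underlying object with the last owner."""
        if not self._attached:
            return  # calling final() twice is harmless in this model
        self._attached = False
        self._owners[0] -= 1
        if self._owners[0] == 0:
            SharedHandle.live_objects -= 1


def without_manual_final():
    # The handle goes out of scope, but nothing decrements the owner count,
    # so the modelled underlying object is never released.
    handle = SharedHandle()
    return handle.owners()


def with_manual_final():
    # Explicit final() before leaving scope releases the underlying object.
    handle = SharedHandle()
    handle.final()
    return handle.owners()
```

In this model, each call to `without_manual_final()` raises `SharedHandle.live_objects` by one permanently, while `with_manual_final()` leaves it unchanged. That is the trade-off described above: with automatic finalization disabled, every handle needs an explicit `final` call.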

climbfuji commented 10 months ago

That's great to hear, thanks very much @wdeconinck !

climbfuji commented 5 months ago

@wdeconinck Any updates on this issue? For now we maintain logic in the spack recipe for fckit that turns off AUTO_FINAL if the compiler is intel@2021.8.0 or later and the setting is being autodetected. If the user turns it on with intel@2021.8.0 or later, spack aborts.
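The decision logic described in this comment can be sketched as a small standalone Python function. This is not the actual spack recipe and does not use spack's API. The function name `resolve_final`, the setting values `"auto"`, `"on"` and `"off"`, and the `None` return value (meaning "leave it to fckit's own detection") are assumptions made for illustration. The version threshold is the one quoted in the comment.

```python
from typing import Optional, Tuple

# Threshold quoted in the comment above: intel@2021.8.0 or later.
FIRST_AFFECTED_INTEL = (2021, 8, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2021.9.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def resolve_final(compiler: str, version: str, requested: str) -> Optional[bool]:
    """Decide how the FINAL feature should be configured.

    requested is 'auto', 'on' or 'off' (hypothetical setting names).
    Returns True (enable), False (disable) or None (defer to autodetection).
    Raises RuntimeError when the user forces the feature on with an affected compiler.
    """
    if requested not in ("auto", "on", "off"):
        raise ValueError(f"unknown setting: {requested!r}")

    affected = (
        compiler == "intel"
        and parse_version(version) >= FIRST_AFFECTED_INTEL
    )

    if requested == "off":
        return False
    if affected:
        if requested == "on":
            # Mirrors "spack aborts" in the comment above.
            raise RuntimeError(
                f"FINAL cannot be enabled with {compiler}@{version}"
            )
        return False  # 'auto' with an affected compiler: force the feature off
    if requested == "on":
        return True
    return None  # 'auto' with any other compiler: keep autodetection
```

For example, `resolve_final("intel", "2021.9.0", "auto")` returns `False`, and the same compiler with `"on"` raises an error instead of producing a build that may segfault at run time.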