ecmwf-ifs / ecrad

ECMWF atmospheric radiation scheme
https://confluence.ecmwf.int/display/ECRAD
Apache License 2.0

Compilation errors in parallel builds #19

Open 7schroet opened 7 months ago

7schroet commented 7 months ago

When compiling ecrad with multiple jobs (e.g. make -j 16), the compilation sometimes fails.

Test setup

The tests were conducted with gfortran v11.2.0 and multiple Intel compilers (ifort v2021.5.0, ifort v2021.10.0, ifx v2023.2.0). All tests used a version of NetCDF-Fortran v4.5.3 built with the respective compilers. The following script was run:

for i in `seq 1 22`; do  
make [PROFILE=intel] -j $i |& tee make_${i}.log
make clean
done

Errors

Some builds failed, usually for job counts >= 8. With gfortran, the error message was:

radiation_spectral_definition.F90:972:9:

  972 |     use radiation_constants, only : SpeedOfLight, BoltzmannConstant, PlanckConstant
      |         1   
Fatal Error: Cannot open module file 'radiation_constants.mod' for reading at (1): No such file or directory
compilation terminated.

For the Intel compilers, the error is similar:

radiation_spectral_definition.F90(972): error #7005: Error in reading the compiled module file.   [RADIATION_CONSTANTS]
    use radiation_constants, only : SpeedOfLight, BoltzmannConstant, PlanckConstant
--------^
radiation_spectral_definition.F90(982): error #6406: Conflicting attributes or multiple declaration of name.   [SPEEDOFLIGHT]
      freq = 100.0_jprd * real(SpeedOfLight,jprd) * real(wavenumber,jprd)
-------------------------------^
radiation_spectral_definition.F90(982): warning #7319: This argument's data type is incompatible with this intrinsic procedure; procedure assumed EXTERNAL.   [REAL]  
      freq = 100.0_jprd * real(SpeedOfLight,jprd) * real(wavenumber,jprd)
-------------------------------^
radiation_spectral_definition.F90(982): error #6404: This name does not have a type, and must have an explicit type.   [REAL]  
      freq = 100.0_jprd * real(SpeedOfLight,jprd) * real(wavenumber,jprd)
--------------------------^
radiation_spectral_definition.F90(982): warning #8889: Explicit interface or EXTERNAL declaration is required.   [REAL]
      freq = 100.0_jprd * real(SpeedOfLight,jprd) * real(wavenumber,jprd)
----------------------------------------------------^
radiation_spectral_definition.F90(982): error #7137: Any procedure referenced in a PURE procedure, including one referenced via a defined operation or assignment, must have an explicit interface and be declared PURE.   [REAL]  
      freq = 100.0_jprd * real(SpeedOfLight,jprd) * real(wavenumber,jprd)
[...]
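
To see at a glance which job counts failed, the make_${i}.log files written by the loop in the test setup can be scanned for the messages quoted above. This is only a sketch: the function name failed_job_counts is made up for illustration, and the two patterns cover just the gfortran ("Fatal Error") and Intel ("error #NNNN") messages shown in this report.

```shell
# Sketch: print the job counts whose build log contains a compiler error.
# Assumes the logs are named make_<N>.log in the current directory, as
# produced by the test loop. failed_job_counts is a hypothetical helper.
failed_job_counts() {
  local log n
  for log in make_*.log; do
    # Skip the unexpanded glob when no log file exists.
    [ -e "$log" ] || continue
    # Match gfortran "Fatal Error" and Intel "error #NNNN" lines only;
    # Intel "warning #NNNN" lines are deliberately ignored.
    if grep -Eq 'Fatal Error|error #[0-9]+' "$log"; then
      n=${log#make_}
      echo "${n%.log}"
    fi
  done | sort -n
}
```

Running failed_job_counts in the build directory after the loop prints one failing job count per line, in numeric order.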

Likely causes/solutions

Since this error occurs only in parallel builds, it seems that some prerequisites are missing from the radiation Makefile. A visualization of the dependencies with makefile2graph corroborates that assumption. The compilation errors shown above were resolved by adding the prerequisite radiation_spectral_definition.o: radiation_constants.o to the Makefile.

That exposed another missing prerequisite, this time for radiation_aerosol_optics.o:

radiation_aerosol_optics.F90:497:9:

  497 |     use radiation_aerosol,             only : aerosol_type
      |         1
Fatal Error: Cannot open module file 'radiation_aerosol.mod' for reading at (1): No such file or directory
compilation terminated.

Adding the prerequisite radiation_aerosol_optics.o: radiation_aerosol.o to the Makefile solves that as well.
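
Rather than finding such gaps one failed build at a time, the use statements in the sources could be compared against the Makefile. The following is a rough sketch, not a tested tool for ecrad. It assumes that each module NAME is defined in NAME.F90 in the current directory and that prerequisites are written on a single line as NAME.o: DEP1.o DEP2.o (no backslash continuations, no pattern rules). The function name check_deps is made up for illustration.

```shell
# Sketch: report "target.o: dep.o" pairs implied by Fortran "use" statements
# that the given Makefile does not declare.
# Assumptions (not verified against the ecrad sources): module NAME lives in
# NAME.F90 in the current directory, and each prerequisite rule is a single
# line "NAME.o: DEP1.o DEP2.o". check_deps is a hypothetical helper.
check_deps() {
  local makefile=$1 src target mod dep
  for src in *.F90; do
    [ -e "$src" ] || continue
    target="${src%.F90}.o"
    # Extract the module name from every "use <name>" line, lower-cased
    # because Fortran identifiers are case-insensitive.
    for mod in $(grep -iE '^[[:space:]]*use[[:space:]]+[a-z_0-9]+' "$src" \
                 | sed -E 's/^[[:space:]]*[Uu][Ss][Ee][[:space:]]+([A-Za-z_0-9]+).*/\1/' \
                 | tr 'A-Z' 'a-z' | sort -u); do
      dep="${mod}.o"
      # Ignore modules not built from a source file here (e.g. netcdf).
      [ -f "${mod}.F90" ] || continue
      [ "$dep" = "$target" ] && continue
      # Print the rule if no line "target: ... dep ..." is present.
      if ! grep -Eq "^${target}[[:space:]]*:(.*[[:space:]])?${dep}([[:space:]]|\$)" "$makefile"; then
        echo "${target}: ${dep}"
      fi
    done
  done
}
```

Called as check_deps Makefile from the directory holding the sources, it prints candidate rules in the same form as the two prerequisites above. The output needs a manual check, since use statements inside preprocessor conditionals or modules defined in differently named files would be misreported.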

reuterbal commented 4 months ago

Thanks - I was unable to reproduce the issue but these are spurious race conditions. The dependency was definitely missing from the Makefile and has been added in #22.