geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

Configuring incomplete, errors occurred #370

Closed TianYangMY closed 7 months ago

TianYangMY commented 9 months ago

Name and Institution (Required)

Name: lish Institution: SUN YAT-SEN UNIVERSITY

Description of your issue or question

When configuring with CMake, the following error occurs:

CMake Error at ESMA_cmake/compiler/flags/Intel_Fortran.cmake:59 (message):
  Unknown processor.  Contact Matt Thompson
Call Stack (most recent call first):
  ESMA_cmake/compiler/esma_compiler.cmake:9 (include)
  ESMA_cmake/esma.cmake:48 (include)
  CMakeLists.txt:55 (include)

-- Configuring incomplete, errors occurred!

Please provide as much detail as possible. Always include the GCHP version number and any relevant configuration and log files.

GCHP v14.2.3

module load compiler/intel/2021.3.0
module load mpi/intelmpi/2021.3.0
module load mathlib/hdf5/1.12.0-intel-2021
module load mathlib/netcdf/4.6.1-intel-2021
module load compiler/cmake/3.20.4
export CC=icc
export CXX=icpc
export FC=ifort

ESMF 8.4.1

lizziel commented 9 months ago

Hi @TianYangMY, I am pinging Matt Thompson on this issue since the error message indicates contacting him. He is a developer of the ESMA_cmake software used within GCHP.

@mathomp4, this error is occurring using ESMA_cmake 3.8.0. We have not had any reports of this before, so it must be specific to the configuration of the system @TianYangMY is using. Do you have any ideas what could be wrong? The error is occurring here:

cmake_host_system_information(RESULT proc_decription QUERY PROCESSOR_DESCRIPTION)
if (${proc_decription} MATCHES "EPYC")
   set (COREAVX2_FLAG "-march=core-avx2")
elseif (${proc_decription} MATCHES "Intel")
   set (COREAVX2_FLAG "-march=core-avx2")
   # Previous versions of GEOS used this flag, which was not portable
   # for AMD. Keeping here for a few versions for historical purposes.
   #set (COREAVX2_FLAG "-xCORE-AVX2")
else ()
   message(FATAL_ERROR "Unknown processor. Contact Matt Thompson")
endif ()

mathomp4 commented 9 months ago

@TianYangMY What type of machine are you building on? You must be on something new!

My first thought would be to edit ESMA_cmake/compiler/flags/Intel_Fortran.cmake and add a line after the cmake_host_system_information call to display what CMake thinks your system is:

cmake_host_system_information(RESULT proc_decription QUERY PROCESSOR_DESCRIPTION)
message(STATUS "proc_description: ${proc_description}")

TianYangMY commented 9 months ago

@mathomp4 I am running GCHP on a high-performance computing system; the processor is x86.

I added a line after the cmake_host_system_information call:

cmake_host_system_information(RESULT proc_decription QUERY PROCESSOR_DESCRIPTION)
message(STATUS "proc_description: ${proc_description}")

After running CMake again, I still get the same error, and CMake doesn't seem to recognize my system. The new output is:

proc_description: 
CMake Error at ESMA_cmake/compiler/flags/Intel_Fortran.cmake:60 (message):
  Unknown processor.  Contact Matt Thompson
Call Stack (most recent call first):
  ESMA_cmake/compiler/esma_compiler.cmake:9 (include)
  ESMA_cmake/esma.cmake:48 (include)
  CMakeLists.txt:55 (include)

mathomp4 commented 9 months ago

Well that is baffling. I have never seen CMake not return something there. Can you try adding this code:

## Print out the processor description
cmake_host_system_information(RESULT proc_description QUERY PROCESSOR_DESCRIPTION)
message(STATUS "Processor description: ${proc_description}")
## Print out the processor name
cmake_host_system_information(RESULT proc_name QUERY PROCESSOR_NAME)
message(STATUS "Processor name: ${proc_name}")
## Print out the processor serial number if HAS_SERIAL_NUMBER is true else print
## out that the processor has no serial number
cmake_host_system_information(RESULT has_serial QUERY HAS_SERIAL_NUMBER)
if (has_serial)
  cmake_host_system_information(RESULT proc_serial QUERY PROCESSOR_SERIAL_NUMBER)
  message(STATUS "Processor serial number: ${proc_serial}")
else ()
  message(STATUS "Processor has no serial number")
endif ()
## Print out CMAKE_HOST_SYSTEM_PROCESSOR
message(STATUS "CMAKE_HOST_SYSTEM_PROCESSOR: ${CMAKE_HOST_SYSTEM_PROCESSOR}")

and see what CMake spits out?

Can you also send the output of:

awk '/^$/ { exit } { print }' /proc/cpuinfo

which should show what the system itself knows about the processor.

Beyond that, I suppose a workaround for you would be to put:

   set (COREAVX2_FLAG "-march=core-avx2")

instead of the message(FATAL_ERROR) in:

else ()
   message(FATAL_ERROR "Unknown processor. Contact Matt Thompson")
endif ()

Most modern AMD and Intel chips support core-avx2. (Though it's hard for me to be sure of that without knowing what the processor is.)

TianYangMY commented 9 months ago

Thank you, @mathomp4!
I added the code:

## Print out the processor description
cmake_host_system_information(RESULT proc_description QUERY PROCESSOR_DESCRIPTION)
message(STATUS "Processor description: ${proc_description}")
## Print out the processor name
cmake_host_system_information(RESULT proc_name QUERY PROCESSOR_NAME)
message(STATUS "Processor name: ${proc_name}")
## Print out the processor serial number if HAS_SERIAL_NUMBER is true else print
## out that the processor has no serial number
cmake_host_system_information(RESULT has_serial QUERY HAS_SERIAL_NUMBER)
if (has_serial)
  cmake_host_system_information(RESULT proc_serial QUERY PROCESSOR_SERIAL_NUMBER)
  message(STATUS "Processor serial number: ${proc_serial}")
else ()
  message(STATUS "Processor has no serial number")
endif ()
## Print out CMAKE_HOST_SYSTEM_PROCESSOR
message(STATUS "CMAKE_HOST_SYSTEM_PROCESSOR: ${CMAKE_HOST_SYSTEM_PROCESSOR}")

The output is:

-- Processor description: 64 core Hygon C86 7285 32-core Processor
-- Processor name: Unknown Hygon family
-- Processor has no serial number
-- CMAKE_HOST_SYSTEM_PROCESSOR: x86_64
-- proc_description: 64 core Hygon C86 7285 32-core Processor
CMake Error at ESMA_cmake/compiler/flags/Intel_Fortran.cmake:77 (message):
  Unknown processor.  Contact Matt Thompson
Call Stack (most recent call first):
  ESMA_cmake/compiler/esma_compiler.cmake:9 (include)
  ESMA_cmake/esma.cmake:48 (include)
  CMakeLists.txt:55 (include)

The output of awk '/^$/ { exit } { print }' /proc/cpuinfo is:

processor       : 0
vendor_id       : HygonGenuine
cpu family      : 24
model           : 1
model name      : Hygon C86 7285 32-core Processor
stepping        : 1
microcode       : 0x80901047
cpu MHz         : 2000.000
cache size      : 512 KB
physical id     : 0
siblings        : 32
core id         : 0
cpu cores       : 32
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 hw_pstate sme retpoline_amd ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bogomips        : 3999.44
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]

I changed the code to:

cmake_host_system_information(RESULT proc_decription QUERY PROCESSOR_DESCRIPTION)
message(STATUS "proc_description: ${proc_description}")
if (${proc_decription} MATCHES "EPYC")
   set (COREAVX2_FLAG "-march=core-avx2")
elseif (${proc_decription} MATCHES "Intel")
   set (COREAVX2_FLAG "-march=core-avx2")
   # Previous versions of GEOS used this flag, which was not portable
   # for AMD. Keeping here for a few versions for historical purposes.
   #set (COREAVX2_FLAG "-xCORE-AVX2")
else ()
   set (COREAVX2_FLAG "-march=core-avx2")
   #message(FATAL_ERROR "Unknown processor. Contact Matt Thompson")
endif ()

The configuration was successful!

But when I compiled GCHP, there were a lot of warnings, such as:

 /work/home/lish325/GCHP_14/GCHP/src/FMS/mpp/mpp_efp.F90(764): remark #8291: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+7'.
      write (mesg,'("mpp_efp_list_sum_across_PEs error at ",i6," val was ",ES12.6, ", prec_error = ",ES12.6)') &
-----------------------------------------------------------------------------^
/work/home/lish325/GCHP_14/GCHP/src/FMS/mpp/mpp_efp.F90(764): remark #8291: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+7'.
      write (mesg,'("mpp_efp_list_sum_across_PEs error at ",i6," val was ",ES12.6, ", prec_error = ",ES12.6)') &
ifort: warning #10182: disabling optimization; runtime debug checks enabled
ifort: warning #10182: disabling optimization; runtime debug checks enabled
ifort: warning #10182: disabling optimization; runtime debug checks enabled
ifort: warning #10182: disabling optimization; runtime debug checks enabled
ifort: warning #10182: disabling optimization; runtime debug checks enabled
[ 64%] Building Fortran object src/fms_r8/CMakeFiles/fms_r8.dir/tracer_manager/tracer_manager.F90.o
[ 64%] Building Fortran object src/fms_r8/CMakeFiles/fms_r8.dir/field_manager/fm_util.F90.o
[ 64%] Building Fortran object src/fms_r8/CMakeFiles/fms_r8.dir/oda_tools/oda_core.F90.o
[ 64%] Building Fortran object src/fms_r8/CMakeFiles/fms_r8.dir/diag_manager/diag_output.F90.o
/work/home/lish325/GCHP_14/GCHP/src/FMS/tracer_manager/tracer_manager.F90(912): remark #8291: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+7'.
901 FORMAT(E12.6,1x,E12.6)
------------^
/work/home/lish325/GCHP_14/GCHP/src/FMS/tracer_manager/tracer_manager.F90(912): remark #8291: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+7'.
901 FORMAT(E12.6,1x,E12.6)
---------------------^
/work/home/lish325/GCHP_14/GCHP/src/FMS/tracer_manager/tracer_manager.F90(911): remark #8291: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+7'.
900 FORMAT(A,2(1x,E12.6))
-------------------^
/work/home/lish325/GCHP_14/GCHP/src/FMS/tracer_manager/tracer_manager.F90(1278): remark #8291: Recommended relationship between field width 'W' and the number of fractional digits 'D' in this edit descriptor is 'W>=D+7'.
  700 FORMAT (3A,E12.6,A,F10.6)

@lizziel I don't know whether these warnings affect the proper functioning of GCHP, but in the end GCHP installed successfully!

[100%] Built target GCHP
[100%] Built target GCHP_GridComp
[100%] Built target gchp
Install the project...
-- Install configuration: "RelWithDebInfo"

mathomp4 commented 9 months ago

Ahhh. A Hygon processor! I know of Hygon but I've never encountered one before!

From your cpuinfo output it should be avx2 compatible, so the flag you used should work. I'll add it to ESMA_cmake as a supported processor for the Intel compiler.

ETA: Well, I'll add support once you can confirm things run, @TianYangMY. I'm fairly confident it will, but better to be sure. 😄
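
As a quick cross-check of the avx2 observation, here is a minimal sketch of how the flags line in /proc/cpuinfo could be tested from the shell. The function name has_cpu_flag is hypothetical (it is not part of GCHP or ESMA_cmake), and the sketch assumes a Linux-style cpuinfo file with a line beginning with "flags". It only reports what the kernel lists; it does not guarantee that a given compiler option will work.

```shell
#!/bin/bash
# Hypothetical helper (not part of GCHP or ESMA_cmake): succeed if a
# cpuinfo-style file lists the given CPU flag as a whole word.
# Usage: has_cpu_flag <flag> [cpuinfo_file]
has_cpu_flag() {
    local flag="$1"
    local cpuinfo="${2:-/proc/cpuinfo}"
    # Take the first "flags" line, keep the text after the colon,
    # put one flag per line, then look for an exact fixed-string match.
    grep -m1 '^flags' "$cpuinfo" | cut -d: -f2 | tr -s ' \t' '\n' | grep -qxF "$flag"
}

if has_cpu_flag avx2; then
    echo "avx2 is listed in the CPU flags"
else
    echo "avx2 is not listed in the CPU flags"
fi
```

Matching whole words matters here because a plain substring search for "avx" would also match "avx2".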

lizziel commented 9 months ago

Hurray! @TianYangMY, seeing many warnings during the build is normal for GCHP. There is actually a separate issue open about trying to suppress them: https://github.com/geoschem/GCHP/issues/322. For now you can just ignore them.

TianYangMY commented 9 months ago

Thank you, @mathomp4 and @lizziel! I tried to run GCHP but encountered a strange problem. I used the following script to submit the GCHP run:

#!/bin/bash
#
#SBATCH -n 96
#SBATCH -N 2
#SBATCH -t 2:00:00
#SBATCH -p xahcnormal
#SBATCH --mem=80G
#SBATCH -o out.%j
#SBATCH -e err.%j

source gchp.env

#################################################################
#
# ADDITIONAL PRE-RUN CONFIGURATION
#
# If a subsequent command fails, treat it as fatal (don't continue)
set -e

# For remainder of script, echo commands to the job's log file
set -x

# Unlimit resources to prevent OS killing GCHP due to resource usage/
# Alternatively you can put this in your environment file.
ulimit -c 0                  # coredumpsize
ulimit -l unlimited          # memorylocked
ulimit -u 50000              # maxproc
ulimit -v unlimited          # vmemoryuse
ulimit -s unlimited          # stacksize
module list

# Define log name to include simulation start date
start_str=$(sed 's/ /_/g' cap_restart)
log=gchp.${start_str:0:13}z.log

# Update config files, set restart symlink, and do sanity checks
source setCommonRunSettings.sh
source setRestartLink.sh
source checkRunSettings.sh

# srun -n 96 -N 2 -m plane=24 --mpi=pmix ./gchp > ${log}
mpirun -np 96 ./gchp > ${log}

#################################################################
#
# POST-RUN COMMANDS
#

# Rename mid-run checkpoint files, if any. Discard file if time corresponds
# to run start time since duplicate with initial restart file.
chkpnts=$(ls Restarts)
for chkpnt in ${chkpnts}
do
    if [[ "$chkpnt" == *"gcchem_internal_checkpoint."* ]]; then
       chkpnt_time=${chkpnt:27:13}
       if [[ "${chkpnt_time}" = "${start_str:0:13}" ]]; then
          rm ./Restarts/${chkpnt}
       else
          new_chkpnt=./Restarts/GEOSChem.Restart.${chkpnt_time}z.c${N}.nc4
          mv ./Restarts/${chkpnt} ${new_chkpnt}
       fi
    fi
done
# If new start time in cap_restart is okay, rename restart file
# and update restart symlink
new_start_str=$(sed 's/ /_/g' cap_restart)
if [[ "${new_start_str}" = "${start_str}" || "${new_start_str}" = "" ]]; then
    echo "ERROR: GCHP failed to run to completion. Check the log file for more information."
    rm -f Restarts/gcchem_internal_checkpoint
    exit 1
else
    N=$(grep "CS_RES=" setCommonRunSettings.sh | cut -c 8- | xargs )
    mv Restarts/gcchem_internal_checkpoint Restarts/GEOSChem.Restart.${new_start_str:0:13}z.c${N}.nc4
    source setRestartLink.sh
fi

The log shows that GCHP started running normally, but I noticed that the last lines written to gchp.log are:

     object: 331,name: STATE_PSC
            type: Field
     object: 332,name: T_DAVG
            type: Field
     object: 333,name: T_PREVDAY
            type: Field
     object: 334,name: TropLev
            type: Field
     object: 335,name: WetDepNitrogen
            type: Field

After this, nothing more is written to gchp.log, but no errors are reported and the job I submitted keeps running. This seems to indicate that GCHP has stalled but is still occupying the CPUs. What could cause this?

allPEs.txt gchp.20190101_0000z.txt logfile.000000.out.txt

lizziel commented 9 months ago

This looks like an issue with the restart file. What simulation are you running (e.g. standard versus benchmark) and what version of the model are you using?

lizziel commented 9 months ago

Also, please make a new issue for this since it is a new subject. I will close out this issue after we move there.