esmf-org / esmf

The Earth System Modeling Framework (ESMF) is a suite of software tools for developing high-performance, multi-component Earth science modeling applications.
https://earthsystemmodeling.org/

ESMFI_Clock wrong data value on GCP with UFS #248

Open weihuang-jedi opened 1 month ago

weihuang-jedi commented 1 month ago

Hello, I am trying to run NOAA-EMC's global-workflow with a UFS atmosphere-only forecast, and I get the following error message:

[Wei.Huang@whcgepic-34 fcst.15036]$ more PET21.ESMF_LogFile
20240508 144120.013 ERROR PET21 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error
20240508 144120.013 ERROR PET21 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error
20240508 144120.013 ERROR PET21 UFS.F90:373 Wrong data value - Aborting UFS
20240508 144120.013 INFO PET21 Finalizing ESMF with endflag==ESMF_END_ABORT
20240508 161808.515 ERROR PET21 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error
20240508 161808.515 ERROR PET21 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error
20240508 161808.515 ERROR PET21 UFS.F90:373 Wrong data value - Aborting UFS
20240508 161808.515 INFO PET21 Finalizing ESMF with endflag==ESMF_END_ABORT
20240508 164511.748 ERROR PET21 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error
20240508 164511.748 ERROR PET21 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error
20240508 164511.748 ERROR PET21 UFS.F90:373 Wrong data value - Aborting UFS

On the UFS side, we see this error:

Linux weihuang-whcgepic-00034-1-0001 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
0:
0:
0: . . . . . . . . . . . . . . . . . . . .
0: PROGRAM ufs-weather-model HAS BEGUN. COMPILED 0.00 ORG: np23
0: STARTING DATE-TIME MAY 13,2024 14:43:40.978 134 MON 2460444
0:
0:
0: MPI Library = Intel(R) MPI Library 2021.3 for Linux* OS
0:
0: MPI Version = 3.1
21: Abort(1) on node 21 (rank 21 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 21
25: Abort(1) on node 25 (rank 25 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 25
29: Abort(1) on node 29 (rank 29 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 29
27: Abort(1) on node 27 (rank 27 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 27
26: Abort(1) on node 26 (rank 26 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 26
23: Abort(1) on node 23 (rank 23 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 23
srun: error: weihuang-whcgepic-00034-1-0001: tasks 21,25,27,29: Exited with exit code 1

The simple slurm script to reproduce this error is:

[Wei.Huang@whcgepic-34 fcst.15036]$ cat run.slurm
#!/bin/bash
#SBATCH --job-name=gfsfcst
#SBATCH --account=$USER
#SBATCH --qos=batch
#SBATCH --partition=compute
#SBATCH -t 01:15:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=36
#SBATCH --cpus-per-task=1
#SBATCH -o gfsfcst.%J.log
#SBATCH --export=NONE
#SBATCH --exclusive

set -x

export HOMEgfs=/contrib/$USER/src/global-workflow-cloud
source ${HOMEgfs}/workflow/gw_setup.sh

module list

cd /contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.15036

uname -a

# I_MPI_FABRICS=ofi | shm:ofi | shm
export I_MPI_FABRICS=shm:ofi

srun --mpi=pmi2 -l -n 30 --distribution=block:block --hint=nomultithread --cpus-per-task=1 ./ufs_model.x

srun --mpi=pmi2 -l -n 30 ./ufs_model.x

oehmke commented 1 month ago

Hi, It looks like the clock validate function is returning an error, so there may be an invalid value (e.g. a 0 time step) in the clock after it's set. Is there more information above that in the log file? If so, that might give more context about what the precise issue is. Thanks.
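A quick way to check whether any PET log carries more context than the lines already posted is to scan every log file in the run directory. The following is a minimal sketch, assuming the logs follow the `PET*.ESMF_LogFile` naming shown in this thread; the function name `collect_pet_errors` is made up for illustration and is not part of ESMF or UFS.

```bash
#!/bin/bash
# Hypothetical helper (not part of ESMF or UFS): print every ERROR entry from
# the per-PET ESMF logs in a run directory, prefixed with the log file name.
# Assumes the PET*.ESMF_LogFile naming seen in this thread.
collect_pet_errors() {
    local rundir="${1:-.}"
    local f
    for f in "$rundir"/PET*.ESMF_LogFile; do
        # Skip the unexpanded pattern when no log files exist.
        [ -e "$f" ] || continue
        # Field 3 of each log entry is the severity (ERROR, INFO, ...).
        awk -v name="$(basename "$f")" '$3 == "ERROR" { print name ": " $0 }' "$f"
    done
}

# Example: list the errors, then look at the first lines of each log,
# where any header information would appear if it was written.
# collect_pet_errors /path/to/rundir
# head -n 5 /path/to/rundir/PET*.ESMF_LogFile
```

If every log shows only the same three ERROR entries, that by itself is worth reporting back, since it narrows the problem to the `ESMF_ClockSet()` call.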

weihuang-jedi commented 1 month ago

I do not have other information, but:

@.*** fcst.11531]$ cat PET20.ESMF_LogFile
20240513 235517.133 ERROR PET20 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error
20240513 235517.133 ERROR PET20 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error
20240513 235517.133 ERROR PET20 UFS.F90:373 Wrong data value - Aborting UFS
20240513 235517.133 INFO PET20 Finalizing ESMF
20240514 140351.214 ERROR PET20 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error
20240514 140351.214 ERROR PET20 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error
20240514 140351.214 ERROR PET20 UFS.F90:373 Wrong data value - Aborting UFS
20240514 140351.214 INFO PET20 Finalizing ESMF

One thing that bothers me more is that this is running on Google Cloud, where we have two accounts: one account hits this error, while the other runs fine.

The ufs-weather-model MPI error is:

@.*** fcst.11531]$ more gfsfcst.2.log

Currently Loaded Modules:
  1) rocoto/1.3.3             10) py-markupsafe/2.1.3
  2) intel/2021.3.0           11) py-jinja2/3.0.3
  3) stack-intel/2021.3.0     12) libyaml/0.2.5
  4) gettext/0.19.8.1         13) py-pyyaml/6.0
  5) libxcrypt/4.4.35         14) openblas/0.3.24
  6) zlib/1.2.13              15) py-setuptools/63.4.3
  7) sqlite/3.43.2            16) py-numpy/1.22.3
  8) util-linux-uuid/2.38.1   17) git/1.8.3.1
  9) python/3.10.13           18) module_gwsetup.noaacloud

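Linux weihuang-whcgepic-00035-1-0001 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
0: MPI startup(): shm:tcp fabric is unknown or has been removed from the product, please use ofi or shm:ofi instead.
0:
0:
0: . . . . . . . . . . . . . . . . . . . .
0: PROGRAM ufs-weather-model HAS BEGUN. COMPILED 0.00 ORG: np23
0: STARTING DATE-TIME MAY 14,2024 14:03:49.975 135 TUE 2460445
0:
0:
0: MPI Library = Intel(R) MPI Library 2021.3 for Linux* OS
0:
0: MPI Version = 3.1
26: Abort(1) on node 26 (rank 26 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 26
20: Abort(1) on node 20 (rank 20 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 20
23: Abort(1) on node 23 (rank 23 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 23
25: Abort(1) on node 25 (rank 25 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 25
srun: error: weihuang-whcgepic-00035-1-0001: tasks 20,23,26: Exited with exit code 1
28: Abort(1) on node 28 (rank 28 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 28
22: Abort(1) on node 22 (rank 22 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 22
24: Abort(1) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 24
srun: error: weihuang-whcgepic-00035-1-0001: tasks 25,28: Exited with exit code 1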


weihuang-jedi commented 1 month ago

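More info on how ufs-weather-model was started:

+ atparse.bash[5]: set +x
+ parsing_ufs_configure.sh[98]: echo 'Rendered ufs.configure:'
Rendered ufs.configure:
+ parsing_ufs_configure.sh[99]: cat ufs.configure
#############################################
####  UFS Run-Time Configuration File  #####
#############################################

# ESMF #
logKindFlag: ESMF_LOGKIND_MULTI_ON_ERROR
globalResourceControl: true

# EARTH #
EARTH_component_list: ATM
EARTH_attributes::
  Verbosity = 0
::

# ATM #
ATM_model: fv3
ATM_petlist_bounds: 0 11
ATM_omp_num_threads: 1
ATM_attributes::
  Verbosity = 0
  Diagnostic = 0
::

# Run Sequence #
runSeq::
  ATM
::
+ parsing_ufs_configure.sh[101]: /bin/cp -p /contrib/Wei.Huang/src/global-workflow-cloud/sorc/ufs_model.fd/tests/parm/fd_ufs.yaml fd_ufs.yaml
+ parsing_ufs_configure.sh[103]: echo 'SUB UFS_configure: ufs.configure ends'
SUB UFS_configure: ufs.configure ends
+ exglobal_forecast.sh[136]: echo 'MAIN: Name lists and model configuration written'
MAIN: Name lists and model configuration written
+ exglobal_forecast.sh[141]: [[ .false. = .\t\r\u\e. ]]
+ exglobal_forecast.sh[146]: [[ YES == \Y\E\S ]]
+ exglobal_forecast.sh[147]: unset OMP_NUM_THREADS
+ exglobal_forecast.sh[152]: /bin/cp -p /contrib/Wei.Huang/src/global-workflow-cloud/exec/ufs_model.x /contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/
+ exglobal_forecast.sh[153]: srun --mpi=pmi2 -l -n 24 /contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/ufs_model.x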


oehmke commented 1 month ago

There should be more information in the PETLogFile. For example, usually there is information about the version of ESMF at the top. Are there other PETLogfiles that have more information? It’s surprising to me that it’s just giving you the end. Do you know where the ESMF_ClockSet() is called in UFS? If so, could you send me what that call looks like? (I.e. copy and paste it and a few lines around it.) Thanks!

It is strange that it runs on one account and not the other. Are both accounts using the same machine image?
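One low-effort way to compare the two accounts is to capture the environment under each and diff the results. This is only a sketch under stated assumptions: both accounts can write to a common location, and the helper names `snapshot_env` and `diff_env` are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helpers for comparing two accounts; the names are illustrative.

# Write a sorted dump of the current environment to the given file.
snapshot_env() {
    env | sort > "$1"
}

# Print only the lines that differ between two snapshots:
# "<" lines appear only in the first file, ">" lines only in the second.
diff_env() {
    diff "$1" "$2" | grep '^[<>]' || true
}

# Example (run snapshot_env once under each account, then compare):
# snapshot_env /shared/env.account_a
# snapshot_env /shared/env.account_b
# diff_env /shared/env.account_a /shared/env.account_b
```

Differences in MPI-related variables such as `I_MPI_FABRICS`, or in module paths, would be the first things to look at. The same approach works for the output of `module list` saved to a file.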


weihuang-jedi commented 1 month ago

That is all the info/msg I see during the run.

I am not very familiar with the ufs-weather-model code. The code is at: https://github.com/ufs-community/ufs-weather-model

The call to ESMF_ClockSet is shown below. A GitHub code search for ESMF_ClockSet in ufs-community/ufs-weather-model returns one file, driver/UFS.F90:
https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L370

Lines 367-373 of driver/UFS.F90:

    367   CALL ESMF_TimeIntervalSet(restartOffset, h_r8=fhrot, rc=RC)
    368   ESMF_ERR_ABORT(RC)
    369   CURRTIME = STARTTIME + restartOffset
    370   call ESMF_ClockSet(CLOCK_MAIN, currTime=CURRTIME, &
    371                      timeStep=(TIMESTEP-restartOffset), &
    372                      rc=RC)
    373   ESMF_ERR_ABORT(RC)
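Because the call passes `timeStep=(TIMESTEP-restartOffset)`, and `restartOffset` is built from `fhrot` in hours, the resulting step is zero whenever `fhrot` equals the length of `TIMESTEP`. Whether that is what happens in this run is not established in the thread. The sketch below only works through the arithmetic, with the step given in seconds, made-up input values, and a hypothetical function name `adjusted_timestep`.

```bash
#!/bin/bash
# Illustrative arithmetic only; adjusted_timestep is a hypothetical name.
# Mirrors timeStep=(TIMESTEP-restartOffset) from driver/UFS.F90, where
# restartOffset is fhrot expressed in hours.
#   $1 = TIMESTEP in seconds, $2 = fhrot in hours (may be fractional)
adjusted_timestep() {
    awk -v step="$1" -v fhrot="$2" 'BEGIN {
        result = step - fhrot * 3600
        # A zero step is the kind of invalid value suggested earlier in the thread.
        label = (result == 0) ? "ZERO" : "nonzero"
        printf "%d %s\n", result, label
    }'
}

# The 21600-second (6-hour) step here is an assumed value, not taken from the run.
adjusted_timestep 21600 0   # prints: 21600 nonzero
adjusted_timestep 21600 6   # prints: 0 ZERO
```

Printing the actual values of `fhrot` and `TIMESTEP` just before the `ESMF_ClockSet` call on both accounts would show whether the two runs really pass the same inputs.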


oehmke commented 1 month ago

Hmmm, it would be useful to have more information to help debug. If I give you a modified version of the ESMF code, could you run with that?

On May 14, 2024, at 11:44 AM, Wei Huang @.***> wrote:

That is all the info/msg I see during the run.

I am not very familiar with UFS-weather-model code, the code is at: https://github.com/ufs-community/ufs-weather-model

You may see the call to ESMF_ClockSet as below.

Search GitHub: repo:ufs-community/ufs-weather-model ESMF_ClockSet https://github.com/issues https://github.com/pulls https://github.com/notifications code Search Results · repo:ufs-community/ufs-weather-model ESMF_ClockSet Filter by

  • Code, 1 results1 (1)

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=code

  • Issues, 0 results0 (0)

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=issues

  • Pull requests, 0 results0 (0)

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=pullrequests

  • Discussions, 0 results0 (0)

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=discussions

  • Commits, 0 results0 (0)

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=commits

  • Packages, 0 results0 (0)

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=registrypackages

  • Wikis, 0 results0 (0)

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=wikis

  • Advanced
  • ‎Owner‎
  • ‎Symbol‎
  • ‎Exclude archived‎

https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet+NOT+is%3Aarchived&type=code

1 file (61 ms)1 fileinufs-community/ufs-weather-model https://github.com/ufs-community/ufs-weather-model(press backspace or delete to remove) Save driver/UFS.F90 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L370

367 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L367 CALL ESMF_TimeIntervalSet(restartOffset, h_r8=fhrot, rc=RC) 368 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L368 ESMF_ERR_ABORT(RC) 369 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L369 CURRTIME = STARTTIME + restartOffset 370 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L370 call ESMF_ClockSet(CLOCK_MAIN, currTime=CURRTIME, & 371 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L371 timeStep=(TIMESTEP-restartOffset), & 372 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L372 rc=RC) 373 https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L373 ESMF_ERR_ABORT(RC)

On Tue, May 14, 2024 at 11:29 AM oehmke @.***> wrote:

There should be more information in the PETLogFile. For example, usually there is information about the version of ESMF at the top. Are there other PETLogfiles that have more information? It’s surprising to me that it’s just giving you the end. Do you know where the ESMF_ClockSet() is called in UFS? If so, could you send me what that call looks like? (I.e. copy and paste it and a few lines around it.) Thanks!

That is strange that it runs on one and not the other account. Are they the same machine image?

On May 14, 2024, at 10:32 AM, Wei Huang @.***> wrote:

More info how ufs-weather-model started:

  • atparse.bash[5]: set +x
  • parsing_ufs_configure.sh[98]: echo 'Rendered ufs.configure:' Rendered ufs.configure:
  • parsing_ufs_configure.sh[99]: cat ufs.configure #############################################

    UFS Run-Time Configuration File

    #############################################

ESMF

logKindFlag: ESMF_LOGKIND_MULTI_ON_ERROR globalResourceControl: true

EARTH

EARTH_component_list: ATM EARTH_attributes:: Verbosity = 0 ::

ATM

ATM_model: fv3 ATM_petlist_bounds: 0 11 ATM_omp_num_threads: 1 ATM_attributes:: Verbosity = 0 Diagnostic = 0 ::

Run Sequence

runSeq:: ATM ::

  • parsing_ufs_configure.sh[101]: /bin/cp -p

/contrib/Wei.Huang/src/global-workflow-cloud/sorc/ufs_model.fd/tests/parm/fd_ufs.yaml

fd_ufs.yaml

  • parsing_ufs_configure.sh[103]: echo 'SUB UFS_configure: ufs.configure ends' SUB UFS_configure: ufs.configure ends
  • exglobal_forecast.sh[136]: echo 'MAIN: Name lists and model configuration written' MAIN: Name lists and model configuration written
  • exglobal_forecast.sh[141]: [[ .false. = .\t\r\u\e. ]]
  • exglobal_forecast.sh[146]: [[ YES == \Y\E\S ]]
  • exglobal_forecast.sh[147]: unset OMP_NUM_THREADS
  • exglobal_forecast.sh[152]: /bin/cp -p /contrib/Wei.Huang/src/global-workflow-cloud/exec/ufs_model.x /contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/
  • exglobal_forecast.sh[153]: srun --mpi=pmi2 -l -n 24

/contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/ufs_model.x

On Tue, May 14, 2024 at 10:27 AM Wei Huang - NOAA Affiliate < @.***> wrote:

I do not have other information, but:

@.*** fcst.11531]$ cat PET20.ESMF_LogFile 20240513 235517.133 ERROR PET20 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error 20240513 235517.133 ERROR PET20 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error 20240513 235517.133 ERROR PET20 UFS.F90:373 Wrong data value - Aborting UFS 20240513 235517.133 INFO PET20 Finalizing ESMF 20240514 140351.214 ERROR PET20 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error 20240514 140351.214 ERROR PET20 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error 20240514 140351.214 ERROR PET20 UFS.F90:373 Wrong data value - Aborting UFS 20240514 140351.214 INFO PET20 Finalizing ESMF

One thing bother me more is that this is running on Google cloud, where we have two accounts, one account has this error, the other runs fine.

The ufs-weather-model mpi error is:

@.*** fcst.11531]$ more gfsfcst.2.log

Currently Loaded Modules:
  1) rocoto/1.3.3
  2) intel/2021.3.0
  3) stack-intel/2021.3.0
  4) gettext/0.19.8.1
  5) libxcrypt/4.4.35
  6) zlib/1.2.13
  7) sqlite/3.43.2
  8) util-linux-uuid/2.38.1
  9) python/3.10.13
  10) py-markupsafe/2.1.3
  11) py-jinja2/3.0.3
  12) libyaml/0.2.5
  13) py-pyyaml/6.0
  14) openblas/0.3.24
  15) py-setuptools/63.4.3
  16) py-numpy/1.22.3
  17) git/1.8.3.1
  18) module_gwsetup.noaacloud

Linux weihuang-whcgepic-00035-1-0001 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
0: MPI startup(): shm:tcp fabric is unknown or has been removed from the product, please use ofi or shm:ofi instead.
0:
0:
0: . . . . . . . . . . . . . . . . . . . .
0: PROGRAM ufs-weather-model HAS BEGUN. COMPILED 0.00 ORG: np23
0: STARTING DATE-TIME MAY 14,2024 14:03:49.975 135 TUE 2460445
0:
0:
0: MPI Library = Intel(R) MPI Library 2021.3 for Linux* OS
0:
0: MPI Version = 3.1
26: Abort(1) on node 26 (rank 26 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 26
20: Abort(1) on node 20 (rank 20 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 20
23: Abort(1) on node 23 (rank 23 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 23
25: Abort(1) on node 25 (rank 25 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 25
srun: error: weihuang-whcgepic-00035-1-0001: tasks 20,23,26: Exited with exit code 1
28: Abort(1) on node 28 (rank 28 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 28
22: Abort(1) on node 22 (rank 22 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 22
24: Abort(1) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 24
srun: error: weihuang-whcgepic-00035-1-0001: tasks 25,28: Exited with exit code 1

On Tue, May 14, 2024 at 10:21 AM oehmke @.***> wrote:

Hi, It looks like the clock validate function is returning an error, so there may be an invalid value (e.g. a 0 time step) in the clock after it's set. Is there more information above that in the log file? If so, that might give more context about what the precise issue is. Thanks.
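
The zero-time-step case suggested above can be checked by hand, since the UFS.F90 call quoted in this thread passes timeStep=(TIMESTEP-restartOffset). The following is a minimal sketch only, assuming the step is known in seconds and the restart offset (fhrot) in hours; the function name and the sample values are hypothetical and do not come from UFS or ESMF.

```bash
#!/usr/bin/env bash
# Hypothetical helper: redo the subtraction TIMESTEP - restartOffset with
# plain numbers and report whether the result is exactly zero.
# Arguments: step length in seconds, restart offset (fhrot) in hours.
check_adjusted_step() {
  local step_s="$1" fhrot_h="$2"
  awk -v s="$step_s" -v h="$fhrot_h" 'BEGIN {
    adj = s - h * 3600                 # seconds left after removing the offset
    if (adj == 0) { printf "step=%d zero\n", adj; exit 1 }
    printf "step=%d nonzero\n", adj
  }'
}
```

For example, `check_adjusted_step 21600 6` prints `step=0 zero` and returns a non-zero status, which is the situation the comment above describes. Whether the real run ever reaches that state depends on values this thread does not show.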


weihuang-jedi commented 1 month ago

That is pretty hard, as lots of things have already been bundled together. Replacing one of them, here ESMF, is not straightforward.

On Tue, May 14, 2024 at 5:23 PM oehmke @.***> wrote:

Hmmm, it would be useful to have more information to help debug. If I give you a modified version of the ESMF code, could you run with that?

On May 14, 2024, at 11:44 AM, Wei Huang @.***> wrote:

That is all the info/msg I see during the run.

I am not very familiar with the ufs-weather-model code, which is at: https://github.com/ufs-community/ufs-weather-model

The call to ESMF_ClockSet is shown below.


driver/UFS.F90, lines 367-373 (https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L370):

    CALL ESMF_TimeIntervalSet(restartOffset, h_r8=fhrot, rc=RC)
    ESMF_ERR_ABORT(RC)
    CURRTIME = STARTTIME + restartOffset
    call ESMF_ClockSet(CLOCK_MAIN, currTime=CURRTIME, &
                       timeStep=(TIMESTEP-restartOffset), &
                       rc=RC)
    ESMF_ERR_ABORT(RC)

On Tue, May 14, 2024 at 11:29 AM oehmke @.***> wrote:

There should be more information in the PETLogFile. For example, usually there is information about the version of ESMF at the top. Are there other PETLogfiles that have more information? It’s surprising to me that it’s just giving you the end. Do you know where the ESMF_ClockSet() is called in UFS? If so, could you send me what that call looks like? (I.e. copy and paste it and a few lines around it.) Thanks!

That is strange that it runs on one and not the other account. Are they the same machine image?
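
To answer the question about other PET log files quickly, the run directory can be scanned for every log and its number of ERROR entries. This is a rough sketch, assuming the PET<N>.ESMF_LogFile naming and the " ERROR " field seen in the logs quoted in this thread; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# List each PET log in a run directory with its count of ERROR entries.
# Usage: summarize_pet_errors [RUNDIR]   (defaults to the current directory)
summarize_pet_errors() {
  local rundir="${1:-.}" f n
  for f in "$rundir"/PET*.ESMF_LogFile; do
    [ -e "$f" ] || continue                    # the glob matched nothing
    n=$(grep -c ' ERROR ' "$f" || true)        # grep -c still prints 0 on no match
    printf '%s %s\n' "$(basename "$f")" "$n"
  done
}
```

Running it in the forecast directory prints one line per log file, which makes it easy to see whether any PET wrote more than the three ClockSet errors.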

On May 14, 2024, at 10:32 AM, Wei Huang @.***> wrote:

More info on how ufs-weather-model was started:

  • atparse.bash[5]: set +x
  • parsing_ufs_configure.sh[98]: echo 'Rendered ufs.configure:'
Rendered ufs.configure:
  • parsing_ufs_configure.sh[99]: cat ufs.configure
#############################################
####  UFS Run-Time Configuration File  #####
#############################################

# ESMF #
logKindFlag:            ESMF_LOGKIND_MULTI_ON_ERROR
globalResourceControl:  true

# EARTH #
EARTH_component_list: ATM
EARTH_attributes::
  Verbosity = 0
::

# ATM #
ATM_model:                      fv3
ATM_petlist_bounds:             0 11
ATM_omp_num_threads:            1
ATM_attributes::
  Verbosity = 0
  Diagnostic = 0
::

# Run Sequence #
runSeq::
  ATM
::
  • parsing_ufs_configure.sh[101]: /bin/cp -p /contrib/Wei.Huang/src/global-workflow-cloud/sorc/ufs_model.fd/tests/parm/fd_ufs.yaml fd_ufs.yaml
  • parsing_ufs_configure.sh[103]: echo 'SUB UFS_configure: ufs.configure ends'
SUB UFS_configure: ufs.configure ends
  • exglobal_forecast.sh[136]: echo 'MAIN: Name lists and model configuration written'
MAIN: Name lists and model configuration written
  • exglobal_forecast.sh[141]: [[ .false. = .\t\r\u\e. ]]
  • exglobal_forecast.sh[146]: [[ YES == \Y\E\S ]]
  • exglobal_forecast.sh[147]: unset OMP_NUM_THREADS
  • exglobal_forecast.sh[152]: /bin/cp -p /contrib/Wei.Huang/src/global-workflow-cloud/exec/ufs_model.x /contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/
  • exglobal_forecast.sh[153]: srun --mpi=pmi2 -l -n 24 /contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/ufs_model.x


weihuang-jedi commented 1 month ago

I removed the global-workflow code, then re-cloned and re-compiled it, and now it runs fine. I cannot explain why, but the executable size is slightly different.

I'll let you know if the problem comes back.

Regards,

Wei
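
Since the rebuilt executable differs slightly in size, a direct comparison of the old and new builds (or of the builds under the two accounts) may help narrow things down if the problem returns. A minimal sketch using only POSIX `cksum` and `cmp`; the function name is hypothetical, and it assumes both copies of the executable are still available.

```bash
#!/usr/bin/env bash
# Compare two builds of an executable.
# Prints checksum, size in bytes and path for each file, then a verdict.
compare_builds() {
  local a="$1" b="$2"
  cksum "$a" "$b"                 # one line per file: CRC, byte count, path
  if cmp -s "$a" "$b"; then       # cmp -s is silent and reports via exit status
    echo identical
  else
    echo different
    return 1
  fi
}
```

It would be called as `compare_builds old/ufs_model.x new/ufs_model.x`, where both paths are placeholders. A difference only shows that the builds are not byte-identical; it does not say which input to the build changed.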


oehmke commented 1 month ago

Very strange. Yes, do let us know if it comes back. One thing I was thinking about is that a lot of the values used to set the clock come from a Config file. I wondered if it was possible that the file was missing, or in a different place, in this account compared with the one that worked. However, that doesn't really jibe with this recent result. Let's see if it comes back; maybe that will give us more information.

Cheers,
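
If the error does come back, the missing-config-file idea above is cheap to test before launching the model. A rough sketch that reports files that are absent or empty in the run directory; the function name is hypothetical, and the list of files to check is left to the caller because this thread only shows ufs.configure.

```bash
#!/usr/bin/env bash
# Report run-directory input files that are missing or empty.
# Usage: check_run_inputs RUNDIR FILE...
check_run_inputs() {
  local rundir="$1" f status=0
  shift
  for f in "$@"; do
    if [ ! -s "$rundir/$f" ]; then      # -s: file exists and has size > 0
      echo "missing or empty: $f"
      status=1
    fi
  done
  return "$status"
}
```

A call might look like `check_run_inputs "$PWD" ufs.configure model_configure`, where model_configure is an assumption about which file supplies the clock values. Running the same check under both accounts would show whether their run directories differ.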

On May 17, 2024, at 11:51 AM, Wei Huang @.***> wrote:

Remove the global-workflow code, and then re-clone, re-compile, then it runs fine. Can not explain why. but the executable size is slightly different.

I'll let you know if the problem comes back.

Regards,

Wei

On Tue, May 14, 2024 at 6:06 PM Wei Huang - NOAA Affiliate < @.***> wrote:

That is pretty hard, as lots of things has already bundled together. To replace one, here ESMF, is not straightforward.

On Tue, May 14, 2024 at 5:23 PM oehmke @.***> wrote:

Hmmm, it would be useful to have more information to help debug. If I give you a modified version of the ESMF code, could you run with that?

On May 14, 2024, at 11:44 AM, Wei Huang @.***> wrote:

That is all the info/msg I see during the run.

I am not very familiar with UFS-weather-model code, the code is at: https://github.com/ufs-community/ufs-weather-model

You may see the call to ESMF_ClockSet as below.

Search GitHub: repo:ufs-community/ufs-weather-model ESMF_ClockSet https://github.com/issues https://github.com/pulls https://github.com/notifications code Search Results · repo:ufs-community/ufs-weather-model ESMF_ClockSet Filter by

  • Code, 1 results1 (1)

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=code>

  • Issues, 0 results0 (0)

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=issues>

  • Pull requests, 0 results0 (0)

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=pullrequests>

  • Discussions, 0 results0 (0)

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=discussions>

  • Commits, 0 results0 (0)

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=commits>

  • Packages, 0 results0 (0)

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=registrypackages>

  • Wikis, 0 results0 (0)

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet&type=wikis>

  • Advanced
  • ‎Owner‎
  • ‎Symbol‎
  • ‎Exclude archived‎

< https://github.com/search?q=repo%3Aufs-community%2Fufs-weather-model+ESMF_ClockSet+NOT+is%3Aarchived&type=code>

1 file (61 ms)1 fileinufs-community/ufs-weather-model https://github.com/ufs-community/ufs-weather-model(press backspace or delete to remove) Save driver/UFS.F90 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L370>

367 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L367>

CALL ESMF_TimeIntervalSet(restartOffset, h_r8=fhrot, rc=RC) 368 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L368>

ESMF_ERR_ABORT(RC) 369 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L369>

CURRTIME = STARTTIME + restartOffset 370 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L370>

call ESMF_ClockSet(CLOCK_MAIN, currTime=CURRTIME, & 371 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L371>

timeStep=(TIMESTEP-restartOffset), & 372 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L372>

rc=RC) 373 < https://github.com/ufs-community/ufs-weather-model/blob/b2668e84f3d3046482c422db9ec7b11cfbcbb79b/driver/UFS.F90#L373>

ESMF_ERR_ABORT(RC)

On Tue, May 14, 2024 at 11:29 AM oehmke @.***> wrote:

There should be more information in the PETLogFile. For example, usually there is information about the version of ESMF at the top. Are there other PETLogfiles that have more information? It’s surprising to me that it’s just giving you the end. Do you know where the ESMF_ClockSet() is called in UFS? If so, could you send me what that call looks like? (I.e. copy and paste it and a few lines around it.) Thanks!

That is strange that it runs on one and not the other account. Are they the same machine image?

On May 14, 2024, at 10:32 AM, Wei Huang @.***> wrote:

More info how ufs-weather-model started:

  • atparse.bash[5]: set +x
  • parsing_ufs_configure.sh[98]: echo 'Rendered ufs.configure:' Rendered ufs.configure:
  • parsing_ufs_configure.sh[99]: cat ufs.configure #############################################

    UFS Run-Time Configuration File

    #############################################

ESMF

logKindFlag: ESMF_LOGKIND_MULTI_ON_ERROR globalResourceControl: true

EARTH

EARTH_component_list: ATM EARTH_attributes:: Verbosity = 0 ::

ATM

ATM_model: fv3 ATM_petlist_bounds: 0 11 ATM_omp_num_threads: 1 ATM_attributes:: Verbosity = 0 Diagnostic = 0 ::

Run Sequence

runSeq:: ATM ::

  • parsing_ufs_configure.sh[101]: /bin/cp -p

/contrib/Wei.Huang/src/global-workflow-cloud/sorc/ufs_model.fd/tests/parm/fd_ufs.yaml

fd_ufs.yaml

  • parsing_ufs_configure.sh[103]: echo 'SUB UFS_configure: ufs.configure ends' SUB UFS_configure: ufs.configure ends
  • exglobal_forecast.sh[136]: echo 'MAIN: Name lists and model configuration written' MAIN: Name lists and model configuration written
  • exglobal_forecast.sh[141]: [[ .false. = .\t\r\u\e. ]]
  • exglobal_forecast.sh[146]: [[ YES == \Y\E\S ]]
  • exglobal_forecast.sh[147]: unset OMP_NUM_THREADS
  • exglobal_forecast.sh[152]: /bin/cp -p /contrib/Wei.Huang/src/global-workflow-cloud/exec/ufs_model.x

/contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/

  • exglobal_forecast.sh[153]: srun --mpi=pmi2 -l -n 24

/contrib/Wei.Huang/stmp/RUNDIRS/c48atm/gfsfcst.2024010100/fcst.11531/ufs_model.x

On Tue, May 14, 2024 at 10:27 AM Wei Huang - NOAA Affiliate < @.***> wrote:

I do not have other information, but:

@.*** fcst.11531]$ cat PET20.ESMF_LogFile 20240513 235517.133 ERROR PET20 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error 20240513 235517.133 ERROR PET20 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error 20240513 235517.133 ERROR PET20 UFS.F90:373 Wrong data value - Aborting UFS 20240513 235517.133 INFO PET20 Finalizing ESMF 20240514 140351.214 ERROR PET20 ESMCI_Clock.C:373 ESMCI::Clock::set() Wrong data value - Internal subroutine call returned Error 20240514 140351.214 ERROR PET20 ESMF_Clock.F90:1695 ESMF_ClockSet() Wrong data value - Internal subroutine call returned Error 20240514 140351.214 ERROR PET20 UFS.F90:373 Wrong data value - Aborting UFS 20240514 140351.214 INFO PET20 Finalizing ESMF

One thing bother me more is that this is running on Google cloud, where we have two accounts, one account has this error, the other runs fine.

The ufs-weather-model mpi error is:

@.*** fcst.11531]$ more gfsfcst.2.log

Currently Loaded Modules: 1) rocoto/1.3.3 10) py-markupsafe/2.1.3 2) intel/2021.3.0 11) py-jinja2/3.0.3 3) stack-intel/2021.3.0 12) libyaml/0.2.5 4) gettext/0.19.8.1 13) py-pyyaml/6.0 5) libxcrypt/4.4.35 14) openblas/0.3.24 6) zlib/1.2.13 15) py-setuptools/63.4.3 7) sqlite/3.43.2 16) py-numpy/1.22.3 8) util-linux-uuid/2.38.1 17) git/1.8.3.1 9) python/3.10.13 18) module_gwsetup.noaacloud

Linux weihuang-whcgepic-00035-1-0001 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
0: MPI startup(): shm:tcp fabric is unknown or has been removed from the product, please use ofi or shm:ofi instead.
0:
0:
0: . . . . . . . . . . . . . . . . * . * . . .
0: PROGRAM ufs-weather-model HAS BEGUN. COMPILED 0.00 ORG: np23
0: STARTING DATE-TIME MAY 14,2024 14:03:49.975 135 TUE 2460445
0:
0:
0: MPI Library = Intel(R) MPI Library 2021.3 for Linux* OS
0:
0: MPI Version = 3.1
26: Abort(1) on node 26 (rank 26 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 26
20: Abort(1) on node 20 (rank 20 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 20
23: Abort(1) on node 23 (rank 23 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 23
25: Abort(1) on node 25 (rank 25 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 25
srun: error: weihuang-whcgepic-00035-1-0001: tasks 20,23,26: Exited with exit code 1
28: Abort(1) on node 28 (rank 28 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 28
22: Abort(1) on node 22 (rank 22 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 22
24: Abort(1) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 24
srun: error: weihuang-whcgepic-00035-1-0001: tasks 25,28: Exited with exit code 1

On Tue, May 14, 2024 at 10:21 AM oehmke wrote:

Hi, It looks like the clock validate function is returning an error, so there may be an invalid value (e.g. a 0 time step) in the clock after it's set. Is there more information above that in the log file? If so, that might give more context about what the precise issue is. Thanks.
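To look for extra context across the PET log files, a small script along these lines might help. It is only a sketch: the line layout (`YYYYMMDD HHMMSS.mmm LEVEL PETn message`) is inferred from the entries quoted in this thread, the function name `errors_by_run` is made up for illustration, and grouping by identical timestamp assumes all errors from one abort share a stamp, as they do in the quoted logs.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the log lines quoted in this thread:
#   YYYYMMDD HHMMSS.mmm LEVEL PETn message
LOG_LINE = re.compile(
    r"^(?P<date>\d{8})\s+(?P<time>\d{6}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<pet>PET\d+)\s+(?P<msg>.*)$"
)


def errors_by_run(lines):
    """Group ERROR messages by their (date, time) stamp.

    The log file is appended to on every run, so entries that share a
    stamp are treated here as belonging to the same failed run.
    """
    runs = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            key = (match.group("date"), match.group("time"))
            runs[key].append(match.group("msg"))
    return dict(runs)


# Sample entries copied from the PET20 log quoted above.
sample = [
    "20240513 235517.133 ERROR PET20 ESMCI_Clock.C:373 ESMCI::Clock::set() "
    "Wrong data value - Internal subroutine call returned Error",
    "20240513 235517.133 ERROR PET20 UFS.F90:373 Wrong data value - Aborting UFS",
    "20240513 235517.133 INFO PET20 Finalizing ESMF",
    "20240514 140351.214 ERROR PET20 ESMF_Clock.F90:1695 ESMF_ClockSet() "
    "Wrong data value - Internal subroutine call returned Error",
]

result = errors_by_run(sample)
for stamp, messages in sorted(result.items()):
    print(stamp, len(messages), "error(s)")
    for message in messages:
        print("   ", message)
```

To run it on a real file, pass the open file object instead of `sample`, for example `errors_by_run(open("PET20.ESMF_LogFile"))`. Comparing the output between the failing and the working account may show whether anything is logged before the `ESMCI::Clock::set()` error.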
