NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

bug: fix bounds fix_bound_violations = .true. seems to be required for ifort #709

Open hkershaw-brown opened 3 months ago

hkershaw-brown commented 3 months ago


Describe the bug

  1. List the steps someone needs to take to reproduce the bug.

/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work

Following https://github.com/NCAR/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh, reported by Ben Gunn (thanks @Benjamin-Gunn!): https://github.com/Benjamin-Gunn/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh

qceff_table_filename = 'one_below_qceff_table.csv'

&filter_nml inf_flavor = 5, 5,

&model_nml
   model_size = 120,
   forcing = 8.0,
   delta_t = 0.05,
   mean_velocity = 0.0,
   pert_velocity_multiplier = 5.0,
   diffusion_coef = 0.0,
   e_folding = 0.25,
   sink_rate = 0.1,
   source_rate = 100.0,
   point_tracer_source_rate = 5.0,
   positive_tracer = .false.,
   bound_above_is_one = .true.,
   time_step_days = 0,
   time_step_seconds = 3600,
   /

  2. What was the expected outcome? I did not expect fix_bound_violations = .true. to be required so often.

  3. What actually happened?
    Failures with "Ensemble member greater than upper bound first check" at various PE counts.

You can set:

&probit_transform_nml fix_bound_violations = .true. /

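However, you still get different answers across MPI task counts. The script below extracts the state_variable_mean value at one location and appends it to test_output for comparison: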

#!/bin/bash

module load nco

rm -f one_var_temp.nc
ncrcat -d location,1,1 filter_output.nc one_var_temp.nc
ncks -V -C -v state_variable_mean one_var_temp.nc | tail -3 | head -1 >> test_output
rm -f  one_var_temp.nc

Varying PE count:

7.95979093017264 ;
8.02126025256388 ;
8.55748257662756 ;

Varying PE count with -fp-model precise:

8.62082489125036 ;
8.62082489125036 ;
8.62082489125036 ;

I am not sure how much difference is acceptable when varying the PE count. Note: I cannot reproduce the bounds violations with -fp-model precise.
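The comparison above can be automated. The following is a minimal Python sketch, assuming each line of test_output begins with a value followed by " ;" as quoted in this issue (the real ncks line may carry other text, in which case the parsing would need adjusting). The helper names `parse_means`, `is_reproducible`, and `max_spread` are hypothetical and are not part of DART.

```python
def parse_means(lines):
    """Extract the leading float from lines such as '8.62082489125036 ;'.

    Assumes the value is the first whitespace-separated token on the line.
    """
    values = []
    for line in lines:
        tokens = line.replace(";", " ").split()
        if tokens:
            values.append(float(tokens[0]))
    return values


def is_reproducible(values):
    """True when every run produced bit-for-bit the same value."""
    return len(set(values)) == 1


def max_spread(values):
    """Largest difference between any two runs."""
    return max(values) - min(values)


# Values quoted in this issue for three PE counts.
default_flags = parse_means([
    "7.95979093017264 ;",
    "8.02126025256388 ;",
    "8.55748257662756 ;",
])
fp_precise = parse_means(["8.62082489125036 ;"] * 3)

print(is_reproducible(default_flags), max_spread(default_flags))
print(is_reproducible(fp_precise), max_spread(fp_precise))
```

In practice the list of lines would come from reading test_output after running the test driver at each PE count.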

To do (@hkershaw): compare intel/2024.0.2, ifx, and gfortran.

Error Message

3 MPI tasks (also happens with 8, 7 (without post_inf), and 40 (without post_inf)):

 PE 0: comp_cov_factor: Standard Gaspari Cohn localization selected
 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000

MPICH ERROR [Rank 0] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 99) - process 0

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000

MPICH ERROR [Rank 1] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 1 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 1

 ERROR FROM:
  source : bnrh_distribution_mod.f90
  routine: bnrh_cdf_initialized
  message:  Ensemble member greater than upper bound first check(see code)   1.00000000000000        1.00000000000000

MPICH ERROR [Rank 2] [job id e35a8d7d-258f-45c5-8d80-ba05433b0be5] [Tue Aug  6 12:24:05 2024] [dec0508] - Abort(99) (rank 2 in comm 496): application called MPI_Abort(comm=0x84000001, 99) - process 2

Here is the code: https://github.com/NCAR/DART/blob/75cf8dc9c566221f624ffd4d5eeba9fde5a1757c/assimilation_code/modules/assimilation/bnrh_distribution_mod.f90#L292-L300
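The error message prints the ensemble member and the bound as the same number, 1.00000000000000, yet the "greater than" check fails. The Python sketch below shows how that can happen: a double one unit in the last place above 1.0 still prints as 1.00000000000000 at 14 decimal places. The value of `member` is an assumption chosen for illustration; the actual value in the failing run is not shown in the log, and this does not reproduce the DART check itself.

```python
upper_bound = 1.0

# Hypothetical ensemble member: the next representable double above 1.0.
member = 1.0 + 2.0 ** -52

# At 14 decimal places both values print identically.
print(f"{member:.14f}   {upper_bound:.14f}")

# The comparison still sees the difference.
print(member > upper_bound)

# Printing with 17 significant digits reveals it.
print(f"{member:.17g}   {upper_bound:.17g}")
```

Printing the difference `member - upper_bound`, or using more digits in the error message, would make this kind of violation easier to diagnose.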

Which model(s) are you working with?

lorenz_96_tracer_advection

/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work

Version of DART

v11.5.1

Have you modified the DART code?

No

Build information

Please describe:

  1. Derecho
  2. ifort (IFORT) 2021.10.0 20230609

hkershaw-brown commented 3 months ago

No bounds failures with module intel/2024.0.2 (ifort (IFORT) 2021.11.1 20231117) without -fp-model precise:

8.83596691025763 ;
8.26235748376639 ;
8.41808494868261 ;

No bounds failures with ifx (intel-oneapi/2024.0.2, ifx (IFX) 2024.0.2 20231213) without -fp-model precise, and the results are the same across core counts:

7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;

jlaucar commented 3 months ago

Helen, I have a strong sense of deja-vu about this. Have we possibly identified things before where fp-precise was required for various intel versions? Is fix_bound_violations needed to get the cases with fp-precise to run successfully? Do the cases that do not duplicate across PE count duplicate when the same PE count is run repeatedly?

Jeff

hkershaw-brown commented 3 months ago

Yup, this is a recurrence of what I was seeing on my old laptop with ifort. It would be cooler if I'd recorded the version on my now-dead laptop. I'm trying to see if I can try an older Intel version on Derecho.

fix_bound_violations does not seem to be needed with -fp-model precise (I haven't got it to fail, yet). The cases that do not duplicate across PE counts do duplicate when the same PE count is run repeatedly.
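The pattern reported here (identical answers for a fixed PE count, different answers across PE counts, and agreement once -fp-model precise is used) is consistent with floating-point sensitivity to the order of operations, although the cause has not been diagnosed in this thread. The Python sketch below only illustrates that general property: floating-point addition is not associative, so grouping the same three numbers differently, as a different task decomposition or a reordering optimization might, changes the last bit. The two-task split is a hypothetical illustration, not DART's reduction.

```python
# Same three numbers, two groupings.
a, b, c = 0.1, 0.2, 0.3

# One task adds everything left to right.
serial = (a + b) + c

# Hypothetical two-task split: the second task adds b and c first,
# then the partial result is combined with a.
split = a + (b + c)

print(f"{serial:.17g}")
print(f"{split:.17g}")
print(serial == split)
```

A difference in the last bit is tiny on its own, but a cycling assimilation can amplify it over many steps, and a value that lands one bit above a bound is enough to trip a strict bounds check.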

hkershaw-brown commented 3 weeks ago

Note on B Gaubert's cam-chem(?) runs. These were done with fix_bound_violations = .true. rather than fix_bound_violations = .false. as originally thought.

So the bounds were enforced by clamping rather than by the probit transform (sd == 0, so you never transform into, or back out of, probit space).
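For readers unfamiliar with the term, clamping here means moving any out-of-bounds ensemble member back onto the nearest bound. The Python sketch below is a generic illustration of that idea, not DART's implementation; the function name `clamp_to_bounds` and its arguments are hypothetical.

```python
def clamp_to_bounds(ensemble, lower=None, upper=None):
    """Return a copy of ensemble with members moved onto violated bounds.

    A bound of None means the ensemble is unbounded on that side.
    """
    clamped = []
    for x in ensemble:
        if lower is not None and x < lower:
            x = lower
        if upper is not None and x > upper:
            x = upper
        clamped.append(x)
    return clamped


# Hypothetical tracer ensemble bounded above by one, with one member
# slightly over the bound.
members = [0.25, 0.9, 1.0 + 2.0 ** -52]
print(clamp_to_bounds(members, upper=1.0))
```

Clamping silently alters the offending members, which hides the violation rather than explaining it.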

/glade/derecho/scratch/hkershaw/DART/CAM-out-of-bounds/Rean_run is using the reanalysis runs #749

hkershaw-brown commented 3 weeks ago

Note I have not separated out varying results across pe counts (QCEFF vs no QCEFF vs what would be expected).