Open issue, opened by hkershaw-brown 3 months ago
No bounds failures with module intel/2024.0.2 (ifort (IFORT) 2021.11.1 20231117) without -fp-model precise, but the answers differ across core counts:
8.83596691025763 ;
8.26235748376639 ;
8.41808494868261 ;
No bounds failures with ifx (module intel-oneapi/2024.0.2, ifx (IFX) 2024.0.2 20231213) without -fp-model precise; the answers are the same across core counts:
7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
7.67172341333618 ;
Helen, I have a strong sense of déjà vu about this. Have we previously found cases where -fp-model precise was required for various Intel versions? Is fix_bound_violations needed to get the cases with -fp-model precise to run successfully? Do the cases that do not reproduce across PE counts reproduce when the same PE count is run repeatedly?
Jeff
(Replying to Helen Kershaw's comment of Tue, Aug 6, 2024 at 1:27 PM.)
Yup, this is a recurrence of what I was seeing on my old laptop with ifort. It would have been better if I'd recorded the compiler version on my now-dead laptop. I'm trying to see if I can try an older Intel version on Derecho.
fix_bound_violations does not seem to be needed with -fp-model precise (I haven't got it to fail, yet). The cases that do not reproduce across PE counts do reproduce when rerun with the same PE count.
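The pattern described here (bit-for-bit repeatable at a fixed PE count, different across PE counts) is what you would expect if the order of some floating-point sum depends on the task count. That cause is an assumption, not something diagnosed in this thread. The sketch below uses a hypothetical `partitioned_sum` helper that mimics a sum split into contiguous per-task chunks: because floating-point addition is not associative, the same three numbers give different bits for different task counts, while each task count on its own is repeatable. Intel's documentation describes -fp-model precise as disabling optimizations that are not value-safe (such as reassociation), which would be consistent with the identical answers seen with that flag.

```python
# Hypothetical illustration, not DART code: a sum split across `ntasks`
# "tasks", each summing a contiguous chunk, followed by a sum of the
# partial results. It only shows why summation order matters.

def partitioned_sum(values, ntasks):
    """Sum `values` as if split into `ntasks` contiguous chunks."""
    n = len(values)
    # Chunk boundaries i*n//ntasks, so every value lands in exactly one chunk.
    bounds = [i * n // ntasks for i in range(ntasks + 1)]
    partials = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        s = 0.0
        for v in values[lo:hi]:
            s += v
        partials.append(s)
    # Combine the per-task partial sums in task order.
    total = 0.0
    for p in partials:
        total += p
    return total


values = [0.1, 0.2, 0.3]
for ntasks in (1, 2, 3):
    print(ntasks, repr(partitioned_sum(values, ntasks)))
# prints:
# 1 0.6000000000000001
# 2 0.6
# 3 0.6000000000000001
```

Running the same `ntasks` twice always gives the same answer; only changing the split changes the bits.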
Note on B. Gaubert's CAM-chem(?) runs: these were done with fix_bound_violations = .true., not fix_bound_violations = .false. as originally thought.
So the bounds were enforced by clamping rather than by the probit transform (sd == 0, so you never transform into, or back out of, probit space).
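A minimal sketch of the two behaviours being contrasted, assuming (as the note describes) that fix_bound_violations = .true. clamps out-of-range ensemble members to the bound instead of stopping with an error. The function name `enforce_bounds`, its signature, and its error messages are hypothetical and do not correspond to DART's Fortran routines.

```python
# Hypothetical sketch, not DART's implementation: either stop on an
# out-of-bounds ensemble member or clamp it to the bound.

def enforce_bounds(ens, lower, upper, fix_bound_violations):
    """Return ensemble values within [lower, upper].

    `lower` or `upper` may be None for an unbounded side.
    """
    fixed = []
    for x in ens:
        if upper is not None and x > upper:
            if not fix_bound_violations:
                raise ValueError(f"member {x!r} above upper bound {upper!r}")
            x = upper  # clamp to the upper bound
        if lower is not None and x < lower:
            if not fix_bound_violations:
                raise ValueError(f"member {x!r} below lower bound {lower!r}")
            x = lower  # clamp to the lower bound
        fixed.append(x)
    return fixed


# bound_above_is_one = .true. in the tracer model presumably means an upper
# bound of 1.0; a member that lands a hair above it gets clamped.
print(enforce_bounds([0.25, 1.0000000000000002, 0.9], None, 1.0, True))
# prints: [0.25, 1.0, 0.9]
```

With `fix_bound_violations` set to `False`, the same input raises instead, which is the analogue of the run stopping on a bounds check.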
/glade/derecho/scratch/hkershaw/DART/CAM-out-of-bounds/Rean_run is using the reanalysis runs #749
Note: I have not separated out the variation in results across PE counts (QCEFF vs. no QCEFF vs. what would be expected).
Describe the bug
Work directory: /glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work
Following https://github.com/NCAR/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh, reported by Ben Gunn (thanks @Benjamin-Gunn!): https://github.com/Benjamin-Gunn/DART/blob/l96_tracer_tests/models/lorenz_96_tracer_advection/work/TESTS/TEST_DRIVER.csh
qceff_table_filename = 'one_below_qceff_table.csv'
&filter_nml
   inf_flavor = 5, 5,
&model_nml
   model_size = 120,
   forcing = 8.0,
   delta_t = 0.05,
   mean_velocity = 0.0,
   pert_velocity_multiplier = 5.0,
   diffusion_coef = 0.0,
   e_folding = 0.25,
   sink_rate = 0.1,
   source_rate = 100.0,
   point_tracer_source_rate = 5.0,
   positive_tracer = .false.,
   bound_above_is_one = .true.,
   time_step_days = 0,
   time_step_seconds = 3600,
   /
What was the expected outcome?
fix_bound_violations = .true. was not expected to be required so often.
What actually happened?
Failures with "Ensemble member greater than upper bound first check" at various PE counts.
You can set:
&probit_transform_nml fix_bound_violations = .true. /
However, you still get different answers across MPI task counts.
Varying PE count: 7.95979093017264 ; 8.02126025256388 ; 8.55748257662756 ;
Varying PE count with -fp-model precise: 8.62082489125036 ; 8.62082489125036 ; 8.62082489125036 ;
I'm not sure how much difference across PE counts is acceptable. Note: I cannot reproduce the bounds violations with -fp-model precise.
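To put a number on how different the varying-PE-count answers are, here is a small worked calculation on the values reported above (standard-library Python). It only quantifies the spread; what tolerance is acceptable remains the open question. The helper name `spread` is illustrative.

```python
# Values reported in this issue for varying PE counts without
# -fp-model precise, and with -fp-model precise.
default_fp = [7.95979093017264, 8.02126025256388, 8.55748257662756]
precise_fp = [8.62082489125036, 8.62082489125036, 8.62082489125036]


def spread(values):
    """Return (max - min, (max - min) / mean) for a list of results."""
    rng = max(values) - min(values)
    mean = sum(values) / len(values)
    return rng, rng / mean


for label, vals in (("default", default_fp), ("precise", precise_fp)):
    rng, rel = spread(vals)
    print(f"{label}: range = {rng:.6f}, relative range = {rel:.2%}")
```

The default build spans about 0.60, roughly 7% of the mean, while the -fp-model precise runs have zero spread.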
To do (@hkershaw): compare intel/2024.0.2, ifx, and gfortran.
Error Message
3 MPI tasks (also happens with 8, 7 (without post_inf), and 40 (without post_inf)):
Here is the code: https://github.com/NCAR/DART/blob/75cf8dc9c566221f624ffd4d5eeba9fde5a1757c/assimilation_code/modules/assimilation/bnrh_distribution_mod.f90#L292-L300
Which model(s) are you working with?
lorenz_96_tracer_advection
/glade/derecho/scratch/hkershaw/DART/Bugs/bgunn_qceff/DART/models/lorenz_96_tracer_advection/work
Version of DART
v11.5.1
Have you modified the DART code?
No
Build information
Please describe: