CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!

Random MSE failure in buildkite #2212

Closed: szy21 closed this issue 8 months ago

szy21 commented 1 year ago

There have been at least two random MSE failures recently, see these builds: https://buildkite.com/clima/climaatmos-ci/builds/13617 https://buildkite.com/clima/climaatmos-ci/builds/13702

@charleskawczynski Not sure if we need to worry about them?

szy21 commented 1 year ago

Another one: https://buildkite.com/clima/climaatmos-ci/builds/13803

charleskawczynski commented 1 year ago

Yeah, I'm not sure what's going on. I'd say it's an issue with the gravity wave (GW) parameterization, but 💻 SSP zalesak tracer & energy upwind baroclinic wave (ρe_tot) equilmoist doesn't have GW in it.

szy21 commented 1 year ago

Right, I was confused about this too. Let's see if there are more occurrences; maybe they will give us some clues.

charleskawczynski commented 1 year ago

If it's starting to cause CI trouble, we can make those jobs opt out of the MSE tests.

szy21 commented 1 year ago

https://buildkite.com/clima/climaatmos-ci/builds/13841

LenkaNovak commented 1 year ago

Random coupler nondeterministic break (https://buildkite.com/clima/climacoupler-ci/builds/1917#018b2ad3-4642-401d-9cb7-0f56c95ea777)

szy21 commented 1 year ago

Random coupler nondeterministic break (https://buildkite.com/clima/climacoupler-ci/builds/1917#018b2ad3-4642-401d-9cb7-0f56c95ea777)

Do you know if this case failed immediately? The input (e_int=-1.2040901047376792e35) looks like something not initialized.

simonbyrne commented 1 year ago

Trying to see if I can reproduce the failures here: https://buildkite.com/clima/climaatmos-ci/builds/13918

Sbozzolo commented 1 year ago

I checked the day0.0.h5 file against a trusted solution and found this:

dataset: </fields/Y/c> and </fields/Y/c>
22115 differences found
dataset: </fields/diagnostics/kinetic_energy> and </fields/diagnostics/kinetic_energy>
4408 differences found
dataset: </fields/diagnostics/potential_temperature> and </fields/diagnostics/potential_temperature>
1 differences found
dataset: </fields/diagnostics/pressure> and </fields/diagnostics/pressure>
1 differences found
dataset: </fields/diagnostics/relative_humidity> and </fields/diagnostics/relative_humidity>
1 differences found
dataset: </fields/diagnostics/sfc_evaporation> and </fields/diagnostics/sfc_evaporation>
300 differences found
dataset: </fields/diagnostics/sfc_flux_energy> and </fields/diagnostics/sfc_flux_energy>
290 differences found
dataset: </fields/diagnostics/sfc_flux_u> and </fields/diagnostics/sfc_flux_u>
545 differences found
dataset: </fields/diagnostics/sfc_flux_v> and </fields/diagnostics/sfc_flux_v>
2060 differences found
dataset: </fields/diagnostics/specific_enthalpy> and </fields/diagnostics/specific_enthalpy>
3 differences found
dataset: </fields/diagnostics/temperature> and </fields/diagnostics/temperature>
1 differences found
dataset: </fields/diagnostics/u_velocity> and </fields/diagnostics/u_velocity>
2852 differences found
dataset: </fields/diagnostics/v_velocity> and </fields/diagnostics/v_velocity>
19730 differences found
dataset: </fields/diagnostics/vorticity> and </fields/diagnostics/vorticity>
17124 differences found

Therefore, I think we can conclude that there's something wrong already at initialization.
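
For reference, a roughly equivalent check can be done in Julia with HDF5.jl (the file names below are placeholders for the failing and passing day0.0 outputs):

```julia
using HDF5

# Placeholder file names standing in for the failing and passing day0.0 outputs.
ref_file = h5open("day0.0_passing.hdf5", "r")
bad_file = h5open("day0.0_failing.hdf5", "r")

for path in ("fields/Y/c", "fields/diagnostics/kinetic_energy", "fields/diagnostics/vorticity")
    a = read(ref_file[path])
    b = read(bad_file[path])
    println(path, ": ", count(a .!= b), " differences found")
end

close(ref_file)
close(bad_file)
```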

Sbozzolo commented 1 year ago

More details are attached: diffY.txt

Pretty much every point is wrong.

szy21 commented 1 year ago

Which two files are you comparing?

Sbozzolo commented 1 year ago

Which two files are you comparing?

Ah, right.

I am comparing the output of SSP zalesak tracer & energy from your first link with the next available passing build (https://buildkite.com/clima/climaatmos-ci/builds/13620). I also checked against the latest passing master branch (which is identical to build 13620).

szy21 commented 1 year ago

Good to know. That's interesting, thanks!

simonbyrne commented 1 year ago

Looks like it is node-specific: on https://buildkite.com/clima/climaatmos-ci/builds/13918, I was able to reproduce the earlier failures by requesting the same nodes. My guess would be a slightly different microarchitecture, which results in different floating-point operations being used.

simonbyrne commented 1 year ago

Yes, it looks like those two nodes are Broadwell, whereas we would usually run on the newer Skylake/Icelake ones.

szy21 commented 1 year ago

Yes, it looks like those two nodes are Broadwell, whereas we would usually run on the newer Skylake/Icelake ones.

Should we specify a CPU target in CI?

simonbyrne commented 1 year ago

We should figure out what is going on, and set our error tolerances so that they do not depend on the vagaries of floating-point computation.
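
As a rough illustration only (hypothetical data and threshold, not the actual CI harness), the idea would be something like:

```julia
using Test

# Hypothetical reference data and tolerance: accept tiny architecture-dependent
# rounding differences instead of requiring the MSE to be exactly zero.
reference = rand(100)
solution  = reference .+ 1e-12 .* randn(100)   # stand-in for a run on a different node type

mse = sum(abs2, solution .- reference) / length(reference)
@test mse <= 1e-20   # loose enough for rounding noise, tight enough to catch real regressions
```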

LenkaNovak commented 1 year ago

Random coupler nondeterministic break (https://buildkite.com/clima/climacoupler-ci/builds/1917#018b2ad3-4642-401d-9cb7-0f56c95ea777)

Do you know if this case failed immediately? The input (e_int=-1.2040901047376792e35) looks like something not initialized.

The coupling loop had started, meaning the model initialized and stepped at least once without crashing. Not sure about the exact sim time though. 🤔

szy21 commented 1 year ago

The coupling loop had started, meaning the model initialized and stepped at least once without crashing. Not sure about the exact sim time though. 🤔

It seems strange that floating point error would cause a simulation to crash with such unphysical values (all the atmos simulations only have MSE changes), but maybe you are just close to the unstable regime.

LenkaNovak commented 1 year ago

The coupling loop had started, meaning the model initialized and stepped at least once without crashing. Not sure about the exact sim time though. 🤔

It seems strange that floating point error would cause a simulation to crash with such unphysical values (all the atmos simulations only have MSE changes), but maybe you are just close to the unstable regime.

Right. We have retained some coarse-resolution benchmarks for tracking behavior changes, like this heisenbug 🐛

simonbyrne commented 1 year ago

@Sbozzolo tracked down the difference to the metric terms. I'll try to dig into it.

simonbyrne commented 1 year ago

From what I can tell, the ultimate cause appears to be some of the small matrix multiplications when it computes the metric terms: each architecture uses a slightly different combination of fused-multiply-add (FMA) operations which gives slightly different results.
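
As a minimal, self-contained illustration of that mechanism (not the actual ClimaCore computation): whether a multiply and an add are fused changes the rounding, so code paths that fuse differently can disagree in the last bits.

```julia
# One rounding (fma) vs two roundings (separate * and +) give different results.
a = 1.0 + 2.0^-29
b = 1.0 - 2.0^-29

a * b - 1.0          # 0.0: a*b rounds to 1.0 before the subtraction
fma(a, b, -1.0)      # -3.47e-18: the exact product participates, rounded only once
muladd(a, b, -1.0)   # may give either result, depending on what the compiler/hardware choose
```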

In general, the fixes should be both:

Sbozzolo commented 1 year ago

From what I can tell, the ultimate cause appears to be some of the small matrix multiplications when it computes the metric terms: each architecture uses a slightly different combination of fused-multiply-add (FMA) operations which gives slightly different results.

My gut feeling is that there's more to the problem than this: 1) this problem has appeared only recently; 2) the error in some terms can be significant; 3) the geometry produced on Skylake and on my laptop (Intel 13th gen, 7 generations later) is bitwise identical, but both differ from the Broadwell case. (Note that Skylake has AVX-512, while Broadwell only has AVX2.)

Maybe we should check:

If the error is indeed purely a floating-point issue, we can also consider compensated summation techniques to avoid catastrophic cancellation.
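
For example, a small sketch of compensated (Kahan–Babuška/Neumaier) summation, one such technique (not ClimaCore code):

```julia
# Compensated summation: track the low-order bits lost at each step and add them back.
function compensated_sum(xs)
    s = zero(eltype(xs))
    c = zero(eltype(xs))          # running compensation
    for x in xs
        t = s + x
        if abs(s) >= abs(x)
            c += (s - t) + x      # low-order bits of x were lost
        else
            c += (x - t) + s      # low-order bits of s were lost
        end
        s = t
    end
    return s + c
end

xs = [1.0, 1e100, 1.0, -1e100]
foldl(+, xs)          # 0.0 with plain left-to-right summation
compensated_sum(xs)   # 2.0
```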

szy21 commented 1 year ago

On that note, I did get different results when running ClimaAtmos on my local computer and on buildkite, at least for some cases.

simonbyrne commented 1 year ago

Is the difference physically meaningful? I thought we're only seeing it now because we've apparently switched to testing for exact results rather than allowing tolerances. Or is it genuinely new?

If the error is indeed purely a floating-point issue, we can also consider compensated summation techniques to avoid catastrophic cancellation.

Hence my suggestion of computing the metrics in higher precision (which we only need to do at initialization).
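
As a sketch only (the helper and its inputs below are hypothetical, not the ClimaCore API): do the small matrix products in extended precision at initialization and round once at the end, so every architecture stores the same Float64 bits.

```julia
# Hypothetical stand-in for the metric-term computation: widen to BigFloat,
# do the matrix products in software arithmetic (deterministic across CPUs),
# then round the result back to Float64 exactly once.
function metric_terms_highprec(jacobian::AbstractMatrix{Float64})
    J = BigFloat.(jacobian)
    g = J' * J                 # stand-in for the small matrix multiplications
    return Float64.(g)
end

metric_terms_highprec(rand(3, 3))
```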

szy21 commented 1 year ago

We have always been testing for exact results through regression tests and MSE.

simonbyrne commented 1 year ago

Well, we did switch a while ago: https://github.com/CliMA/ClimaAtmos.jl/commit/b3a6ce8aacb2a25700678fc28c8973f282c086f5#diff-67efda79cf46ae8a556ad5861439de31e3b518f94b6695fe6d7f128cfc7ea890, but that was 6 months ago.

I guess we could bisect to see if it is a recent change.

szy21 commented 1 year ago

Well, we did switch a while ago: b3a6ce8#diff-67efda79cf46ae8a556ad5861439de31e3b518f94b6695fe6d7f128cfc7ea890, but that was 6 months ago.

I guess we could bisect to see if it is a recent change.

Ah, OK. I guess we were not doing this consistently before: some people would update the MSE values, while others would set them to zero. Only recently have we consistently set the MSE to zero.

Sbozzolo commented 1 year ago

On the cluster, with Broadwell, Julia 1.8.5, and the attached Manifest, the example above produces values bitwise identical to those from the other nodes.

I am checking what happens with Julia 1.9.3.

Manifest.toml.txt

Sbozzolo commented 1 year ago

On 1.9.3, with the same manifest file, the output HDF5 files are no longer identical. However, I saved a bunch of relevant quantities (Y, geometry) with JLD2 and found that they are still bitwise identical.

I also printed some values to STDOUT to check whether they are the same; a screenshot of the diff is attached.

So, this seems to affect println and HDF5 but not Julia itself. Maybe something has changed for floating-point numbers in Julia 1.9/LLVM 14 on this particular architecture, but it doesn't seem to affect anything important (at least at initialization).

simonbyrne commented 1 year ago

@Sbozzolo can you explain a bit more about what is different?

Sbozzolo commented 1 year ago

@Sbozzolo can you explain a bit more about what is different?

For Julia 1.8.5 on a Broadwell node and Julia 1.9.3 on Skylake/13th-gen Intel, everything is bitwise identical, where "everything" is:

  1. parent(integrator.u)
  2. the day0.0.hdf5 file
  3. parent(axes(integrator.u).center_local_geometry)
  4. values printed with write in the initial_condition function: open("new_extrema", "a") do file; write(file, "$z $lat $long $p $T $u $v\n"); end

For Julia 1.9.3 on Broadwell, items 1 and 3 are still identical, but items 2 and 4 are not, as shown by the screenshot above (ediff) and by h5diff on the day0.0.hdf5 file. I checked items 1 and 3 by saving the values with JLD2 on the cluster and loading them locally.
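
For reference, the JLD2 round trip looks roughly like this (file names and the saved field are placeholders):

```julia
using JLD2

# Placeholder data standing in for parent(integrator.u) on each node.
jldsave("state_skylake.jld2"; Y = zeros(4, 4))
jldsave("state_broadwell.jld2"; Y = zeros(4, 4))

a = load("state_skylake.jld2", "Y")
b = load("state_broadwell.jld2", "Y")
println(a == b ? "bitwise identical" : "$(count(a .!= b)) differing entries")
```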