CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!
Apache License 2.0
75 stars 14 forks source link

Several weekly GPU longruns are broken #2934

Closed szy21 closed 1 month ago

szy21 commented 4 months ago

[build]((https://buildkite.com/clima/climaatmos-gpulongruns/builds/253) on 6/14 longrun_aquaplanet_clearsky_1M: unstable after ~17 weeks. t_end was extended to 120 days for this run so it is broken. Status: not fixed longrun_aquaplanet_rhoe_equil_55km_nz63_clearsky_tvinsol_0M_slabocean: stable, but conservation test failed due to the change to SSP. Status: fixed by #3159, which ensures the surface state uses the correct precipitation tendency.

build on 6/14 longrun_aquaplanet_clearsky_1M: unstable after ~5 days. It seems 1M doesn't have a regression test now. Maybe related to #3084? Status: fixed by #3095. Now it runs for longer and becomes unstable eventually, which is a separate issue.

build on 6/7 longrun_bw_rhoe_equil_highres. This is from PR #3074 which removes the reference state. Reordering the u3 tendency to make it similar to before in this commit fixes it. This suggests that we may be close to the instability regime. We will see if it works with SSPKnoth. Status: fixed by switching to SSP longrun_aquaplanet_rhoe_equil_55km_nz63_clearsky_0M_earth. This is from PR #3074 which removes the reference state. Status: not fixed

build on 5/27 longrun_sphere_hydrostatic_balance_rhoe. This is from a change to a higher horizontal and vertical resolution (PR #3028) Status: fixed by switching to SSP longrun_aquaplanet_rhoe_equil_55km_nz63_gray_0M. PR #2855 is the only one that changes MSE this week. Status: fixed by fixing a bug in PR #2855

build on 4/19 longrun_aquaplanet_rhoe_equil_55km_nz63_clearsky_0M_earth_sleve. cc @akshaysridhar . Changing SST to a function of surface height results in behavior changes in this job. Status: not fixed longrun_aquaplanet_clearsky_1M. cc @trontrytel . I don't know why this job has behavior changes. Status: fixed (as in, it works now, I don't know why)

Sbozzolo commented 3 months ago

SSP baroclinic wave also seems consistenly broken: https://buildkite.com/clima/climaatmos-longruns/builds/1572#018f6607-605b-447a-b511-1111a4869090

szy21 commented 3 months ago

SSP baroclinic wave also seems consistenly broken: https://buildkite.com/clima/climaatmos-longruns/builds/1572#018f6607-605b-447a-b511-1111a4869090

Yeah, we have been ignoring this one. Hope we don't need it anymore after SSPKnoth:)

trontrytel commented 3 months ago

Would it make sense to look at all of those cases after SSPKnoth lands?

szy21 commented 3 months ago

Would it make sense to look at all of those cases after SSPKnoth lands?

Yes, I just tagged you to let you know there is an open issue. Also make sure they don't break diagnostic EDMF:)

akshaysridhar commented 3 months ago

We will revisit this after SSPKnoth; I'm adding some fixes to a bycolumn op (in the use of column_mapreduce) in our cached vars to see if I can fix the 2 breaking cases as of Friday's run (these were both working as of April 12)

Sbozzolo commented 2 months ago

I found that PR https://github.com/CliMA/ClimaAtmos.jl/pull/2855 breaks one run that works otherwise with SSPKnoth

szy21 commented 2 months ago

I found that PR #2855 breaks one run that works otherwise with SSPKnoth

Interesting, so that PR breaks two of the working runs, although I see no reason that it should cause physical changes. What is the job that failed?

Sbozzolo commented 2 months ago

I found that PR #2855 breaks one run that works otherwise with SSPKnoth

Interesting, so that PR breaks two of the working runs, although I see no reason that it should cause physical changes. What is the job that failed?

https://buildkite.com/clima/climaatmos-ci/builds/19073#018fe547-837c-4eab-a70f-2691b203a9fa

szy21 commented 2 months ago

Could someone take another look at the PR https://github.com/CliMA/ClimaAtmos.jl/pull/2855 to see if the change in MSE is only from the change in the order of computation? Maybe @juliasloan25?

Sbozzolo commented 2 months ago

I am looking at it right now because it is blocking my work on Rosenbrock.

Sbozzolo commented 2 months ago

I took one integration step (for the job that is failing for me) and compared before and after the patch. The results are very different:

julia> working.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ
ClimaCore.Geometry.AxisTensor{Float32, 2, Tuple{ClimaCore.Geometry.CovariantAxis{(3,)}, ClimaCore.Geometry.CovariantAxis{(1, 2)}}, StaticArraysCore.SMatrix{1, 2, Float32, 2}}-valued Field:
  components: 
    data: 
      1: Float32[-319.072, -378.13, -442.548, -463.268, -337.013, -383.945, -441.563, -477.315, -352.969, -396.443  …  -384.247, -446.425, -296.626, -348.484, -427.235, -464.666, -319.072, -378.13, -442.548, -463.268]
      2: Float32[-28.7554, -25.6094, -20.8332, -12.6396, -32.5333, -30.2907, -28.3866, -19.9347, -42.1406, -43.148  …  -392.587, -418.496, -344.003, -365.153, -395.624, -399.4, -347.827, -377.507, -388.069, -371.485]

julia> greg.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ
ClimaCore.Geometry.AxisTensor{Float32, 2, Tuple{ClimaCore.Geometry.CovariantAxis{(3,)}, ClimaCore.Geometry.CovariantAxis{(1, 2)}}, StaticArraysCore.SMatrix{1, 2, Float32, 2}}-valued Field:
  components: 
    data: 
      1: Float32[-695.62, -742.976, -759.328, -734.481, -681.863, -710.018, -727.113, -730.004, -653.286, -684.591  …  -596.337, -668.212, -443.825, -503.773, -595.453, -632.766, -444.601, -514.036, -581.144, -592.805]
      2: Float32[347.81, 345.716, 315.054, 284.463, 306.303, 295.782, 269.1, 252.041, 241.099, 233.439  …  -180.496, -189.359, -199.373, -209.864, -220.374, -218.469, -222.301, -239.188, -241.123, -229.592]
szy21 commented 2 months ago

I suggest that we revert it, unless @akshaysridhar or @glwagner understand why there is a substantial change in surface fluxes from that PR?

szy21 commented 2 months ago

@Sbozzolo Could you check to see if ρ_flux_h_tot and ρ_flux_q_tot are very different as well?

Sbozzolo commented 2 months ago

Steps to reproduce:

First, checkout commit before/after change, e.g.

git checkout b51150ee4 

Then

import ClimaAtmos as CA
import SciMLBase
working = CA.get_simulation(CA.AtmosConfig("config/mpi_configs/mpi_sphere_aquaplanet_rhoe_equilmoist_clearsky.yml"))
SciMLBase.step!(working.integrator)
println(working.integrator.p.precomputed.sfc_conditions.ρ_flux_uₕ)
Sbozzolo commented 2 months ago

@Sbozzolo Could you check to see if ρ_flux_h_tot and ρ_flux_q_tot are very different as well?

After the first step, they are the same.

Sbozzolo commented 2 months ago

I suggest that we revert it, unless @akshaysridhar or @glwagner understand why there is a substantial change in surface fluxes from that PR?

(Maybe @dennisYatunin too)

szy21 commented 2 months ago

wait I think there is a bug in that PR. https://github.com/CliMA/ClimaAtmos.jl/blob/bb9c010f90ccf6acc9f5e3cd0480a6e273be764d/src/surface_conditions/surface_conditions.jl#L421 here is should be f = C12(f₁₃ * xz + f₂₃ * yz, L)?

szy21 commented 2 months ago

@Sbozzolo could you try if that fixes the issue? (And we should somehow prevent this from happening in the future...)

akshaysridhar commented 2 months ago

wait I think there is a bug in that PR.

https://github.com/CliMA/ClimaAtmos.jl/blob/bb9c010f90ccf6acc9f5e3cd0480a6e273be764d/src/surface_conditions/surface_conditions.jl#L421 here is should be f = C12(f₁₃ * xz + f₂₃ * yz, L)?

Yes, this is a bug

Sbozzolo commented 2 months ago

Will try this now.

Sbozzolo commented 2 months ago

Yes, that works!

Sbozzolo commented 2 months ago

Opening a PR now

szy21 commented 1 month ago

The longrun experiments have been cleaned up and most of these are irrelevant now. Closing the issue for now.