Increase ≈ tolerance to account for differences across GPU microarchitectures

francispoulin commented 3 years ago

After the big changes yesterday I decided to run the tests to make sure everything was working. Thanks again @ali-ramadhan for helping me get that started.

Now I'm finding that there are some failures and some things that are broken. See below. Is it just me, or do others get this now?

Test Summary:                            | Pass  Fail  Broken  Total
Oceananigans                             | 2987     8       5   3000
  Unit tests                             | 1511             1   1512
  Model and time stepping tests (part 1) |   99                   99
  Model and time stepping tests (part 2) |  214             1    215
  Simulation tests                       | 1142     2       3   1147
    Simulations                          |   26                   26
    Diagnostics                          |   12                   12
    Output writers                       |  409     2            411
      FieldSlicer                        |    1                    1
      WindowedTimeAverage                |    2                    2
      NetCDF [GPU]                       |  198                  198
      JLD2 [GPU]                         |   11                   11
      Checkpointer [GPU]                 |  166     2            168
      Dependency adding [GPU]            |    2                    2
      Time averaging of output [GPU]     |   29                   29
    Abstract operations                  |  695             3    698
  Regression                             |   14     6             20
    Thermal bubble [GPU]                 |    5                    5
    Rayleigh–Bénard tracer [GPU]         |    5                    5
    Ocean large eddy simulation [GPU]    |    4     6             10
  Scripts                                |    7                    7
ERROR: LoadError: Some tests did not pass: 2987 passed, 8 failed, 0 errored, 5 broken.
in expression starting at /home/fpoulin/software/Oceananigans.jl/test/runtests.jl:77
ERROR: Package Oceananigans errored during testing

glwagner commented 3 years ago

> After the big changes yesterday I decided to run the tests to make sure everything was working

Are you running the branch used in PR #1174? The tests don't pass over there; there's still some work to do. But we should discuss what needs to be done to get the tests to pass on that PR (and then merge it into master once they do).

francispoulin commented 3 years ago

I thought I was, but I am using master, so I guess not.

ali-ramadhan commented 3 years ago

@francispoulin Hope you don't mind that I formatted your original post to put triple backticks (```) around the test results to make them easier to read.

If those tests failed on master, would be good to investigate why... Can you post the full output from running the tests?

Not sure why checkpointer would fail, but I could see some of the regression tests failing if your GPU is very different maybe. Do you know what GPU you used to run the tests with? Might be good for Oceananigans to print this info as part of runtests.jl.
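
A minimal sketch of what printing the GPU info at the start of a test run might look like. It only shells out to `nvidia-smi` using Julia's standard library, so it does not assume any particular CUDA package API. The helper name `describe_gpus` is hypothetical and not part of Oceananigans, and where such a call would go in runtests.jl is left open.

```julia
# Hypothetical helper (not part of Oceananigans): return a short description
# of the NVIDIA GPU(s) visible on this machine, or a fallback message.
function describe_gpus()
    # Sys.which returns `nothing` when the executable is not on PATH.
    if Sys.which("nvidia-smi") === nothing
        return "nvidia-smi not found; no NVIDIA GPU information available"
    end

    # Ask nvidia-smi for one CSV line per GPU: name, driver version, total memory.
    query = `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`

    try
        return String(strip(read(query, String)))
    catch err
        # read throws if nvidia-smi exits with a nonzero status (e.g. no driver loaded).
        return "nvidia-smi failed: $(err)"
    end
end

@info "GPU(s) detected for this test run: $(describe_gpus())"
```

Logging this once before the test sets start would make reports like the one above easier to interpret, since the hardware would appear next to the failures.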

francispoulin commented 3 years ago

Thanks for cleaning it up, @ali-ramadhan, and I will try to remember to do this in the future. Looks much better.

The output is pretty long, as you know, but I will copy the highlights below. If you do want to know all the details I can rerun it and probably figure out how to output the result to a file and then maybe include that.

Checkpointer [GPU]: Test Failed at /home/fpoulin/software/Oceananigans.jl/test/test_output_writers.jl:565
  Expression: all(test_model.velocities.u.data .≈ true_model.velocities.u.data)
Stacktrace:
 [1] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_output_writers.jl:565 [inlined]
 [2] macro expansion at /home/fpoulin/.julia/packages/GPUArrays/ZxsKE/src/host/indexing.jl:64 [inlined]
 [3] test_model_equality(::IncompressibleModel{QuasiAdamsBashforth2TimeStepper{Float64,NamedTuple{(:u, :v, :w, :T, :S),Tuple{Field{Face,Cell,Cell,OffsetArray{Float64,3,CuArray{Float64,3}},Re

Checkpointer [GPU]: Test Failed at /home/fpoulin/software/Oceananigans.jl/test/test_output_writers.jl:565
  Expression: all(test_model.velocities.u.data .≈ true_model.velocities.u.data)
Stacktrace:
 [1] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_output_writers.jl:565 [inlined]
 [2] macro expansion at /home/fpoulin/.julia/packages/GPUArrays/ZxsKE/src/host/indexing.jl:64 [inlined]
 [3] test_model_equality(::IncompressibleModel{QuasiAdamsBashforth2TimeStepper{Float64,NamedTuple{(:u, :v, :w, :T, :S),Tuple{Field{Face,Cell,Cell,OffsetArray{Float64,3,CuArray{Float64,3}},Re

Ocean large eddy simulation [GPU]: Test Failed at /home/fpoulin/software/Oceananigans.jl/test/regression_tests/ocean_large_eddy_simulation_regression_test.jl:122
  Expression: all(test_fields.u .≈ correct_fields.u)
Stacktrace:
 [1] run_ocean_large_eddy_simulation_regression_test(::GPU, ::VerstappenAnisotropicMinimumDissipation{Float64,Float64,Float64,Float64}) at /home/fpoulin/software/Oceananigans.jl/test/regression_tests/ocean_large_eddy_simulation_regression_test.jl:122
 [2] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:77 [inlined]
 [3] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [4] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:74 [inlined]
 [5] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [6] top-level scope at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:60
Ocean large eddy simulation [GPU]: Test Failed at /home/fpoulin/software/Oceananigans.jl/test/regression_tests/ocean_large_eddy_simulation_regression_test.jl:123
  Expression: all(test_fields.v .≈ correct_fields.v)
Stacktrace:
 [1] run_ocean_large_eddy_simulation_regression_test(::GPU, ::VerstappenAnisotropicMinimumDissipation{Float64,Float64,Float64,Float64}) at /home/fpoulin/software/Oceananigans.jl/test/regression_tests/ocean_large_eddy_simulation_regression_test.jl:123
 [2] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:77 [inlined]
 [3] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [4] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:74 [inlined]
 [5] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [6] top-level scope at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:60
Ocean large eddy simulation [GPU]: Test Failed at /home/fpoulin/software/Oceananigans.jl/test/regression_tests/ocean_large_eddy_simulation_regression_test.jl:124
  Expression: all(test_fields.w .≈ correct_fields.w)
Stacktrace:
 [1] run_ocean_large_eddy_simulation_regression_test(::GPU, ::VerstappenAnisotropicMinimumDissipation{Float64,Float64,Float64,Float64}) at /home/fpoulin/software/Oceananigans.jl/test/regression_tests/ocean_large_eddy_simulation_regression_test.jl:124
 [2] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:77 [inlined]
 [3] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [4] macro expansion at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:74 [inlined]
 [5] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
 [6] top-level scope at /home/fpoulin/software/Oceananigans.jl/test/test_regression.jl:60
[2020/11/13 11:12:48.923] INFO    Testing oceanic large eddy simulation regression [SmagorinskyLilly, GPU]
[2020/11/13 11:13:08.185] INFO  Δu: min=-5.669004e-10, max=+5.237555e-10, mean=-4.579272e-20, absmean=+3.847342e-12, std=+2.582632e-11 (4064/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  Δv: min=-5.248542e-10, max=+4.306961e-10, mean=-4.446923e-20, absmean=+3.446188e-12, std=+2.026341e-11 (4081/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  Δw: min=-8.810476e-10, max=+3.828646e-10, mean=-1.673779e-20, absmean=+2.695421e-12, std=+2.003712e-11 (3987/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  ΔT: min=-3.171294e-10, max=+1.584823e-09, mean=+1.933831e-12, absmean=+3.283801e-12, std=+4.237465e-11 (4096/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  ΔS: min=-5.826450e-13, max=+5.613288e-13, mean=-3.816392e-17, absmean=+3.587408e-15, std=+1.934380e-14 (4096/4096 matching grid points)
Ocean large eddy simulation [GPU]: Test Failed at /home/fpoulin/software/Oceananigans.jl/test/regression_tests/ocean_large_eddy_simulation_regression_test.jl:122
  Expression: all(test_fields.u .≈ correct_fields.u)
Stacktrace:

ali-ramadhan commented 3 years ago

Ah nice, it doesn't look like there's much cause for concern, as the regression test "mostly passes" (I copy-pasted the relevant portion below).

Almost all the grid points match, except for a few percent of the velocity grid points. So I'm guessing this is just the result of different GPUs doing calculations slightly differently? Tiny differences could accumulate over the 100 iterations of the regression test.

I suspect this is the same reason why 2/166 checkpointer tests fail.

Maybe we need to add an absolute tolerance to these tests when checking for ≈?

[2020/11/13 11:13:08.185] INFO  Δu: min=-5.669004e-10, max=+5.237555e-10, mean=-4.579272e-20, absmean=+3.847342e-12, std=+2.582632e-11 (4064/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  Δv: min=-5.248542e-10, max=+4.306961e-10, mean=-4.446923e-20, absmean=+3.446188e-12, std=+2.026341e-11 (4081/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  Δw: min=-8.810476e-10, max=+3.828646e-10, mean=-1.673779e-20, absmean=+2.695421e-12, std=+2.003712e-11 (3987/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  ΔT: min=-3.171294e-10, max=+1.584823e-09, mean=+1.933831e-12, absmean=+3.283801e-12, std=+4.237465e-11 (4096/4096 matching grid points)
[2020/11/13 11:13:08.185] INFO  ΔS: min=-5.826450e-13, max=+5.613288e-13, mean=-3.816392e-17, absmean=+3.587408e-15, std=+1.934380e-14 (4096/4096 matching grid points)
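
To illustrate the absolute-tolerance idea: Julia's `≈` (`isapprox`) uses a relative tolerance of `√eps` and an absolute tolerance of zero by default, so values near zero that differ by around 1e-10 fail the comparison even though the difference is tiny. The sketch below uses made-up data, a hypothetical helper `compare_fields`, and an illustrative `atol=1e-8` (the largest difference reported above is about 1.6e-9). It is not the actual Oceananigans test code.

```julia
# Hypothetical helper (not from the Oceananigans test suite): compare two arrays
# elementwise with an absolute tolerance and report how many points match.
function compare_fields(test, correct; atol=1e-8)
    # When atol > 0 is given, isapprox's default rtol becomes 0,
    # so this checks |test - correct| <= atol at every point.
    matches = isapprox.(test, correct; atol=atol)
    return (all_match=all(matches), n_matching=count(matches), n_total=length(matches))
end

# Made-up "reference" values and a perturbed copy with differences of order 1e-10,
# similar in size to the Δu, Δv, Δw values in the log.
correct = [0.0, 3.0e-9, 1.0, -2.5e-9]
test    = correct .+ [5.0e-10, -5.0e-10, 1.0e-12, 8.0e-10]

# Default comparison: relative tolerance only, so the near-zero entries do not match.
default_matches = test .≈ correct
println("default ≈: ", count(default_matches), "/", length(default_matches), " matching points")

# Comparison with an absolute tolerance.
result = compare_fields(test, correct; atol=1e-8)
println("atol=1e-8: ", result.n_matching, "/", result.n_total, " matching points")
```

The right value of `atol` depends on the magnitude of the fields being compared, so it would probably need to be chosen per test rather than globally.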

ali-ramadhan commented 3 years ago

@francispoulin Out of curiosity, if you run the nvidia-smi command at the terminal, does it tell you which GPU(s) you have?

francispoulin commented 3 years ago

Sure thing. See below.

$ nvidia-smi
Fri Nov 13 13:40:39 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P400         Off  | 00000000:65:00.0  On |                  N/A |
| 34%   39C    P0    N/A /  N/A |    755MiB /  1977MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1880      G   /usr/lib/xorg/Xorg                           428MiB |
|    0      2040      G   /usr/bin/gnome-shell                         187MiB |
|    0      2583      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files   134MiB |
+-----------------------------------------------------------------------------+

ali-ramadhan commented 3 years ago

I don't think I've ever tried running Julia on a workstation GPU like the Quadro P400.

We usually run on a Tesla V100 or a Titan V, which both share the same Volta microarchitecture, so maybe this is why we haven't seen big differences between the two. But I think the P400 uses the Pascal microarchitecture. So this might support the claim that the tests are failing due to the accumulation of tiny arithmetic differences between GPUs (since GPUs with different microarchitectures might produce slightly different answers).
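
One way such tiny differences can arise is that floating-point addition is not associative, so the same sum evaluated in a different order can differ in the last bits. Whether operation ordering is what actually differs between the Pascal and Volta GPUs here is an assumption; the sketch below only demonstrates the general effect on the CPU with three numbers.

```julia
# The same three numbers summed in two different orders.
xs = [0.1, 0.2, 0.3]

left  = foldl(+, xs)   # (0.1 + 0.2) + 0.3
right = foldr(+, xs)   # 0.1 + (0.2 + 0.3)

println("left-to-right: ", left)
println("right-to-left: ", right)
println("bitwise equal: ", left == right)
println("approximately equal (default ≈): ", left ≈ right)
```

Here the two results differ only in the last bit and still satisfy the default `≈`, but over many time steps, and for values close to zero, such differences can grow past a purely relative tolerance.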

francispoulin commented 3 years ago

I don't plan to do much in the way of GPU computing on my desktop with the current system. If the fact that my GPUs are pretty much nonexistent is the problem, then that's not a big deal. It's just troubling when we see errors, but I can ignore them.

I am okay with closing the ticket if this is due to my GPU lacking architecture.

ali-ramadhan commented 3 years ago

Ah I wouldn't say your GPU lacks any architecture, it's just different from the one we test on. And we don't have access to many different GPUs.

Might be good to leave this issue open until we increase the ≈ tolerance to account for different GPU microarchitectures (at which point the tests should pass on your system).

francispoulin commented 3 years ago

Sounds good @ali-ramadhan and I agree that it would be nice to have the software work on various machines, not just the good ones. ;)

I think that using tolerances to measure this makes a lot of sense. I am happy to do testing whenever you like.

glwagner commented 3 years ago

Our GPU CI runs on sverdrup at MIT, which has a Quadro P6000:

https://images.nvidia.com/content/pdf/quadro/data-sheets/192152-NV-DS-Quadro-P6000-US-12Sept-NV-FNL-WEB.pdf

So we do run the regression tests there successfully. But the P400 might be different from the P6000.

glwagner commented 3 years ago

As a side note, we should run benchmarks on GPUs like the P400 and P6000, since I think they may actually benefit from Float32 (unlike calculations on Volta chips, which have plenty of double-precision capability).

CliMA / Oceananigans.jl

Increase ≈ tolerance to account for differences across GPU microarchitectures #1179