CEED / libCEED

CEED Library: Code for Efficient Extensible Discretizations
https://libceed.org
BSD 2-Clause "Simplified" License

SYCL failures for fluids example on SunSpot #1603

Open jrwrigh opened 4 months ago

jrwrigh commented 4 months ago

I've seen some failures on SunSpot with the fluids examples. The general behavior is:

The failures are only present on a few tests (SunSpot is down for maintenance today, so I can't confirm exactly which ones right now, but I'm fairly certain the Gaussian wave tests are among them), but the above behavior is pretty consistent. This was observed using the oneapi/release/2024.04.15.001 module.

The failure specifically is a non-linear solver divergence:

[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: TSStep has failed due to DIVERGED_NONLINEAR_SOLVE, increase -ts_max_snes_failures or make negative to attempt recovery
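
As the error message itself notes, the cap on nonlinear solver failures can be lifted to keep a run going while debugging; a minimal sketch, abbreviated from the Gaussian wave invocation that appears later in this thread, with the cap made negative (unlimited):

$ build/navierstokes -ceed /gpu/sycl/shared -options_file examples/gaussianwave.yaml -ts_max_snes_failures -1

This only works around the divergence, of course; it doesn't address whatever the backend is doing wrong.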

Given the relationships between the backends, I'm guessing the error is probably in the functions shared between the shared and gen backends.

Tagging @kris-rowe @uumesh

jrwrigh commented 2 weeks ago

Running on Sunspot with the following environment:

Currently Loaded Modules:
  1) spack-pe-gcc/0.7.0-24.086.0   5) gcc/12.2.0                             9) oneapi/eng-compiler/2024.04.15.002  13) fmt/8.1.1-enhyvzg              17) re2/2023-09-01-7s7ikri        21) bear/3.0.20
  2) gmp/6.2.1-pcxzkau             6) mpich/icc-all-pmix-gpu/20231026       10) libfabric/1.15.2.0                  14) abseil-cpp/20230125.3-af6loxb  18) grpc/1.44.0-xoe6dyh           22) tmux/3.3a
  3) mpfr/4.2.0-w7v7yjv            7) mpich-config/collective-tuning/1024   11) cray-pals/1.3.3                     15) c-ares/1.15.0-bvkwg2y          19) nlohmann-json/3.11.2-ejousvp  23) cmake/3.27.7
  4) mpc/1.3.1-dfagrna             8) intel_compute_runtime/release/821.36  12) cray-libpals/1.3.3                  16) protobuf/3.21.12               20) spdlog/1.10.0-g3jfctv

Note: these tests are actually done with HONEE rather than the fluids example, but the tests are nearly identical between the two

I'm getting errors on the following tests (and their respective backends):

Test: navierstokes Advection 2D, implicit square wave, direct div(F_diff): /gpu/sycl/shared
Test: navierstokes Advection 2D, explicit square wave, indirect div(F_diff): /gpu/sycl/shared
Test: navierstokes Gaussian Wave, IDL and Entropy variables: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Sequential Ceed: /gpu/sycl/shared
Test: navierstokes Gaussian Wave, explicit, supg, IDL: /gpu/sycl/shared
Test: navierstokes Advection 2D, rotation, explicit, supg, consistent mass: /gpu/sycl/gen
Test: navierstokes Advection, skew: /gpu/sycl/shared
Test: navierstokes Blasius, bc_slip, Indirect Diffusive Flux Projection: /gpu/sycl/shared
Test: navierstokes Blasius, bc_slip, Direct Diffusive Flux Projection: /gpu/sycl/shared
Test: navierstokes Advection, rotation, cosine, direct div(F_diff): /gpu/sycl/shared
Test: navierstokes Gaussian Wave, using MatShell: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Fused: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Fused: /gpu/sycl/gen
Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/shared
Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/gen
Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/shared
Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/gen
Test: navierstokes Gaussian Wave, with IDL: /gpu/sycl/shared
Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/shared
Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/gen
Test: navierstokes Blasius: /gpu/sycl/shared
Test: navierstokes Blasius, STG Inflow: /gpu/sycl/shared
Test: navierstokes Blasius, STG Inflow, Weak Temperature: /gpu/sycl/shared
Test: navierstokes Blasius, Strong STG Inflow: /gpu/sycl/shared
Test: navierstokes Channel: /gpu/sycl/gen
Test: navierstokes Channel, Primitive: /gpu/sycl/gen
Test: navierstokes Density Current, explicit: /gpu/sycl/shared
Test: navierstokes Density Current, implicit, no stabilization: /gpu/sycl/shared
Test: navierstokes Advection, rotation, implicit, SUPG stabilization: /gpu/sycl/shared
Test: navierstokes Advection 2D, rotation, explicit, strong form: /gpu/sycl/gen
Test: navierstokes Euler, explicit: /gpu/sycl/shared
Test: navierstokes Sod Shocktube, explicit, SU stabilization, y-z-beta shock capturing: /gpu/sycl/shared
Test: navierstokes Sod Shocktube, explicit, SU stabilization, y-z-beta shock capturing: /gpu/sycl/gen

The failures are inconsistent. On back-to-back runs, I see the following failure differences:

$ diff junit2_failure_names.log junit_failure_names.log
5a6
> Test: navierstokes Advection 2D, rotation, explicit, supg, consistent mass: /gpu/sycl/gen
13a15,17
> Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/gen
> Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/shared
> Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/gen
15a20
> Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/gen
20c25,26
< Test: navierstokes Channel: /gpu/sycl/shared
---
> Test: navierstokes Channel: /gpu/sycl/gen
> Test: navierstokes Channel, Primitive: /gpu/sycl/gen
21a28
> Test: navierstokes Density Current, implicit, no stabilization: /gpu/sycl/shared
23,24d29
< Test: navierstokes Advection, translation, implicit, SU stabilization: /gpu/sycl/shared
< Test: navierstokes Advection 2D, rotation, explicit, strong form: /gpu/sycl/shared
26d30
< Test: navierstokes Advection 2D, rotation, implicit, SUPG stabilization: /gpu/sycl/shared

I've attached the make junit results here: junit.log junit2.log

Most of the failures are:

TSStep has failed due to DIVERGED_NONLINEAR_SOLVE,

Some fail when comparing against the reference solution. Of note among those is the following:

Test: navierstokes Gaussian Wave, explicit, supg, IDL
  $ build/navierstokes -ceed /gpu/sycl/shared -test_type solver -options_file examples/gaussianwave.yaml -compare_final_state_atol 1e-8 -compare_final_state_filename tests/output/fluids-navierstokes-gaussianwave-explicit.bin -dm_plex_box_faces 2,2,1 -ts_max_steps 5 -degree 3 -implicit false -ts_type rk -stab supg -state_var conservative -mass_ksp_type gmres -mass_pc_jacobi_type diagonal -idl_decay_time 2e-3 -idl_length 0.25 -idl_start 0 -idl_pressure 70
FAIL: stdout
Output:
Test failed with error norm 1.7366e+142

I say of note because:

Now, the error norm might simply be due to numerical instability in the solution (it is explicit, after all), but perhaps this case in particular might better illuminate the underlying problem.
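
One way to separate those two possibilities (a sketch, not something reported in this issue) would be to run the identical command on a CPU reference backend and check whether the error norm is still enormous there; if the CPU run passes, the instability is specific to the SYCL backends:

$ build/navierstokes -ceed /cpu/self/ref/serial -test_type solver -options_file examples/gaussianwave.yaml -compare_final_state_atol 1e-8 -compare_final_state_filename tests/output/fluids-navierstokes-gaussianwave-explicit.bin -dm_plex_box_faces 2,2,1 -ts_max_steps 5 -degree 3 -implicit false -ts_type rk -stab supg -state_var conservative -mass_ksp_type gmres -mass_pc_jacobi_type diagonal -idl_decay_time 2e-3 -idl_length 0.25 -idl_start 0 -idl_pressure 70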

uumesh commented 2 weeks ago

Thanks for the notes. We will revisit the implementation of the kernels in the shared and gen backends. In the meantime, do you know if these tests also fail on the libCEED (fluids) side? Also worth checking is whether the ex1 and ex2 test cases in libCEED pass with these backends.
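
For reference, that check would look something like the following; the ex1-volume and ex2-surface binary names are my recollection of how the libCEED examples build, so treat the exact invocations as a sketch:

$ cd examples/ceed && make
$ ./ex1-volume -ceed /gpu/sycl/shared
$ ./ex2-surface -ceed /gpu/sycl/gen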

jrwrigh commented 2 weeks ago

It's the same behavior for the fluids example tests, minus the tests that are in HONEE and not libCEED.

ex1 and ex2 pass fine.

jeremylt commented 2 weeks ago

note - the SYCL backends also need a ton of updates to catch up with the CUDA/HIP backends, so that might be a more worthwhile use of time since those changes are so extensive

jrwrigh commented 1 week ago

Per the suggestion of @nbeams, I tried the libCEED tests with export ZE_SERIALIZE=2 and they pass with that environment variable set. TBH, I'm not sure what it does, but I'm guessing it disallows some form of out-of-order execution.
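
For anyone reproducing this, the run looks something like the following (a sketch: make junit matches the logs attached earlier, while the BACKENDS filter is my assumption about the test harness rather than something quoted in this issue):

$ export ZE_SERIALIZE=2
$ make junit BACKENDS="/gpu/sycl/shared /gpu/sycl/gen"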

nbeams commented 1 week ago

ZE_SERIALIZE=2 forces all kernel launches to be serialized with respect to the host. Since that seems to fix the problem, it points to a synchronization issue somewhere as the source of the failures. Unfortunately, it doesn't help us narrow down where it's coming from...

uumesh commented 1 week ago

@nbeams - when you say the kernel launches are serialized, is that equivalent to in-order execution of the queue? If that is the case, it might be easier to look for where we might have missed a queue synchronization.

nbeams commented 1 week ago

It would make the kernels execute in-order, but I think it also means the kernel launches are blocking. I've been told it's like setting CUDA_LAUNCH_BLOCKING=1.
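
In other words, the two switches play the same debugging role on their respective backends; a sketch of the analogy (the CUDA line is only for comparison and was not run in this issue):

$ CUDA_LAUNCH_BLOCKING=1 build/navierstokes -ceed /gpu/cuda/gen -options_file examples/gaussianwave.yaml
$ ZE_SERIALIZE=2 build/navierstokes -ceed /gpu/sycl/gen -options_file examples/gaussianwave.yaml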