jrwrigh opened this issue 4 months ago (status: Open)
Running on Sunspot with the following environment:
Currently Loaded Modules:
1) spack-pe-gcc/0.7.0-24.086.0
2) gmp/6.2.1-pcxzkau
3) mpfr/4.2.0-w7v7yjv
4) mpc/1.3.1-dfagrna
5) gcc/12.2.0
6) mpich/icc-all-pmix-gpu/20231026
7) mpich-config/collective-tuning/1024
8) intel_compute_runtime/release/821.36
9) oneapi/eng-compiler/2024.04.15.002
10) libfabric/1.15.2.0
11) cray-pals/1.3.3
12) cray-libpals/1.3.3
13) fmt/8.1.1-enhyvzg
14) abseil-cpp/20230125.3-af6loxb
15) c-ares/1.15.0-bvkwg2y
16) protobuf/3.21.12
17) re2/2023-09-01-7s7ikri
18) grpc/1.44.0-xoe6dyh
19) nlohmann-json/3.11.2-ejousvp
20) spdlog/1.10.0-g3jfctv
21) bear/3.0.20
22) tmux/3.3a
23) cmake/3.27.7
Note: these tests were actually run with HONEE rather than the libCEED fluids example, but the tests are nearly identical between the two.
I'm getting errors on the following tests (and their respective backends):
Test: navierstokes Advection 2D, implicit square wave, direct div(F_diff): /gpu/sycl/shared
Test: navierstokes Advection 2D, explicit square wave, indirect div(F_diff): /gpu/sycl/shared
Test: navierstokes Gaussian Wave, IDL and Entropy variables: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Sequential Ceed: /gpu/sycl/shared
Test: navierstokes Gaussian Wave, explicit, supg, IDL: /gpu/sycl/shared
Test: navierstokes Advection 2D, rotation, explicit, supg, consistent mass: /gpu/sycl/gen
Test: navierstokes Advection, skew: /gpu/sycl/shared
Test: navierstokes Blasius, bc_slip, Indirect Diffusive Flux Projection: /gpu/sycl/shared
Test: navierstokes Blasius, bc_slip, Direct Diffusive Flux Projection: /gpu/sycl/shared
Test: navierstokes Advection, rotation, cosine, direct div(F_diff): /gpu/sycl/shared
Test: navierstokes Gaussian Wave, using MatShell: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Fused: /gpu/sycl/shared
Test: navierstokes Blasius, SGS DataDriven Fused: /gpu/sycl/gen
Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/shared
Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/gen
Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/shared
Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/gen
Test: navierstokes Gaussian Wave, with IDL: /gpu/sycl/shared
Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/shared
Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/gen
Test: navierstokes Blasius: /gpu/sycl/shared
Test: navierstokes Blasius, STG Inflow: /gpu/sycl/shared
Test: navierstokes Blasius, STG Inflow, Weak Temperature: /gpu/sycl/shared
Test: navierstokes Blasius, Strong STG Inflow: /gpu/sycl/shared
Test: navierstokes Channel: /gpu/sycl/gen
Test: navierstokes Channel, Primitive: /gpu/sycl/gen
Test: navierstokes Density Current, explicit: /gpu/sycl/shared
Test: navierstokes Density Current, implicit, no stabilization: /gpu/sycl/shared
Test: navierstokes Advection, rotation, implicit, SUPG stabilization: /gpu/sycl/shared
Test: navierstokes Advection 2D, rotation, explicit, strong form: /gpu/sycl/gen
Test: navierstokes Euler, explicit: /gpu/sycl/shared
Test: navierstokes Sod Shocktube, explicit, SU stabilization, y-z-beta shock capturing: /gpu/sycl/shared
Test: navierstokes Sod Shocktube, explicit, SU stabilization, y-z-beta shock capturing: /gpu/sycl/gen
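To see how the failures above split between backends, one option is to tally the last field of each line. This is only a minimal sketch: it assumes the failure names are saved one per line in the "Test: <name>: <backend>" form shown above, and the function name failures_per_backend is hypothetical.

```sh
# Count failures per backend from a file of lines shaped like
#   Test: <test name>: <backend>
# Prints "<backend> <count>" pairs, sorted by backend name.
failures_per_backend() {
  # Split on ": " so the backend is always the last field, even when
  # the test name itself contains commas or spaces.
  awk -F': ' '/^Test: / { n[$NF]++ } END { for (b in n) print b, n[b] }' "$1" | sort
}

# Example (file name taken from the diff command in this thread):
#   failures_per_backend junit_failure_names.log
```

Lines that do not start with "Test: " (for example diff markers) are ignored, so the function should be pointed at the raw list of failure names rather than at diff output.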
The failures are inconsistent. On back-to-back runs, I see the following failure differences:
$ diff junit2_failure_names.log junit_failure_names.log
5a6
> Test: navierstokes Advection 2D, rotation, explicit, supg, consistent mass: /gpu/sycl/gen
13a15,17
> Test: navierstokes Blasius, Anisotropic Differential Filter: /gpu/sycl/gen
> Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/shared
> Test: navierstokes Blasius, Isotropic Differential Filter: /gpu/sycl/gen
15a20
> Test: navierstokes Spanwise Turbulence Statistics: /gpu/sycl/gen
20c25,26
< Test: navierstokes Channel: /gpu/sycl/shared
---
> Test: navierstokes Channel: /gpu/sycl/gen
> Test: navierstokes Channel, Primitive: /gpu/sycl/gen
21a28
> Test: navierstokes Density Current, implicit, no stabilization: /gpu/sycl/shared
23,24d29
< Test: navierstokes Advection, translation, implicit, SU stabilization: /gpu/sycl/shared
< Test: navierstokes Advection 2D, rotation, explicit, strong form: /gpu/sycl/shared
26d30
< Test: navierstokes Advection 2D, rotation, implicit, SUPG stabilization: /gpu/sycl/shared
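Since the failure lists differ between runs, it can help to list only the tests that failed in exactly one of two runs, independent of line order. The following is a sketch under the assumption that each log holds one failure name per line; the function name flaky_failures is hypothetical.

```sh
# Print the test names that appear in exactly one of two failure lists.
# $1 and $2 are files with one failure name per line.
flaky_failures() {
  # De-duplicate each file on its own first, so a name repeated within a
  # single log is not mistaken for a name present in both logs.
  { sort -u "$1"; sort -u "$2"; } | sort | uniq -u
}

# Example (file names taken from the diff command above):
#   flaky_failures junit2_failure_names.log junit_failure_names.log
```

Unlike a plain diff, this does not show which run a name came from. It only isolates the tests whose outcome changed between the two runs.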
I've attached the make junit results here:
junit.log
junit2.log
Most of the failures are:
TSStep has failed due to DIVERGED_NONLINEAR_SOLVE,
Some fail when comparing to the reference solution. Of note among those is the following:
Test: navierstokes Gaussian Wave, explicit, supg, IDL
$ build/navierstokes -ceed /gpu/sycl/shared -test_type solver -options_file examples/gaussianwave.yaml -compare_final_state_atol 1e-8 -compare_final_state_filename tests/output/fluids-navierstokes-gaussianwave-explicit.bin -dm_plex_box_faces 2,2,1 -ts_max_steps 5 -degree 3 -implicit false -ts_type rk -stab supg -state_var conservative -mass_ksp_type gmres -mass_pc_jacobi_type diagonal -idl_decay_time 2e-3 -idl_length 0.25 -idl_start 0 -idl_pressure 70
FAIL: stdout
Output:
Test failed with error norm 1.7366e+142
I say of note because:
/gpu/sycl/shared
Now, the error norm might simply be due to numerical instability in the solution (it is explicit, after all), but this test in particular might help illuminate the underlying problem.
Thanks for the notes. We will revisit the implementation of the kernels in the shared and gen backends. In the meantime, do you know if these tests also fail on the libCEED (fluids) side? It would also be worth checking whether the ex1 and ex2 test cases in libCEED pass with these backends.
It's the same behavior for the fluids example tests, minus the tests that are in HONEE and not libCEED.
ex1 and ex2 pass fine.
Note: the SYCL backends also need a ton of updates from the CUDA/HIP backends, so that might be a more worthwhile use of time, since those changes are so extensive.
Per the suggestion of @nbeams, I tried the libCEED tests with export ZE_SERIALIZE=2, and they pass with this environment variable set. TBH, I'm not sure what it does, but I'm guessing it disallows some form of out-of-order execution.
ZE_SERIALIZE=2 forces all kernel launches to be serialized with respect to the host. Since that seems to fix the problem, it points to a sync issue somewhere being the source of the failures. Unfortunately, it doesn't help us narrow down where it's coming from...
@nbeams - when you say serialize the kernel launches, is that equivalent to in-order execution of the queue? If that is the case, it might be easier to look for where we might have missed a queue synchronization.
It would make the kernels in-order, but I think it also means the kernel launches are blocking. I've been told it's like doing CUDA_LAUNCH_BLOCKING=1.
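Because the failures are intermittent, a single passing run with ZE_SERIALIZE=2 is weaker evidence than a failure count over many runs with and without it. A rough sketch of such a comparison follows. The helper name count_failures is hypothetical, and it assumes a failing run either exits non-zero or prints "Test failed", as in the output quoted earlier in this thread.

```sh
# Run a command N times and print how many of those runs failed.
# Usage: count_failures N command [args...]
# The function body is a subshell, so its variables do not leak out.
count_failures() (
  runs=$1; shift
  failed=0; i=0
  while [ "$i" -lt "$runs" ]; do
    i=$((i + 1))
    if out=$("$@" 2>&1); then
      # Exit status was zero: still count the run as failed if the
      # output reports a failed comparison (assumed message text).
      case $out in
        *"Test failed"*) failed=$((failed + 1)) ;;
      esac
    else
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
)

# Example, with <test command> standing in for a full navierstokes
# invocation such as the one quoted above:
#   count_failures 20 <test command>
#   count_failures 20 env ZE_SERIALIZE=2 <test command>
```

Passing the variable through env keeps the setting local to the command being run, so the two counts can be collected from the same shell session.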
I've seen some failures on SunSpot with the fluids examples. The general behavior is:
/gpu/sycl/ref passes fine every time
/gpu/sycl/shared fails about 90% of the time
/gpu/sycl/gen fails about 10% of the time
The failures are only present on a few tests (SunSpot is down for maintenance today, so I can't confirm which ones exactly right now, but I'm fairly certain the Gaussian wave tests are among them), but the above behavior is pretty consistent. This is observed using oneapi/release/2024.04.15.001.
The failure specifically is a non-linear solver divergence:
Given the relationships between the backends, I'm guessing the error is probably in the functions shared between the shared and gen backends.
Tagging @kris-rowe @uumesh