ranocha opened this issue 2 years ago
Additional information:
- LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512): https://github.com/trixi-framework/TrixiDebug.jl/runs/5493794139?check_suite_focus=true#step:6:359
- LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512): https://github.com/trixi-framework/TrixiDebug.jl/runs/5493893614?check_suite_focus=true#step:6:359. Additional information about CPUSummary.jl is shown at https://github.com/trixi-framework/TrixiDebug.jl/runs/5493893614?check_suite_focus=true#step:6:391.

I can add any additional information that might help you, but I do not really have an idea what is going on...

Unfortunately, CPUSummary 0.1.8 did not work under wine (that is, it would segfault Julia as soon as you ran using CPUSummary, or using any package that depends on it), and running under wine was required by my employer, so reverting the changes is not an option.
The newer versions have been, and continue to be, mostly broken, but I'm not quite sure how to fix them.
I do have a skylake-avx512 CPU locally, so I probably just need to spend the time to figure out what differs between generic_topology.jl (which doesn't use Hwloc) and topology.jl, fix that, and perhaps also figure out why such a misspecification causes packages like Octavian to compute wrong answers.
Unless you need to run Julia on wine, I suggest you pin CPUSummary 0.1.8.
One problem is that my check for "will Hwloc segfault Julia or throw an error": https://github.com/JuliaSIMD/CPUSummary.jl/blob/d7c3676c97739d9b5f06f269ecfdf672978db254/src/CPUSummary.jl#L15-L17 almost always returns a false positive, even though it passes when run from the REPL.
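The linked check is not reproduced here. As a rough sketch of the general idea, assuming the goal is a probe that cannot take the calling process down with it, the risky code can be run in a separate Julia process and only its exit status inspected. The helper name runs_cleanly_in_subprocess is hypothetical; Base.julia_cmd, pipeline, and success are standard Julia.

```julia
"""
    runs_cleanly_in_subprocess(code) -> Bool

Hypothetical helper: run `code` in a fresh Julia process and report whether it
exited successfully. A segfault or uncaught error in the child only makes this
return `false`; it cannot crash the calling process.
"""
function runs_cleanly_in_subprocess(code::AbstractString)
    # Forward the active project so the child can load the same packages.
    proj = Base.active_project()
    project_flag = proj === nothing ? String[] : ["--project=$proj"]
    # Base.julia_cmd() reproduces the current julia binary and its key flags;
    # interpolating a Vector{String} splices its elements as separate arguments.
    cmd = `$(Base.julia_cmd()) --startup-file=no $project_flag -e $code`
    # Discard the child's output; only its exit status matters here.
    return success(pipeline(cmd; stdout=devnull, stderr=devnull))
end

# Example use (assumes Hwloc.jl is installed in the active environment):
# hwloc_ok = runs_cleanly_in_subprocess("using Hwloc; Hwloc.num_physical_cores()")
```

The trade-off of this design is the start-up cost of an extra Julia process per probe, in exchange for turning a crash into a plain false.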
Okay, thanks!
> Unless you need to run Julia on wine, I suggest you pin CPUSummary 0.1.8.
Yeah, that's our current workaround at https://github.com/trixi-framework/Trixi.jl/pull/1083
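For reference, one way to pin an exact release from Julia itself is sketched below, assuming the standard Pkg API. The wrapper name pin_cpusummary is hypothetical, and this is not necessarily how the linked Trixi.jl PR implements the workaround.

```julia
using Pkg

# Hypothetical convenience wrapper: install CPUSummary at an exact release in
# the active environment, then pin it so `Pkg.update()` does not move it.
function pin_cpusummary(version::AbstractString = "0.1.8")
    Pkg.add(name = "CPUSummary", version = version)  # exact release
    Pkg.pin("CPUSummary")                             # freeze at that release
    return nothing
end

# pin_cpusummary()  # run once in the environment that should stay on 0.1.8
```

Pkg.free("CPUSummary") removes the pin again once a fixed release is available.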
If using Hwloc is the problem, it seems weird that our CI reports CPUSummary.USE_HWLOC = true for CPUSummary.jl v0.1.14 (and then fails the tests); see https://github.com/trixi-framework/TrixiDebug.jl/runs/5493893614?check_suite_focus=true#step:6:391.
That will also generally be inaccurate.
```julia
julia> using CPUSummary

julia> CPUSummary.USE_HWLOC
true

julia> isdefined(CPUSummary, :safe_topology_load!)
false
```
This is a far more reliable check: safe_topology_load! is defined in the file that is included when using Hwloc, but not in the other one. Therefore, look at isdefined(CPUSummary, :safe_topology_load!) instead of USE_HWLOC.
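To illustrate why a flag and the code that was actually loaded can disagree, here is a self-contained mock. MockSummary, hwloc_really_usable, and generic_num_cores are hypothetical names, and this is not CPUSummary's actual source; the point is only that isdefined on a name defined in just one code path reports which path really ran.

```julia
# Hypothetical stand-in for a package that picks one of two implementations
# at load time. The names mirror the thread, but this is NOT CPUSummary's code.
module MockSummary

# A flag computed earlier that may be stale or overly optimistic.
const USE_HWLOC = true

# Hypothetical probe result: pretend loading Hwloc turned out to be unsafe.
const hwloc_really_usable = false

if hwloc_really_usable
    # Only defined on the Hwloc code path (cf. topology.jl).
    safe_topology_load!() = nothing
else
    # Only defined on the fallback path (cf. generic_topology.jl).
    generic_num_cores() = Sys.CPU_THREADS
end

end # module

# The flag claims Hwloc is in use ...
println("USE_HWLOC = ", MockSummary.USE_HWLOC)
# ... but checking for a path-specific name reveals which code was loaded.
println("Hwloc path loaded: ", isdefined(MockSummary, :safe_topology_load!))
```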
Oh, okay. That's indeed false (see https://github.com/trixi-framework/TrixiDebug.jl/runs/5495806075?check_suite_focus=true#step:6:396).
We observed some specific problems when going from CPUSummary.jl v0.1.8 to v0.1.14 in Trixi.jl. Everything is fine with the old version of CPUSummary.jl. CI also passes with the new version unless the GitHub CI runner happens to use
LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512)
(on either ubuntu-latest or windows-latest). I was able to reduce this problem to a smaller example at https://github.com/trixi-framework/TrixiDebug.jl. Using the latest version of CPUSummary.jl, CI fails on
- ubuntu-latest (e.g., https://github.com/trixi-framework/TrixiDebug.jl/runs/5492313195?check_suite_focus=true#step:6:357)
- windows-latest (e.g., https://github.com/trixi-framework/TrixiDebug.jl/runs/5492410761?check_suite_focus=true#step:6:356)

Restricting CPUSummary.jl to v0.1.8 lets CI pass on
- ubuntu-latest with LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512) (https://github.com/trixi-framework/TrixiDebug.jl/runs/5493268766?check_suite_focus=true#step:6:358)
- windows-latest with LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512) (https://github.com/trixi-framework/TrixiDebug.jl/runs/5493268841?check_suite_focus=true#step:6:357)

So far, we have not been able to reproduce this locally...
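Since the failures correlate with the runner's CPU rather than with the OS, it may help to print the relevant host facts next to the test results. A minimal sketch using only Base and the InteractiveUtils standard library; host_report and is_suspect_runner are hypothetical helper names.

```julia
using InteractiveUtils  # provides versioninfo() outside the REPL

# Hypothetical helper: collect the host facts the failing CI runs had in common.
function host_report()
    return (
        cpu_name = Sys.CPU_NAME,          # LLVM's name for the host CPU
        cpu_threads = Sys.CPU_THREADS,    # logical threads visible to Julia
        llvm = Base.libllvm_version,      # version of the bundled LLVM
        julia = VERSION,
        os = string(Sys.KERNEL),
    )
end

# Hypothetical flag for the runner type that failed in this thread.
is_suspect_runner() = Sys.CPU_NAME == "skylake-avx512"

println(host_report())
println("suspect runner: ", is_suspect_runner())
versioninfo()  # full block, including the "LLVM: libLLVM-..." line
```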
For context: we use some matrix multiplications based on matmul! from Octavian.jl. To me, it seems like these multiplications fail catastrophically, resulting in the errors shown in CI.

CC @sloede
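A quick way to test the "matmul! fails catastrophically" hypothesis directly on an affected runner is to compare the in-place product against the reference A * B for a few sizes. The sketch below is hypothetical (matmul_agrees is not part of any package) and uses only the LinearAlgebra and Random standard libraries; the sizes and tolerance are arbitrary choices.

```julia
using LinearAlgebra
using Random

"""
    matmul_agrees(matmul!; sizes, rtol) -> Bool

Hypothetical sanity check: compare an in-place `matmul!(C, A, B)` against
the reference product `A * B` for several matrix sizes.
"""
function matmul_agrees(matmul!; sizes = (1, 7, 64, 200), rtol = sqrt(eps(Float64)))
    rng = MersenneTwister(42)          # fixed seed for reproducible inputs
    for n in sizes
        A = rand(rng, n, n + 3)        # non-square, to exercise edge handling
        B = rand(rng, n + 3, n)
        C = fill(NaN, n, n)            # NaN-filled so untouched entries fail
        matmul!(C, A, B)
        isapprox(C, A * B; rtol = rtol) || return false
    end
    return true
end

println(matmul_agrees(mul!))  # reference implementation from LinearAlgebra
```

With Octavian.jl installed, the same check would be matmul_agrees(Octavian.matmul!); a false result on skylake-avx512 runners only would point at the multiplication rather than at Trixi.jl itself.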