JuliaSIMD / CPUSummary.jl

CPUSummary.jl v0.1.14 breaks CI of Trixi.jl on skylake-avx512 #6

Open ranocha opened 2 years ago

ranocha commented 2 years ago

We observed problems in Trixi.jl when upgrading CPUSummary.jl from v0.1.8 to v0.1.14. Everything is fine with the old version of CPUSummary.jl. CI also passes with the new version unless the GitHub CI runner happens to report LLVM: libLLVM-12.0.1 (ORCJIT, skylake-avx512) (on either ubuntu-latest or windows-latest). I was able to reduce the problem at https://github.com/trixi-framework/TrixiDebug.jl. Using the latest version of CPUSummary.jl, CI fails on

Restricting CPUSummary.jl to v0.1.8 lets CI pass on

So far, we have not been able to reproduce this locally...

For context: We use some matrix multiplications based on matmul! from Octavian.jl. To me, it seems like these multiplications fail catastrophically, resulting in the errors shown in CI.

CC @sloede

ranocha commented 2 years ago

Additional information:

chriselrod commented 2 years ago

Unfortunately, CPUSummary 0.1.8 did not work under wine (that is, it would segfault Julia as soon as you executed using CPUSummary, or loaded any package that depends on it). Wine support was required by my employer, so reverting the changes is not an option. The newer versions have been and continue to be mostly broken, but I'm not quite sure how to fix them.

I do have skylake-avx512 locally, so I probably just need to spend the time to figure out what is different in generic_topology.jl (which doesn't use hwloc) vs topology.jl, and then fix this plus perhaps also figure out why a misspecification will cause packages like Octavian to get wrong answers.

chriselrod commented 2 years ago

Unless you need to run Julia on wine, I suggest you pin CPUSummary 0.1.8.
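
For anyone applying this workaround, here is a minimal sketch using the standard Pkg API. The helper name pin_cpusummary is hypothetical and only serves as an illustration. Pkg.add and Pkg.pin are the regular Pkg functions. Calling the helper needs registry access and modifies the active environment.

```julia
using Pkg

# Hypothetical helper (not part of any package): install a specific
# CPUSummary version into the active environment and pin it, so that
# later `Pkg.update()` calls do not move it to a newer release.
function pin_cpusummary(version::AbstractString = "0.1.8")
    Pkg.add(name = "CPUSummary", version = version)
    Pkg.pin("CPUSummary")
    return nothing
end
```

Calling pin_cpusummary() from the project environment applies the pin. A package that depends on CPUSummary can instead restrict the version through a [compat] entry in its Project.toml.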

chriselrod commented 2 years ago

One problem is that my check for "will Hwloc segfault Julia or throw an error": https://github.com/JuliaSIMD/CPUSummary.jl/blob/d7c3676c97739d9b5f06f269ecfdf672978db254/src/CPUSummary.jl#L15-L17 almost always returns a false positive, even though it passes when run from the REPL.

ranocha commented 2 years ago

Okay, thanks!

> Unless you need to run Julia on wine, I suggest you pin CPUSummary 0.1.8.

Yeah, that's our current workaround at https://github.com/trixi-framework/Trixi.jl/pull/1083

ranocha commented 2 years ago

If using Hwloc is the problem, it seems weird that our CI reports CPUSummary.USE_HWLOC = true for CPUSummary.jl v0.1.14 (and fails tests afterwards), see https://github.com/trixi-framework/TrixiDebug.jl/runs/5493893614?check_suite_focus=true#step:6:391.

chriselrod commented 2 years ago

That will also generally be inaccurate.

julia> using CPUSummary

julia> CPUSummary.USE_HWLOC
true

julia> isdefined(CPUSummary, :safe_topology_load!)
false

This is a far more reliable check. safe_topology_load! is defined in the file included when using Hwloc, but not in the other.

Therefore, look at isdefined(CPUSummary, :safe_topology_load!) instead of USE_HWLOC.
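
To log this in CI, the check could be wrapped in a small helper. This is only a sketch: hwloc_path_loaded is a hypothetical name, not something CPUSummary.jl provides, and the two stand-in modules exist only so the snippet runs without CPUSummary installed.

```julia
# Hypothetical helper: true if `mod` defines `safe_topology_load!`, which
# (per the comment above) only exists in the file included when Hwloc is used.
hwloc_path_loaded(mod::Module) = isdefined(mod, :safe_topology_load!)

# Stand-in modules for illustration only; they mimic the two code paths.
module WithHwloc
    safe_topology_load!() = nothing
end

module WithoutHwloc
end

@show hwloc_path_loaded(WithHwloc)
@show hwloc_path_loaded(WithoutHwloc)
```

In a real CI job, the call would be hwloc_path_loaded(CPUSummary) after using CPUSummary.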

ranocha commented 2 years ago

Oh, okay. That's indeed false (see https://github.com/trixi-framework/TrixiDebug.jl/runs/5495806075?check_suite_focus=true#step:6:396).