jonas-schulze opened this issue 3 months ago

As of #51470 and https://github.com/JuliaMath/BFloat16s.jl/pull/51, I was hoping that Julia might natively support BF16 on the CPU. I did some smoke testing on a CPU with `avx512_bf16` support (according to `lscpu`) but observed some strange failure modes:

- a segfault (`@code_llvm A+A` below)
- a hang (`@code_llvm A*A` below)

In https://github.com/JuliaMath/BFloat16s.jl/issues/68 I tried executing some code on v1.11.0-alpha2, while below I merely try to generate the LLVM IR, but on the current `nightly` (available from `juliaup`).
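For concreteness, a minimal sketch of the commands in question (not part of the original report; assuming BFloat16s.jl v0.5.0 on a recent nightly):

```julia
using BFloat16s

A = ones(BFloat16, 10, 10)

@code_llvm A + A   # reported to segfault on an assert build (see below)
@code_llvm A * A   # reported to hang at 100% CPU on one core
```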
For what it's worth, I couldn't reproduce the issue on Nvidia Grace (ARM Neoverse V2, which has the bf16 extension) on

```
julia> versioninfo()
Julia Version 1.12.0-DEV.325
Commit e9a24d4cee4 (2024-04-10 13:11 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 72 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, neoverse-v2)
Threads: 1 default, 0 interactive, 1 GC (on 72 virtual cores)
```
I know it's a different architecture, but just to say this isn't completely broken everywhere :slightly_smiling_face:
It might be worth trying this out with assertions on, and on the latest master. Though be aware that native BFloat16 support requires LLVM 17, IIRC.
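For anyone following along, an assert build can be configured from a checkout of JuliaLang/julia via `Make.user` (a sketch using the documented `FORCE_ASSERTIONS` and `LLVM_ASSERTIONS` build flags):

```sh
# Enable assertions in the Julia runtime and in LLVM, then build.
printf 'FORCE_ASSERTIONS := 1\nLLVM_ASSERTIONS := 1\n' > Make.user
make
```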
I just tried an assert build of the current master and got a segfault:
```
$ ~/git/julia/julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.12.0-DEV.334 (2024-04-12)
 _/ |\__'_|_|_|\__'_|  |  Commit 1ae41a2c0a (0 days old master)
|__/                   |

(v0.5.0) pkg> activate --temp
  Activating new project at `/tmp/jl_8a5off`

(jl_8a5off) pkg> add BFloat16s
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
    Updating `/tmp/jl_8a5off/Project.toml`
  [ab4f0b2a] + BFloat16s v0.5.0
    Updating `/tmp/jl_8a5off/Manifest.toml`
  [ab4f0b2a] + BFloat16s v0.5.0
  [56f22d72] + Artifacts v1.11.0
  [2a0f44e3] + Base64 v1.11.0
  [b77e0a4c] + InteractiveUtils v1.11.0
  [8f399da3] + Libdl v1.11.0
  [37e2e46d] + LinearAlgebra v1.11.0
  [56ddb016] + Logging v1.11.0
  [d6f4376e] + Markdown v1.11.0
  [de0858da] + Printf v1.11.0
  [9a3f8284] + Random v1.11.0
  [ea8e919c] + SHA v0.7.0
  [9e88b42a] + Serialization v1.11.0
  [f489334b] + StyledStrings v1.11.0
  [8dfed614] + Test v1.11.0
  [4ec0a83e] + Unicode v1.11.0
  [e66e0078] + CompilerSupportLibraries_jll v1.1.1+0
  [4536629a] + OpenBLAS_jll v0.3.26+2
  [8e850b90] + libblastrampoline_jll v5.8.0+1
Precompiling all packages...
  1 dependency successfully precompiled in 2 seconds. 9 already precompiled.

julia> using BFloat16s

julia> A = ones(BFloat16, 10, 10);

julia> @code_llvm A+A
Segmentation fault (core dumped)
```
> Though be aware that native BFloat16 support requires LLVM 17, IIRC.

This is only for ARM. For x86 it should be LLVM 15:
> For what it's worth, I couldn't reproduce the issue on Nvidia Grace (ARM Neoverse V2, which has the bf16 extension) on
Given the link in my previous comment, the configuration you tried, @giordano, had `BFloat16s.llvm_arithmetic == false`. That is, I suspect that it did not generate `fadd bfloat` but only emulated the computations; see https://github.com/JuliaMath/BFloat16s.jl/issues/68#issuecomment-2025890696. Would you mind verifying?
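For reference, a quick way to check both things at once (a sketch; `llvm_arithmetic` is the BFloat16s.jl flag discussed above):

```julia
using BFloat16s, InteractiveUtils

# `true` means BFloat16s.jl emits native LLVM `bfloat` arithmetic;
# `false` means it emulates BFloat16 via Float32 bit manipulation.
@show BFloat16s.llvm_arithmetic

# Inspect the generated IR: look for `bfloat` operands
# versus pure i16/i32 integer code.
code_llvm(+, NTuple{2,BFloat16}; debuginfo=:none)
```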
Using the assert build on AMD EPYC 9554 I still don't see `fadd bfloat`... :slightly_frowning_face:
```
julia> a = one(BFloat16)
BFloat16(1.0)

julia> @code_llvm a+a
; Function Signature: +(Core.BFloat16, Core.BFloat16)
;  @ /home/jschulze/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:225 within `+`
define bfloat @"julia_+_4671"(bfloat %"x::BFloat16", bfloat %"y::BFloat16") #0 {
top:
  %0 = fpext bfloat %"x::BFloat16" to float
  %1 = fpext bfloat %"y::BFloat16" to float
  %2 = fadd float %0, %1
  %3 = fptrunc float %2 to bfloat
  ret bfloat %3
}

julia> BFloat16s.llvm_arithmetic
true
```
Yeah, I mentioned yesterday on Slack that on aarch64 I don't get `bfloat` types:
```
julia> code_llvm(+, NTuple{2,BFloat16}; debuginfo=:none)
; Function Signature: +(BFloat16s.BFloat16, BFloat16s.BFloat16)
define i16 @"julia_+_6962"(i16 zeroext %"x::BFloat16", i16 zeroext %"y::BFloat16") #0 {
top:
  %0 = zext i16 %"x::BFloat16" to i32
  %1 = shl nuw i32 %0, 16
  %bitcast_coercion = bitcast i32 %1 to float
  %2 = zext i16 %"y::BFloat16" to i32
  %3 = shl nuw i32 %2, 16
  %bitcast_coercion7 = bitcast i32 %3 to float
  %4 = fadd float %bitcast_coercion, %bitcast_coercion7
  %5 = fcmp ord float %4, 0.000000e+00
  br i1 %5, label %L13, label %L30

L13:                                              ; preds = %top
  %bitcast_coercion9 = bitcast float %4 to i32
  %6 = lshr i32 %bitcast_coercion9, 16
  %7 = and i32 %6, 1
  %narrow = add nuw nsw i32 %7, 32767
  %8 = zext i32 %narrow to i64
  %9 = zext i32 %bitcast_coercion9 to i64
  %10 = add nuw nsw i64 %8, %9
  %11 = lshr i64 %10, 16
  %12 = trunc i64 %11 to i16
  br label %L30

L30:                                              ; preds = %L13, %top
  %value_phi = phi i16 [ %12, %L13 ], [ 32704, %top ]
  ret i16 %value_phi
}
```
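For readers not fluent in LLVM IR, here is a pure-Julia sketch of what this emulated path computes (assuming the usual round-to-nearest-even narrowing from Float32; `bf16_add_emulated` is a name made up for illustration):

```julia
# Emulated BFloat16 addition on raw bit patterns, mirroring the IR above.
function bf16_add_emulated(x::UInt16, y::UInt16)
    # Widen each BFloat16 to Float32 by shifting its bits into the high half.
    xf = reinterpret(Float32, UInt32(x) << 16)
    yf = reinterpret(Float32, UInt32(y) << 16)
    zf = xf + yf                       # the actual `fadd float`
    isnan(zf) && return 0x7fc0         # canonical NaN (the 32704 in the phi node)
    bits = UInt64(reinterpret(UInt32, zf))
    # Round to nearest, ties to even: add 0x7fff plus the lowest kept bit.
    bits += 0x7fff + ((bits >> 16) & 0x1)
    return UInt16((bits >> 16) & 0xffff)
end
```

For example, `bf16_add_emulated(0x3f80, 0x3f80) == 0x4000`, i.e. 1.0 + 1.0 == 2.0 in BFloat16 bit patterns.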
I think this only gets enabled on LLVM 17. We have some annoying code there because of the couple of ABI breaks that have happened.
There is an infinite recursion within LLVM between `X86TTIImpl::getShuffleCost(...)`, `X86TTIImpl::getVectorInstrCost(...)`, and `BasicTTIImplBase<T>::getShuffleCost(...)`, which leads to a stack overflow. I suspect the 100% CPU utilization (one core) was simply due to Julia trying to prepare the backtrace, since `^C` resulted in an error occurring while another error was being shown.
I briefly checked the LLVM code above, but couldn't make much sense of it (unless the `get*Overhead` functions were inlined into `BasicTTIImplBase<T>::getShuffleCost` and therefore didn't show up in the backtrace below). It may also be the case that the files I linked to are not the correct ones; I didn't fully understand the Makefile.
How should we proceed with this bug? Can I hand this off to you / one of the core developers? I'm not an LLVM expert and don't have much time to dig into this at the moment, unfortunately.
The problem persists on the current nightly, Version 1.12.0-DEV.629 (2024-05-30).
The next step would be for someone to find a standalone LLVM IR reproducer; then we can file this with upstream.
But how, if it's `@code_llvm ...` that fails, not its execution?
You would use `JULIA_LLVM_ARGS="--print-before=LoopVectorize"` to get the IR before the vectorizer runs, and then verify that `opt -vectorize ir.ll` also hangs.
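A sketch of that workflow (the exact `opt` invocation and pass name may need adjusting for your LLVM version; `ir.ll` is a placeholder file name):

```sh
# Dump the IR of the offending function before LoopVectorize runs.
# --print-before writes to stderr, so redirect it to a file.
JULIA_LLVM_ARGS="--print-before=LoopVectorize" julia -e '
    using BFloat16s, InteractiveUtils
    A = ones(BFloat16, 10, 10)
    code_llvm(devnull, *, (typeof(A), typeof(A)))
' 2> ir.ll

# Then check whether the vectorizer alone reproduces the hang
# (new pass manager syntax; older opt versions spell this differently).
opt -passes=loop-vectorize -S ir.ll -o /dev/null
```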