JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia

floor function triggers a hostcall #702

Open Alexander-Barth opened 6 hours ago

Alexander-Barth commented 6 hours ago

I am trying to port some CUDA code to AMDGPU. A lot of things already work, but I have a problem with the floor function, which seems to trigger a host call. I guess the warning means that floor is not implemented for AMD GPUs and that the CPU version is used instead? The Julia code:

using AMDGPU

a = Float32[1.2,2.3,4.4]
b = zeros(Int16,length(a))

a_d  = roc(a)
b_d  = roc(b)

function foo_d!(a_d,b_d)
    index = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    stride = gridGroupDim().x * workgroupDim().x

    @inbounds for i = index:stride:length(a_d)
        b_d[i] = floor(Int16,a_d[i])
    end
end

@roc foo_d!(a_d,b_d)
@show Array(b_d)

The output:

┌ Warning: Global hostcalls detected!
│ - Source: MethodInstance for foo_d!(::AMDGPU.Device.ROCDeviceVector{Float32, 1}, ::AMDGPU.Device.ROCDeviceVector{Float32, 1})
│ - Hostcalls: [:malloc_hostcall]
│ 
│ Use `AMDGPU.synchronize(; stop_hostcalls=false)` to synchronize and stop them.
│ Otherwise, performance might degrade if they keep running in the background.
└ @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/yqCEl/src/compiler/codegen.jl:208
Array(b_d) = Float32[1.0, 2.0, 4.0]
3-element Vector{Float32}:
 1.0
 2.0
 4.0

I use AMDGPU v1.1.2 on Julia 1.11.1. Is there some information on how to implement this function?

It seems that there is a floating-point version (single and double precision) of the floor function defined in HIP: https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/reference/kernel_language.html

And the plain floating-point floor(x) does seem to work.

However, in my case I need an integer, as I will use it as an index into an array.
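
(As a hypothetical sketch of that use case, with made-up names table and x:)

# The floored value selects an array index, so it must be an
# integer rather than a Float32.
table = Float32[10, 20, 30, 40]
x = 2.7f0
i = floor(Int, x)   # the integer-returning form is what triggers the issue
y = table[i]        # table[2] == 20.0f0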

In any case, thanks a lot for this great package!

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7A53 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 128 virtual cores)
Environment:
  LD_LIBRARY_PATH = /opt/cray/pe/papi/7.1.0.1/lib64:/opt/cray/libfabric/1.15.2.0/lib64

julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                     │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /opt/rocm/llvm/bin/ld.lld                                                                │
│     +     │ Device Libraries │ -         │ /users/barthale/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode │
│     +     │ HIP              │ 6.0.32831 │ /opt/rocm-6.0.3/lib/libamdhip64.so                                                       │
│     +     │ rocBLAS          │ 4.0.0     │ /opt/rocm-6.0.3/lib/librocblas.so                                                        │
│     +     │ rocSOLVER        │ 3.24.0    │ /opt/rocm-6.0.3/lib/librocsolver.so                                                      │
│     +     │ rocSPARSE        │ -         │ /opt/rocm-6.0.3/lib/librocsparse.so                                                      │
│     +     │ rocRAND          │ 2.10.5    │ /opt/rocm-6.0.3/lib/librocrand.so                                                        │
│     +     │ rocFFT           │ 1.0.27    │ /opt/rocm-6.0.3/lib/librocfft.so                                                         │
│     +     │ MIOpen           │ 3.0.0     │ /opt/rocm-6.0.3/lib/libMIOpen.so                                                         │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │                Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘
pxl-th commented 5 hours ago

Hi. That is because regular floor does an inexactness check and, if it fails, throws an InexactError, boxing the original value, which launches a malloc hostcall.

For a more GPU-friendly alternative, you can use floor without conversion, followed by unsafe_trunc:

julia> @code_llvm unsafe_trunc(Int, floor(1f0))
; Function Signature: unsafe_trunc(Type{Int64}, Float32)
;  @ float.jl:416 within `unsafe_trunc`
define i64 @julia_unsafe_trunc_836(float %"x::Float32") #0 {
top:
  %0 = fptosi float %"x::Float32" to i64
  %1 = freeze i64 %0
  ret i64 %1
}

You can also compare it with the original to see how much less it does:

julia> @code_llvm floor(Int, 1f0)
; Function Signature: floor(Type{Int64}, Float32)
;  @ rounding.jl:475 within `floor`
define i64 @julia_floor_794(float %"x::Float32") #0 {
top:
  %jlcallframe1 = alloca [3 x ptr], align 8
  %gcframe2 = alloca [4 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 32, i1 true)
  %thread_ptr = call ptr asm "movq %fs:0, $0", "=r"() #9
  %tls_ppgcstack = getelementptr i8, ptr %thread_ptr, i64 -8
  %tls_pgcstack = load ptr, ptr %tls_ppgcstack, align 8
  store i64 8, ptr %gcframe2, align 16
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  %task.gcstack = load ptr, ptr %tls_pgcstack, align 8
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %tls_pgcstack, align 8
; ┌ @ rounding.jl:479 within `round` @ float.jl:463
   %0 = call float @llvm.floor.f32(float %"x::Float32")
; │ @ rounding.jl:479 within `round`
; │┌ @ rounding.jl:480 within `_round_convert`
; ││┌ @ number.jl:7 within `convert`
; │││┌ @ float.jl:991 within `Int64`
; ││││┌ @ float.jl:619 within `<=`
       %1 = fcmp ult float %0, 0xC3E0000000000000
; ││││└
      %2 = fcmp uge float %0, 0x43E0000000000000
      %narrow.not = or i1 %1, %2
      %3 = fsub float %0, %0
      %4 = fcmp une float %3, 0.000000e+00
      %or.cond = or i1 %narrow.not, %4
      br i1 %or.cond, label %L17, label %L15

L15:                                              ; preds = %top
; ││││ @ float.jl:992 within `Int64`
; ││││┌ @ float.jl:416 within `unsafe_trunc`
       %5 = fptosi float %0 to i64
       %6 = freeze i64 %5
       %frame.prev9 = load ptr, ptr %frame.prev, align 8
       store ptr %frame.prev9, ptr %tls_pgcstack, align 8
; ││││└
      ret i64 %6

L17:                                              ; preds = %top
; ││││ @ float.jl:994 within `Int64`
      %7 = load ptr, ptr getelementptr (i8, ptr @jl_small_typeof, i64 256), align 8
      %gc_slot_addr_1 = getelementptr inbounds ptr, ptr %gcframe2, i64 3
      store ptr %7, ptr %gc_slot_addr_1, align 8
      %box_Float32 = call ptr @ijl_box_float32(float %0)
      %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
      store ptr %box_Float32, ptr %gc_slot_addr_0, align 16
      store ptr @"jl_sym#Int64#807.jit", ptr %jlcallframe1, align 8
      %8 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
      store ptr %7, ptr %8, align 8
      %9 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 2
      store ptr %box_Float32, ptr %9, align 8
      %10 = call nonnull ptr @j1_InexactError_805(ptr nonnull @"+Core.InexactError#806.jit", ptr nonnull %jlcallframe1, i32 3)
      call void @ijl_throw(ptr nonnull %10)
      unreachable
; └└└└
}
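
Applied to the kernel from the issue, a minimal sketch of this workaround (the original script with only the conversion line changed) could look like:

using AMDGPU

a = Float32[1.2, 2.3, 4.4]
b = zeros(Int16, length(a))

a_d = roc(a)
b_d = roc(b)

function foo_d!(a_d, b_d)
    index = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    stride = gridGroupDim().x * workgroupDim().x

    @inbounds for i = index:stride:length(a_d)
        # floor on the device, then convert without the inexactness
        # check that boxes the value and triggers the malloc hostcall.
        b_d[i] = unsafe_trunc(Int16, floor(a_d[i]))
    end
end

@roc foo_d!(a_d, b_d)
@show Array(b_d)

Note that unsafe_trunc has an unspecified result when the value does not fit in Int16, so this assumes the inputs are known to be in range.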
pxl-th commented 5 hours ago

And the reason it works on CUDA is that CUDA has a malloc intrinsic for that.
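
As a quick CPU-side sanity check with the values from the issue, the two spellings agree whenever the result fits in Int16:

julia> x = Float32[1.2, 2.3, 4.4];

julia> unsafe_trunc.(Int16, floor.(x)) == floor.(Int16, x)
true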