JuliaAttic / CUDArt.jl

Julia wrapper for CUDA runtime API

Unified Memory support #99

Open · barche opened this issue 7 years ago

barche commented 7 years ago

The following code reproduces the Unified Memory example from NVIDIA in Julia: https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-add_cudart-jl
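For context, the core of that example is one cudaMallocManaged call; a minimal sketch of just that call through a plain ccall into libcudart (the helper names malloc_managed / free_managed are made up here, the gist goes through CUDArt instead):

# cudaMallocManaged(void** devPtr, size_t size, unsigned int flags);
# cudaMemAttachGlobal (0x01) makes the allocation accessible from any stream/device.
const cudaMemAttachGlobal = Cuint(0x01)

function malloc_managed(::Type{T}, n::Integer) where {T}
    ptr_ref = Ref{Ptr{Cvoid}}(C_NULL)
    status = ccall((:cudaMallocManaged, "libcudart"), Cint,
                   (Ptr{Ptr{Cvoid}}, Csize_t, Cuint),
                   ptr_ref, n * sizeof(T), cudaMemAttachGlobal)
    status == 0 || error("cudaMallocManaged failed with status $status")
    return Ptr{T}(ptr_ref[])  # the same pointer is valid on host and device
end

free_managed(p::Ptr) = ccall((:cudaFree, "libcudart"), Cint, (Ptr{Cvoid},), p)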

Kernel run time is the same as with the .cu version compiled with nvcc; the nvprof output I get is this:

Time(%)      Time     Calls       Avg       Min       Max  Name
 61.18%  871.81us        11  79.255us  78.689us  79.872us  julia_kernel_add_61609
 38.82%  553.09us        11  50.280us  48.832us  53.344us  julia_kernel_init_61427

I decided to try to make the interface a little nicer by creating a UnifiedArray type modeled after CuDeviceArray, shown in this file together with the test: https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-unifiedarray-jl
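In rough terms, the wrapper looks like this (a simplified sketch; the exact fields and methods are in the gist):

# Simplified UnifiedArray: a CuDeviceArray-style view over a managed pointer.
struct UnifiedArray{T,N} <: AbstractArray{T,N}
    ptr::Ptr{T}           # pointer obtained from cudaMallocManaged
    shape::NTuple{N,Int}
end

Base.size(a::UnifiedArray) = a.shape
Base.IndexStyle(::Type{<:UnifiedArray}) = IndexLinear()
Base.getindex(a::UnifiedArray, i::Int) = unsafe_load(a.ptr, i)
Base.setindex!(a::UnifiedArray{T}, v, i::Int) where {T} =
    (unsafe_store!(a.ptr, convert(T, v), i); v)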

Unfortunately, this runs significantly slower:

Time(%)      Time     Calls       Avg       Min       Max  Name
 56.90%  1.0317ms        11  93.792us  91.520us  100.48us  julia_kernel_add_61608
 41.03%  743.85us        11  67.622us  54.369us  77.472us  julia_kernel_init_61428
  2.07%  37.536us        55     682ns     640ns  1.1520us  [CUDA memcpy HtoD]

Comparing the @code_llvm output for the init kernel after the if shows, for the first version:

  %16 = getelementptr float, float* %1, i64 %15, !dbg !21
  %17 = getelementptr float, float* %0, i64 %15, !dbg !20
  store float 1.000000e+00, float* %17, align 8, !dbg !20, !tbaa !22
  store float 2.000000e+00, float* %16, align 8, !dbg !21, !tbaa !22
  br label %L47, !dbg !21

and for the UnifiedArray version:

  %16 = getelementptr inbounds %UnifiedArray.4, %UnifiedArray.4* %0, i64 0, i32 0, !dbg !23
  %17 = add i64 %12, -1, !dbg !23
  %18 = load float*, float** %16, align 8, !dbg !23, !tbaa !20
  %19 = getelementptr float, float* %18, i64 %17, !dbg !23
  store float 1.000000e+00, float* %19, align 8, !dbg !23, !tbaa !24
  %20 = getelementptr inbounds %UnifiedArray.4, %UnifiedArray.4* %1, i64 0, i32 0, !dbg !26
  %21 = load float*, float** %20, align 8, !dbg !26, !tbaa !20
  %22 = getelementptr float, float* %21, i64 %17, !dbg !26
  store float 2.000000e+00, float* %22, align 8, !dbg !26, !tbaa !24
  br label %L47, !dbg !26

So now for the questions:

- Where does this difference in performance come from, and is it possible to keep the array abstraction and have it perform as well as the pointer version?
- Are there any plans to add an array based on the Unified Memory model?
- Are there any plans to wrap the CUDA8 functions, such as cudaMemPrefetchAsync?

p.s. great job on all these CUDA packages, this was a lot easier to set up than I had anticipated :)

maleadt commented 7 years ago

FYI: CUDArt is somewhat unmaintained; @timholy, @vchuravy and I occasionally check in small compatibility fixes and tag new releases, but my time at least is spent on CUDAdrv...

That said...

> Where does this difference in performance come from, and is it possible to keep the array abstraction and have it perform as well as the pointer version?

In your first example, you pass a literal pointer (to global memory) to the kernel. This pointer itself is a bitstype (a primitive type in modern nomenclature), which means it is passed by value and resides in parameter space, a constant memory that doesn't require synchronization for thread accesses. Dereferencing the pointer, however, does need synchronization, as it points to global memory.

Your second example passes a UnifiedArray, which is not a primitive type but an aggregate type (in LLVM nomenclature), which Julia passes by pointer. This means there is an extra indirection: first dereference the pointer to the UnifiedArray, then access the underlying pointer.

However, this should have been fixed recently in JuliaGPU/CUDAnative.jl#78, so I presume you were using an older version of CUDAnative?
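To illustrate the two conventions (a rough sketch modeled on your init kernel; the exact signatures in your gist may differ):

using CUDAnative

# Raw pointers are isbits, so they end up by value in parameter space;
# each access is a plain getelementptr + store, as in the first IR listing.
function kernel_init_ptr(x::Ptr{Float32}, y::Ptr{Float32}, n::Int)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        unsafe_store!(x, 1.0f0, i)
        unsafe_store!(y, 2.0f0, i)
    end
    return nothing
end

# An aggregate like UnifiedArray was (before CUDAnative.jl#78) passed by pointer,
# so each access first loads the wrapped pointer: the extra
# `load float*, float**` in the second IR listing.
function kernel_init_arr(x, y, n::Int)  # x, y are UnifiedArray{Float32,1}
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        x[i] = 1.0f0
        y[i] = 2.0f0
    end
    return nothing
end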

> Are there any plans to add an array based on the Unified Memory model?

No, because up until very recently there was no benefit, except for programmability, which was already pretty seamless thanks to automatic conversion at the @cuda boundary. However, recent GPUs use page faulting + "speculative" execution to avoid having to transfer memory right away, so it might be beneficial to do so. I won't be spending time on it though (priorities...), maybe you will? 😃 It might make sense to figure out how to make our GPU array / buffer type a bit more portable though; putting it all in CUDAdrv or CUDAnative does seem out of scope (cc @SimonDanisch @MikeInnes)

> Are there any plans to wrap the CUDA8 functions, such as cudaMemPrefetchAsync?

No, although I don't think it would be much work. At least for CUDAdrv (are you using CUDArt for a specific reason?).
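For reference, the raw driver call is tiny; an untested sketch through ccall into libcuda (the prefetch wrapper name is made up, and hooking it into CUDAdrv's buffer types is the actual work):

# cuMemPrefetchAsync(CUdeviceptr devPtr, size_t count, CUdevice dstDevice, CUstream hStream)
const CU_DEVICE_CPU = Cint(-1)   # pass this as the destination to prefetch back to the host

function prefetch(devptr::UInt64, nbytes::Integer, device::Integer,
                  stream::Ptr{Cvoid}=C_NULL)
    status = ccall((:cuMemPrefetchAsync, "libcuda"), Cint,
                   (UInt64, Csize_t, Cint, Ptr{Cvoid}),  # CUdeviceptr, size_t, CUdevice, CUstream
                   devptr, nbytes, device, stream)
    status == 0 || error("cuMemPrefetchAsync failed with status $status")
    return nothing
end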

barche commented 7 years ago

Ah, yes, it was because of an old version; both examples run at the same speed now. I was using the runtime API because that's what the NVIDIA beginner's tutorial proposed, but I have now converted it to the driver API, which was in fact very easy using @apicall.

I'm not sure I'm the one to implement a new GPU array type at this point, considering I'm still taking baby steps with CUDA here :)

cdsousa commented 6 years ago

@barche, can you share the version using CUDAdrv.jl, please?

barche commented 6 years ago

@cdsousa Done, see https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-add_cuda-jl

cdsousa commented 6 years ago

Thanks @barche, but I assumed you had converted the "Unified Memory" example to CUDAdrv... I have been trying it myself, but I'm getting errors when calling cuMemPrefetchAsync from the driver API using @apicall. That was what I hoped you had already solved.

maleadt commented 6 years ago

Doesn't it work by just changing the :cuda calls to their :cu counterparts, using @apicall instead of CUDArt.rt.checkerror(ccall(...))? Also try using cuda-memcheck for possibly better error messages. If you can't get it to work, I can have a quick look. I won't have time to design proper abstractions anytime soon, but at least some working code would be a good first step.
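Concretely, something along these lines for the managed allocation (untested, and assuming @apicall's (symbol, argument-type tuple, arguments...) form from the CUDAdrv of that era; the driver counterpart of cudaMallocManaged is cuMemAllocManaged):

using CUDAdrv

# assumes cuInit has run and a context is active, e.g. ctx = CuContext(CuDevice(0))
nbytes = 1024 * sizeof(Float32)
ptr_ref = Ref{UInt64}(0)          # CUdeviceptr is an unsigned 64-bit handle
CUDAdrv.@apicall(:cuMemAllocManaged,
                 (Ptr{UInt64}, Csize_t, Cuint),
                 ptr_ref, nbytes, Cuint(1))  # flags: CU_MEM_ATTACH_GLOBAL == 0x1
devptr = ptr_ref[]                # usable from both host and device code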

barche commented 6 years ago

I have just updated the gist (file unifiedarray.jl); I didn't realize that was still the old version, sorry.

cdsousa commented 6 years ago

Ah, thank you very much @barche, that's exactly what I was looking for. :+1:

Unfortunately, it is more or less what I was trying, and it throws the same error (ERROR_INVALID_VALUE) on the cuMemPrefetchAsync call.

I will follow @maleadt's suggestion and try to understand what's going on. Maybe there is something special about the platform I'm experimenting with, a Jetson TX1. I had already successfully used unified memory, but that was with C++ on a Jetson TX2 through the runtime API...

I'll post additional questions to Discourse. And if I have time, I would like to further develop unified memory abstractions and propose them for CUDAdrv.jl.

cdsousa commented 6 years ago

OK, it is probably because "Maxwell architectures [..] support a more limited form of Unified Memory" :man_facepalming:. I'll try on a Jetson TX2. Thank you both.
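For anyone hitting this later: whether full managed access (and hence prefetching) is supported can be queried up front. A rough sketch, where the attribute id 89 is CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS from cuda.h (worth double-checking against your headers):

# cuDeviceGetAttribute(int* pi, CUdevice_attribute attrib, CUdevice dev)
function concurrent_managed_access(device::Integer)
    val = Ref{Cint}(0)
    status = ccall((:cuDeviceGetAttribute, "libcuda"), Cint,
                   (Ptr{Cint}, Cint, Cint),
                   val, Cint(89), Cint(device))
    status == 0 || error("cuDeviceGetAttribute failed with status $status")
    return val[] != 0   # zero on Maxwell-class devices such as the TX1
end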

barche commented 6 years ago

I'll test tonight on my machine at home to confirm it still works; it has been a while since I tried this.

barche commented 6 years ago

To confirm, I tried on my GTX 1060 and it still worked.