Open 20k opened 6 years ago
Thank you for the great notes! I think this got bungled up during a revision at some point. I've not used the callback API from Haskell, so I don't know exactly how that will go. We can either create a copy of the `Vector` data as you suggest, or perhaps we could arrange to touch the data in the callback after capturing it in a closure... but I'm not sure if that will work out with the callback mechanism.
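Off the top of my head, a minimal sketch of the "touch it in a closure" idea might look something like this. It assumes that the action handed to `clAsync` only runs after the write event has completed and can perform `IO` via `liftIO`; the `writeBufferAsyncTouch` name and those details are just for illustration, not how CLUtil is actually structured:

```haskell
-- Hypothetical sketch: keep the Vector's storage alive by touching its
-- ForeignPtr in the continuation attached to the event, so a non-blocking
-- enqueue cannot observe freed memory. Assumes the continuation given to
-- 'clAsync' runs only after the event has completed.
writeBufferAsyncTouch :: forall a m. (Storable a, HasCL m)
                      => CLBuffer a -> V.Vector a -> m (CLAsync ())
writeBufferAsyncTouch (CLBuffer n mem) v =
  do when (V.length v > n)
          (throwError "writeBufferAsync: Vector is bigger than the CLBuffer")
     q <- clQueue <$> ask
     let (fp, _) = V.unsafeToForeignPtr0 v   -- Data.Vector.Storable
         ptr     = unsafeForeignPtrToPtr fp  -- Foreign.ForeignPtr.Unsafe
     -- Non-blocking write: the blocking flag is False.
     ev <- liftIO $ clEnqueueWriteBuffer q mem False 0 sz (castPtr ptr) []
     -- Touching the ForeignPtr here keeps the Vector's memory reachable
     -- until the continuation has run.
     -- Caveat: if the CLAsync is never waited upon, nothing keeps the data alive.
     return . clAsync ev . liftIO $ touchForeignPtr fp
  where sz = V.length v * sizeOf (undefined::a)
```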
I think you're looking at the right source, but IIRC we only support `Vector` arguments to `runKernel` in a synchronous setting. Supporting `Vector` arguments is mainly there for small tests and exploration, as I think a more common strategy when optimizing a larger program is to re-use buffer objects.

Would you like to start a PR that fixes `writeBufferAsync`?
(Warning: slight tangent ahead)
This usability aspect contributed to some tension that has hurt the package design over time as it teeters between trying to be simple / faithful to the C API and offering magical niceties. I would certainly like it if I could pass a `Vector`, have a pool of buffer objects from which we could pull an appropriately sized and typed one, do an async copy, and push that event forward to block the start of the eventual kernel. But fitting resource pool layers into `CLUtil` has been a real morass. I'm still for it, but I am now wary of how difficult it is. A common issue is that the API is so large that doing anything that requires recapitulating the API (e.g. with resource management and without) can easily become unwieldy.
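For concreteness, the kind of interface I have in mind would be something along these lines. This is purely a rough sketch; `BufferPool` and `acquireAndUpload` are hypothetical names, not anything CLUtil provides today:

```haskell
-- Rough API sketch only; these names and types are hypothetical.
-- A pool keyed by buffer size (and implicitly element type) holding
-- reusable CLMem objects.
newtype BufferPool = BufferPool (IORef (Map Int [CLMem]))

-- Borrow an appropriately sized buffer from the pool, enqueue an async
-- copy of the Vector into it, and return the buffer together with the
-- event that should gate the eventual kernel launch.
acquireAndUpload :: (Storable a, HasCL m)
                 => BufferPool -> V.Vector a -> m (CLBuffer a, CLAsync ())
acquireAndUpload = error "sketch only"
```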
Hi there!
I've been helping review some performance problems with code using OpenCL through CLUtil, and with OpenCL, pipeline bottlenecks are a classic cause of big performance drops. To try to address that, the code is partly moving from synchronous calls to async calls to get maximum throughput from pipelining.
However, on reviewing CLUtil, some async functions seem to block the GPU pipeline. My Haskell knowledge isn't up to scratch, so please browbeat me if I'm wrong!
```haskell
-- | Write a 'Vector''s contents to a buffer object. This operation
-- is non-blocking.
writeBufferAsync :: forall a m. (Storable a, HasCL m)
                 => CLBuffer a -> V.Vector a -> m (CLAsync ())
writeBufferAsync (CLBuffer n mem) v =
  do when (V.length v > n)
          (throwError "writeBuffer: Vector is bigger than the CLBuffer")
     q <- clQueue <$> ask
     ev <- liftIO . V.unsafeWith v $ \ptr ->
             clEnqueueWriteBuffer q mem True 0 sz (castPtr ptr) []
     return . clAsync ev $ return ()
  where sz = V.length v * sizeOf (undefined::a)
```
This internally calls `clEnqueueWriteBuffer` with the blocking flag set to `True`, which would cause a pipeline stall.
If this is correct, I would suggest instead copying the input data to some safe, non-expiring memory, potentially PCIe-accessible memory (`CL_MEM_ALLOC_HOST_PTR`) to avoid doing another copy into PCIe memory, then using that memory in the async write. The easiest way to clean up the memory after it has been written is to use a callback on the OpenCL event once it completes.
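To make the suggestion concrete, here is a rough sketch of what a copy-based fix might look like. The `writeBufferAsync'` name is mine, it assumes the action handed to `clAsync` runs after the event completes, and it stages the data with plain `mallocArray`/`copyArray`/`free` from `Foreign.Marshal` rather than `CL_MEM_ALLOC_HOST_PTR` memory:

```haskell
-- Sketch of a non-blocking write that copies the Vector into memory that
-- outlives the call, then frees that staging copy once the event is done.
-- Assumes the continuation passed to 'clAsync' runs after the write completes.
writeBufferAsync' :: forall a m. (Storable a, HasCL m)
                  => CLBuffer a -> V.Vector a -> m (CLAsync ())
writeBufferAsync' (CLBuffer n mem) v =
  do when (V.length v > n)
          (throwError "writeBufferAsync: Vector is bigger than the CLBuffer")
     q <- clQueue <$> ask
     -- Stage the data in memory the GC will not move or reclaim.
     dst <- liftIO $ do p <- mallocArray (V.length v)   -- Foreign.Marshal.Array
                        V.unsafeWith v $ \src -> copyArray p src (V.length v)
                        return p
     -- Non-blocking enqueue: the blocking flag is False.
     ev <- liftIO $ clEnqueueWriteBuffer q mem False 0 sz (castPtr dst) []
     -- Release the staging copy once the write has completed.
     return . clAsync ev . liftIO $ free dst
  where sz = V.length v * sizeOf (undefined::a)
```

A host-pointer-backed staging buffer (`CL_MEM_ALLOC_HOST_PTR`) as suggested above would avoid the extra host-side copy; the sketch sticks to APIs already used in the module just to keep it small.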
My Haskell knowledge again isn't good enough to answer the next question: is https://github.com/acowley/CLUtil/blob/master/src/CLUtil/KernelArgs.hs#L506 where kernel arguments are copied to the GPU for both synchronous and async kernel invocations? If not, could you point me to where kernel arguments are copied to the GPU for async kernel invocations? That code is similarly synchronous, but I'm currently unable to tell whether it is related.
Thanks! :)