NVIDIA / cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
https://docs.nvidia.com/cuda/cuquantum/
BSD 3-Clause "New" or "Revised" License

Demo of setting a basic memory handler #81

Closed TysonRayJones closed 10 months ago

TysonRayJones commented 10 months ago

Hi there,

I intend to bind cuStateVec to a mem pool so that I can avoid explicitly managing workspaces, as described in this doc. I wish to do this for code hygiene (of a "draft" cuQuantum implementation) rather than for performance, so I seek something simple. That doc shows the boilerplate necessary for using a custom memory pool in a very general way, but I am wondering whether this can be reduced/simplified when using a device's default pool.

Here is how the doc might suggest using the default pool:

#include <cstdint>
#include <cstring>
#include <cuda_runtime.h>
#include <custatevec.h>

// mem pool alloc & dealloc callbacks for the mem handler
int my_alloc(void* ctx, void** ptr, size_t size, cudaStream_t stream) {
    cudaMemPool_t pool = *reinterpret_cast<cudaMemPool_t*>(ctx);
    return cudaMallocFromPoolAsync(ptr, size, pool, stream);
}
int my_dealloc(void* ctx, void* ptr, size_t size, cudaStream_t stream) {
    return cudaFreeAsync(ptr, stream);
}

// then, in the setup code:

// get the default mem pool (assuming single GPU)
int device = 0;
cudaMemPool_t memPool;
cudaDeviceGetMemPool(&memPool, device);

// optionally tweak it here (affecting the existing pool), e.g.
uint64_t threshold = 16 * (1LL << 10);
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &threshold);

// create a mem handler around the mem pool (memPool, passed as ctx, must outlive the handle)
custatevecDeviceMemHandler_t handler;
handler.ctx = reinterpret_cast<void*>(&memPool);
handler.device_alloc = my_alloc;
handler.device_free = my_dealloc;
strncpy(handler.name, "default mempool", CUSTATEVEC_ALLOCATOR_NAME_LEN);  // optionally name the handler

// set mem handler to auto-manage workspaces (handle = custatevecHandle_t)
custatevecSetDeviceMemHandler(handle, &handler);

This seems unnecessarily tedious to me, given that I'm not really specifying any custom behavior: I'm just mapping device_alloc to cudaMallocFromPoolAsync and device_free to cudaFreeAsync. I sort of imagine that this setup (using the default pool) is what should happen if one calls custatevecApplyMatrix (for example) with extraWorkspace=nullptr, without having previously called custatevecSetDeviceMemHandler().
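For instance (a sketch of what I imagine, not of current behavior; the state vector and single-qubit X gate below are purely illustrative), I'd picture a call like this drawing any needed workspace from the device's default pool automatically:

// no custatevecSetDeviceMemHandler call has been made on `handle`
const uint32_t nIndexBits = 3;
cuDoubleComplex* sv;
cudaMalloc(&sv, (1ULL << nIndexBits) * sizeof(cuDoubleComplex));

// single-qubit X gate, row-major
const cuDoubleComplex matX[] = { {0,0}, {1,0},
                                 {1,0}, {0,0} };
const int32_t targets[] = {0};

custatevecApplyMatrix(
    handle, sv, CUDA_C_64F, nIndexBits,
    matX, CUDA_C_64F, CUSTATEVEC_MATRIX_LAYOUT_ROW, 0,   // adjoint = 0
    targets, 1, nullptr, nullptr, 0,                     // no control bits
    CUSTATEVEC_COMPUTE_64F,
    nullptr, 0);                                         // no explicit workspace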

Is the above the right way to use the default mem pool? Or is there a reason I should avoid using an existing pool? If this is all fine, and if (as I suspect) this will be a common use case, especially for new cuQuantum users not intimately familiar with CUDA, could there be a bespoke function to avoid this boilerplate? E.g.

custatevecSetDeviceMemHandlerToDefaultMemPool();

(and of course such a function would be expected to error if the user's device does not support stream-ordered memory)
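To be concrete, I imagine such a helper could be as thin as the sketch below. The function name is just the one proposed above (it is not an existing cuStateVec API), and the error behavior and use of cudaDevAttrMemoryPoolsSupported to detect stream-ordered memory support are my own guesses; it reuses my_alloc and my_dealloc from earlier.

// hypothetical helper, not an existing cuStateVec API
custatevecStatus_t custatevecSetDeviceMemHandlerToDefaultMemPool(custatevecHandle_t handle) {
    int device, poolsSupported = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&poolsSupported, cudaDevAttrMemoryPoolsSupported, device);
    if (!poolsSupported)
        return CUSTATEVEC_STATUS_NOT_SUPPORTED;  // device lacks stream-ordered memory

    // the pool (passed as ctx) must outlive the handle, hence static
    static cudaMemPool_t memPool;
    cudaDeviceGetMemPool(&memPool, device);

    custatevecDeviceMemHandler_t handler;
    handler.ctx = reinterpret_cast<void*>(&memPool);
    handler.device_alloc = my_alloc;
    handler.device_free = my_dealloc;
    strncpy(handler.name, "default mempool", CUSTATEVEC_ALLOCATOR_NAME_LEN);
    return custatevecSetDeviceMemHandler(handle, &handler);
}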

In any case, it might be helpful to mention in the cuStateVec doc that the default mem pool (rather than a custom one) can be used.