Open RDambrosio016 opened 2 years ago
I have temporarily disabled `StreamFlags::NON_BLOCKING` in the unreleased version of cust. This should not have a major performance impact, since cust does not expose the null stream anyway. I'll leave this open until NVIDIA gets back to me about this issue.
@RDambrosio016 could this be solved by using the async versions of the copy functions, which require a stream argument? If I'm understanding correctly, the issue comes from the default memcpy functions using the null stream. Since cust doesn't expose the null stream, all the kernels run on different streams than the copies.
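To illustrate why putting the copy on the same stream as the kernel would avoid the race, here is a toy model in plain Rust. It is not the cust or CUDA API: `ToyStream`, `enqueue`, and `copy_then_kernel_on_one_stream` are hypothetical names, and a worker thread draining a FIFO queue stands in for a stream. The only property it models is that operations queued on one stream run in submission order.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

// One queued operation (a stand-in for an async copy or a kernel launch).
type Op = Box<dyn FnOnce() + Send + 'static>;

// Hypothetical toy stream: a FIFO queue drained by a single worker thread.
struct ToyStream {
    tx: Sender<Op>,
    worker: JoinHandle<()>,
}

impl ToyStream {
    fn new() -> Self {
        let (tx, rx) = channel::<Op>();
        // The worker runs operations strictly in the order they were queued.
        let worker = thread::spawn(move || {
            for op in rx {
                op();
            }
        });
        ToyStream { tx, worker }
    }

    // Queue an operation and return immediately, like an async call.
    fn enqueue(&self, op: impl FnOnce() + Send + 'static) {
        self.tx.send(Box::new(op)).expect("worker thread exited");
    }

    // Close the queue and block until every queued operation has finished.
    fn synchronize(self) {
        let ToyStream { tx, worker } = self;
        drop(tx);
        worker.join().expect("worker thread panicked");
    }
}

// Queue a "host to device copy" and a "kernel" on the SAME toy stream.
fn copy_then_kernel_on_one_stream(host: Vec<f32>) -> Vec<f32> {
    let device = Arc::new(Mutex::new(vec![0.0f32; host.len()]));
    let stream = ToyStream::new();

    // "Async copy": fills the device buffer from host data.
    let dst = Arc::clone(&device);
    stream.enqueue(move || {
        let mut d = dst.lock().unwrap();
        d.copy_from_slice(&host);
    });

    // "Kernel": doubles every element. It is queued after the copy on the
    // same stream, so it cannot observe the buffer before the copy lands.
    let buf = Arc::clone(&device);
    stream.enqueue(move || {
        let mut d = buf.lock().unwrap();
        for x in d.iter_mut() {
            *x *= 2.0;
        }
    });

    stream.synchronize();
    let out = device.lock().unwrap().clone();
    out
}
```

Calling `copy_then_kernel_on_one_stream(vec![1.0, 2.0, 3.0])` returns `[2.0, 4.0, 6.0]` on every run, because the single queue fixes the order. If the copy were queued on a second, unsynchronized `ToyStream`, nothing would order it before the kernel, which is the kind of race described in this issue.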
Streams with `NON_BLOCKING` exhibit very confusing and very dangerous behavior with regard to memcpy due to odd CUDA semantics, per the driver API docs.

Because `NON_BLOCKING` streams do not synchronize with the null (default) stream, this leads to potential race conditions. NVIDIA appears to be aware of this issue, but in the meantime, it may be beneficial to implicitly disable `NON_BLOCKING` for now, especially since cust does not expose stream-ordered memory allocation.

This appears to be why the `add` example sometimes does nothing on certain systems.