Sa1ntPr0 opened 6 months ago
It is rare that we see adventurous users trying out stream capture (it was sorta in an "experimental" state so we didn't advertise it, https://github.com/cupy/cupy/issues/6290), so thanks for reaching out and raising the question!
I would think, at least from the CUDA perspective, that cases 5 & 6 are expected. The key point is: during stream capture, there is no actual kernel launch. So all CUDA sees with this line `a=CudaMin(A)` during capture is:

1. `A` is recorded
2. `a` is allocated from CuPy's mempool, and its pointer address is recorded
3. `CudaMin` would be recorded with `A`'s pointer as input and `a`'s as output

By the time the graph is launched, the recorded pointer addresses would be reused for the actual kernel launch.
Note that for step 2 we rely on the fact that there's a mempool; if we were to disable the pool and only use bare `cudaMalloc` under the hood, the capture would fail.
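To make the steps above concrete, here is a minimal sketch (not from the issue): `CudaMin` is approximated by an assumed float32 min reduction, and the stream/graph calls are CuPy's experimental capture API.

```python
import cupy as cp

# Rough stand-in for the poster's ReductionKernel (its exact definition is not
# shown in the issue): a float32 min reduction.
CudaMin = cp.ReductionKernel(
    'float32 x',            # input parameter
    'float32 y',            # output parameter
    'x',                    # map each element to itself
    '(a < b) ? a : b',      # reduce: keep the smaller value
    'y = a',                # post-reduction assignment
    '3.402823466e+38',      # identity: ~float32 max, so any element is smaller
    'cuda_min',
)

A = cp.arange(16, dtype=cp.float32) + 5.0
stream = cp.cuda.Stream(non_blocking=True)   # capture needs a non-default stream

with stream:
    stream.begin_capture()
    # Nothing executes on the GPU here: CuPy allocates the output array from
    # its memory pool, and only that pointer (plus A's) is recorded in the graph.
    a = CudaMin(A)
    graph = stream.end_capture()
    graph.launch()          # the kernel actually runs now, using the recorded pointers

stream.synchronize()
print(a)                    # the array allocated during capture now holds the result
```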
Thanks for the answer! To be honest, I am not very familiar with CUDA and am just an amateur programmer, so I could not fully understand the stream-capture behavior from your answer. For example, I still don't understand why the value at address `a` changes after capturing, and why one of the graphs works while the other doesn't. But since you say that this is how it should happen, I believe you :)
However, it might be worth adding some kind of warning when capturing operations like `a=CudaMin(A)`, so that inexperienced users like me can quickly understand why their code does not behave as they expect.
Sorry I dropped the ball. @Sa1ntPr0 these are all legit questions. Let me focus on Case 5 since the confusion comes from the same root cause (interplay between Python, CuPy, and CUDA).
> For example, I still don’t understand why the value at address `a` changes after capturing and why one of the graphs works and the other doesn’t.
In Case 5, it's because you originally have `a=cp.asarray(10,dtype=cp.float32)` at the beginning, but later, during capture of graph 2, you bind a new array instance to the name `a`:

```python
with stream:
    ...
    a=CudaMin(A)
    ...
```

and so at later times when `a` is referenced in the `print` function, it refers to this new instance instead of the earlier one. Let me know if this makes better sense to you.
Sorry for not returning to this issue for so long.
Thank you, I think I'm starting to understand.
Firstly, I initially had the false idea that during graph capture ABSOLUTELY NO real actions are (or can be) performed. Therefore, it seemed very strange to me that something was happening to my output 0-d array `a`.
Secondly, I thought that CuPy would treat `a` as a pointer to a value in the array, since `a` is 0-dimensional. But that's not true. If I understand correctly, if `a` were a 1-dimensional array `a=cp.asarray([10],dtype=cp.float32)`, then `a[0]=CudaMin(A)` would lead to the behavior I want, since `a[0]` would be treated as a pointer to an array element.
If I use a 0-dimensional array `a`, is there a way to show CuPy that I want to use `a` as a pointer and have the result of `CudaMin(A)` simply be written to the address that `a` points to, rather than creating a new instance of `a` for the result? (Besides using `CudaMin(A,out=a)` as in Case 3.)
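For context, a minimal sketch of the `out=` pattern from Case 3 (again with an assumed stand-in definition for `CudaMin`): the kernel writes into the buffer that the existing 0-d array already owns, so no new instance is created and the name `a` keeps pointing at the same object.

```python
import cupy as cp

# Assumed stand-in for CudaMin, as in the earlier sketch.
CudaMin = cp.ReductionKernel(
    'float32 x', 'float32 y', 'x', '(a < b) ? a : b', 'y = a',
    '3.402823466e+38', 'cuda_min',
)

A = cp.asarray([4.0, 2.0, 9.0], dtype=cp.float32)
a = cp.asarray(10, dtype=cp.float32)      # pre-existing 0-d output array

ptr_before = a.data.ptr
CudaMin(A, out=a)                         # result is written into a's existing buffer
assert a.data.ptr == ptr_before           # same pointer: no new instance was created
print(a)                                  # 2.0
```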
Description
Perhaps this behavior is intentional, but I did not find information about it and it really confused me when I encountered it. I am sorry if I am missing something.

I recorded 2 graphs via stream capture and used `cupy.ReductionKernel` in both graphs: I needed to write the minimum value of some array into a 0-dimensional CuPy variable. One of the graphs did not work as intended, but did not raise any errors either. It turned out that this behavior was caused by the line `a=CudaMin(A)` in both graphs; replacing it with `CudaMin(A,out=a)` solved the problem. (`CudaMin` is my ReductionKernel.) When `a=CudaMin(A)` was used, this operation was skipped in one of the graphs and the variable `a` remained unchanged.

I hope this code shows what I'm talking about. The most confusing cases are 5 and 6.
To Reproduce
Just imports, this part is the same for all code blocks below
Case 1:
Output 1:
Case 2:
Output 2:
Case 3:
Output 3:
Case 4:
Output 4:
Case 5:
Output 5:
Case 6:
Output 6:
Installation
Conda-Forge (`conda install ...`)
Environment
Additional Information
No response