Open mynotwo opened 3 months ago
Hi! In this case the .copy_ operation is non-blocking, meaning it doesn't wait for the underlying copy to finish but lets the Python thread proceed as soon as the operation is submitted. You might want to look into PyTorch's profiler. I recommend exporting your traces to JSON and viewing them in Perfetto or chrome://tracing.
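In case it helps, here is a minimal sketch of both approaches, not code from this repo. The helper names `time_copy_ms` and `export_copy_trace`, the tensor sizes, and the `copy_trace.json` path are all made up for illustration. It also assumes the copy is issued on the current CUDA stream; if the copy runs on a separate stream, the events would have to be recorded on that stream instead.

```python
import time

import torch
from torch.profiler import ProfilerActivity, profile


def time_copy_ms(dst: torch.Tensor, src: torch.Tensor) -> float:
    """Milliseconds until dst.copy_(src, non_blocking=True) has really finished.

    Hypothetical helper: assumes the copy runs on the current CUDA stream.
    """
    if dst.is_cuda or src.is_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # drain earlier work so it is not measured
        start.record()
        dst.copy_(src, non_blocking=True)  # returns right after submission
        end.record()
        end.synchronize()  # block until the copy itself has completed
        return start.elapsed_time(end)
    # CPU-only fallback: the copy is synchronous, so a wall clock is enough.
    t0 = time.perf_counter()
    dst.copy_(src)
    return (time.perf_counter() - t0) * 1e3


def export_copy_trace(dst: torch.Tensor, src: torch.Tensor,
                      path: str = "copy_trace.json") -> None:
    """Profile one copy and write a trace viewable in Perfetto or chrome://tracing."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        dst.copy_(src, non_blocking=True)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # keep the copy inside the profiled region
    prof.export_chrome_trace(path)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    src = torch.randn(1024, 1024)
    if device == "cuda":
        src = src.pin_memory()  # pinned host memory allows a truly async copy
    dst = torch.empty(1024, 1024, device=device)
    print(f"copy took {time_copy_ms(dst, src):.3f} ms")
    export_copy_trace(dst, src)
```

Wrapping the copy in `time.time()` without a synchronize after it only measures how long it takes to submit the operation, which would explain a reading around 1e-5 s.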
Hi, thanks for your work! I recently wanted to benchmark the latency of each step in this repo, and I found that if I use torch.cuda.synchronize() and time.time(), I cannot get the actual data copy time.
For example, I believe the data copy time is spent in those two lines.
time.time() gives me 1e-5 s, which I believe is far shorter than the real data transfer latency. I think the reason might be that multiple processes/threads are involved, which leads to a wrong latency measurement. Could you help me solve this problem?
Many thanks!