Is it possible to use it without Cooperative Kernel? (regardless of performance loss) (it works)

NVIDIA / nv-wavenet

Reference implementation of real-time autoregressive wavenet inference

BSD 3-Clause "New" or "Revised" License


Issue #36, closed by engiecat 6 years ago

engiecat commented 6 years ago

First of all, thank you for developing this repo :)

I've been trying to port this repo to Windows. I managed to build the PyTorch version with MSVC, but at runtime I hit CUDA error 71 (cudaErrorNotSupported). (I'm currently using a GTX 1060 6GB on Windows 10 with CUDA 9.0.)

The error was tracked to https://github.com/NVIDIA/nv-wavenet/blob/0822dc523b0873f4d9cabd24364787dcb01377a2/nv_wavenet_persistent.cuh#L529

Via https://devtalk.nvidia.com/default/topic/1022751/cuda-setup-and-installation/gtx-1080-does-not-support-cooperative-kernel-launch-/, I discovered that cooperative kernel launch is only available on Linux, or on Windows in TCC mode.

Would it be possible to use a non-cooperative launch instead (changing cudaLaunchCooperativeKernel to cudaLaunchKernel)? If so, how much performance would be lost?
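A minimal sketch of how one might check for cooperative launch support up front, rather than discovering it through cudaErrorNotSupported at launch time. It only uses the CUDA runtime attribute cudaDevAttrCooperativeLaunch and the tccDriver field of cudaDeviceProp; querying device 0 is an assumption for illustration and is not taken from nv-wavenet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    // Ask the runtime whether this device/driver combination supports
    // cudaLaunchCooperativeKernel.
    int supportsCoop = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &supportsCoop, cudaDevAttrCooperativeLaunch, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // tccDriver is 1 when the device runs under the TCC driver model.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Device: %s\n", prop.name);
    printf("TCC driver: %s\n", prop.tccDriver ? "yes" : "no");
    printf("Cooperative launch supported: %s\n", supportsCoop ? "yes" : "no");
    return 0;
}
```

If the attribute reports no support, the calling code could select a variant that does not need a grid-wide barrier instead of failing at launch.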

engiecat commented 6 years ago

I tested on a Tesla P40 on an Ubuntu 16.04 machine and found that it runs fine. Time measurements:

Without coop: real 0m8.093s, user 0m5.840s, sys 0m2.316s

With coop: real 0m8.089s, user 0m5.888s, sys 0m2.260s

engiecat commented 6 years ago

The problem is that it still doesn't run on my Windows machine. It fails with "GPUassert: unspecified launch failure" at the cudaDeviceSynchronize call: https://github.com/NVIDIA/nv-wavenet/blob/0822dc523b0873f4d9cabd24364787dcb01377a2/pytorch/wavenet_infer.cu#L98

This may be due to the WDDM TDR (Timeout Detection and Recovery) feature, but it didn't recover for 20+ minutes, so there seems to be a problem.

BrianPharris commented 6 years ago

Changing cudaLaunchCooperativeKernel to cudaLaunchKernel will not affect performance, but it will prevent the CUDA driver from being able to guarantee simultaneous execution of the synchronizing threads. This could lead to deadlock.

The single-block variant does not require cooperative groups.
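To illustrate the deadlock risk described above, here is a hedged sketch of a hand-rolled grid barrier launched with plain cudaLaunchKernel. The kernel spinBarrierKernel and its counter are hypothetical and are not nv-wavenet's actual synchronization code. If more blocks are launched than can be resident on the GPU at once, the resident blocks spin forever waiting for blocks that never get scheduled. The sketch caps the grid at the occupancy limit, but unlike a cooperative launch, a plain launch still does not guarantee that all blocks run simultaneously.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block announces its arrival, then spins until
// all `expected` blocks have arrived. This only terminates if every block
// in the grid is resident on the GPU at the same time.
__global__ void spinBarrierKernel(unsigned int* arrived, unsigned int expected) {
    if (threadIdx.x == 0) {
        atomicAdd(arrived, 1u);
        // volatile forces a fresh read from global memory on every iteration
        while (*((volatile unsigned int*)arrived) < expected) { }
    }
    __syncthreads();
}

int main() {
    const int device = 0;       // assumption: device 0
    const int blockSize = 128;  // illustrative block size

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // How many blocks of this kernel fit on one SM at this block size?
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, spinBarrierKernel, blockSize, 0);
    int maxResident = blocksPerSm * prop.multiProcessorCount;
    if (maxResident < 1) {
        fprintf(stderr, "Kernel cannot be resident on this device\n");
        return 1;
    }

    // Launching more than maxResident blocks would hang in the spin loop.
    int numBlocks = maxResident;
    unsigned int expected = (unsigned int)numBlocks;

    unsigned int* arrived = nullptr;
    cudaMalloc(&arrived, sizeof(unsigned int));
    cudaMemset(arrived, 0, sizeof(unsigned int));

    void* args[] = { &arrived, &expected };
    cudaError_t err = cudaLaunchKernel((const void*)spinBarrierKernel,
                                       dim3(numBlocks), dim3(blockSize),
                                       args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("blocks=%d status=%s\n", numBlocks, cudaGetErrorString(err));

    cudaFree(arrived);
    return err == cudaSuccess ? 0 : 1;
}
```

A single-block kernel avoids the issue entirely, because __syncthreads() within one block needs no cross-block residency guarantee.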

engiecat commented 6 years ago

@BrianPharris That's probably why it hangs on Windows. Thank you for the information!

engiecat commented 6 years ago

It works with the single-block variant! Thank you!

eric-haibin-lin commented 6 years ago

@BrianPharris I don't see any explicit use of thread_group/block synchronization via cooperative groups in the persistent kernel. Would using that lower the synchronization cost (compared with a spin lock in global memory)?
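For reference, a minimal sketch of the cooperative-groups grid barrier this question refers to. The kernel gridSyncKernel and its buffers are hypothetical and are not part of nv-wavenet. Whether grid.sync() is cheaper than a global-memory spin lock is not answered in this thread, and the sketch makes no claim either way. Older toolkits may need relocatable device code (-rdc=true) for grid-wide sync.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: phase 1 writes `in`, a grid-wide barrier follows,
// then phase 2 reads a value that may have been written by another block.
__global__ void gridSyncKernel(int* in, int* out, int n) {
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) in[idx] = idx;
    grid.sync();  // every thread in the grid must reach this point
    if (idx < n) out[idx] = in[(idx + 1) % n];
}

int main() {
    const int device = 0;       // assumption: device 0
    const int blockSize = 128;  // illustrative block size

    int supportsCoop = 0;
    cudaDeviceGetAttribute(&supportsCoop, cudaDevAttrCooperativeLaunch, device);
    if (!supportsCoop) {
        printf("Cooperative launch is not supported on this device/driver\n");
        return 0;
    }

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // A cooperative grid must fit on the device all at once.
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, gridSyncKernel, blockSize, 0);
    int numBlocks = blocksPerSm * prop.multiProcessorCount;
    int n = numBlocks * blockSize;

    int* in = nullptr;
    int* out = nullptr;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));

    void* args[] = { &in, &out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)gridSyncKernel, dim3(numBlocks), dim3(blockSize), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("status=%s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```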