dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.

cudnn_ops_train64.dll not found on Windows #212

Closed by dsyme 3 years ago

dsyme commented 3 years ago

We have a strange bug where LibTorch doesn't successfully link to cudnn_ops_train64.dll on Windows.

The packages we build place these DLLs in the usual runtimes\win-x64\native directory, along with all the other LibTorch DLLs.

LibTorch relies heavily on dynamic linking to select the right DLL along various axes (CUDA vs. non-CUDA, train vs. infer, and so on). In some cases these requests bubble down into other DLLs, such as:

cudnn_adv_infer64_8.dll
cudnn_adv_train64_8.dll
cudnn_cnn_infer64_8.dll
cudnn_cnn_train64_8.dll
cudnn_ops_infer64_8.dll
cudnn_ops_train64_8.dll

However, the dynamic link to these DLLs fails when they are in the runtimes\win-x64\native directory of the application. AFAIK the location of torch.dll should be enough to resolve all the dynamic linking, and it's confusing that it isn't.

One workaround seems to be to preload these DLLs into the application; the dynamic linking then finds them correctly. We already do this in the TorchSharp initialization code on Windows for some other DLLs that suffer from the same problem:

    NativeLibrary.TryLoad("nvrtc-builtins64_111", typeof(Torch).Assembly, null, out var res8);
    NativeLibrary.TryLoad("caffe2_nvrtc", typeof(Torch).Assembly, null, out var res9);
    NativeLibrary.TryLoad("nvrtc64_111_0", typeof(Torch).Assembly, null, out var res10);

I will add these new DLLs to the pre-load list. We may well hit the same problem with further DLLs in the future.
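
For illustration, here is a rough sketch of what pre-loading the cuDNN DLLs could look like. It is not the actual change (see #213 for that). The class name CudnnPreloader and its members are hypothetical. The DLL names are copied from the list above and would need updating if a future LibTorch ships differently named files. It assumes .NET Core 3.0 or later, where NativeLibrary.TryLoad is available.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.InteropServices;

// Hypothetical helper, not TorchSharp's real initialization code.
public static class CudnnPreloader
{
    // DLL names from this issue, without the ".dll" extension.
    public static readonly string[] Names =
    {
        "cudnn_adv_infer64_8",
        "cudnn_adv_train64_8",
        "cudnn_cnn_infer64_8",
        "cudnn_cnn_train64_8",
        "cudnn_ops_infer64_8",
        "cudnn_ops_train64_8",
    };

    // Tries to load each cuDNN DLL using the given assembly as the anchor
    // for native library probing. Returns the names that failed to load
    // (an empty list on non-Windows platforms, where nothing is attempted).
    public static IReadOnlyList<string> Preload(Assembly anchor)
    {
        var failed = new List<string>();
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            return failed;

        foreach (var name in Names)
        {
            // Passing null as the search path keeps the default probing behaviour.
            if (!NativeLibrary.TryLoad(name, anchor, null, out IntPtr _))
                failed.Add(name);
        }
        return failed;
    }
}
```

Calling CudnnPreloader.Preload(typeof(Torch).Assembly) would mirror the existing TryLoad calls. Returning the names that failed, rather than discarding the results, makes it easier to notice when a LibTorch update renames or adds a DLL.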

This workaround is not ideal, and it is fragile as LibTorch gets updated, but at this stage I'm not sure what else we should do, short of either dropping the packages and relying on an on-path install (yuck), or not placing native DLLs in the standard runtimes\win-x64\native directory, or something along those lines.

dsyme commented 3 years ago

Fixed for now by #213