bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

[PyTorch] torch.cuda.is_bf16_supported() is missing #1507

Closed · haifengl closed this issue 3 months ago

haifengl commented 3 months ago

I cannot find it anywhere.

HGuillemet commented 3 months ago

That is a Python-only function. From the presets, you can check for the device compute capability, or just try to create a small BF16 tensor.
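For example, a rough sketch of the second approach with the PyTorch presets (`Bf16Probe` and `bf16Works` are just hypothetical names here, and the exact factory and `to()` overloads may differ between preset versions):

```java
import org.bytedeco.pytorch.Tensor;
import org.bytedeco.pytorch.global.torch;

public class Bf16Probe {
    // Heuristic probe: run a tiny BF16 matmul on the GPU and see whether it throws.
    public static boolean bf16Works() {
        try {
            Tensor a = torch.ones(new long[]{8, 8})
                            .cuda()                          // move to the current CUDA device
                            .to(torch.ScalarType.BFloat16);  // convert to BF16
            a.matmul(a).cpu();                               // copy back to force the kernel to run
            return true;
        } catch (RuntimeException e) {  // JavaCPP surfaces C++ exceptions as RuntimeException
            return false;
        }
    }
}
```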

haifengl commented 3 months ago

Thanks! How do I check the device compute capability? torch_cuda.getDeviceProperties() returns a plain Pointer.

haifengl commented 3 months ago

Also, how do I get the CUDA runtime version, e.g. via cudaRuntimeGetVersion()? torch.C10_CUDA_VERSION_MAJOR seems to be the compile-time version.

HGuillemet commented 3 months ago

> Thanks! How do I check the device compute capability? torch_cuda.getDeviceProperties() returns a plain Pointer.

Right. That's something I'm currently working on. The next version of the PyTorch presets will depend on the CUDA presets, and this kind of function will return the proper type. In the meantime, you could use the CUDA presets directly.
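For the compute capability, a minimal sketch of what that could look like with the CUDA presets (`isAmpereOrNewer` is a hypothetical helper; whether an `int[]` overload or an `IntPointer` one is generated may depend on the preset version):

```java
import static org.bytedeco.cuda.global.cudart.*;

public class ComputeCapability {
    // Queries the compute capability major version via the CUDA runtime API.
    // BF16 matmuls are only accelerated on Ampere (compute capability 8.x) and newer.
    public static boolean isAmpereOrNewer(int device) {
        int[] major = new int[1];
        int err = cudaDeviceGetAttribute(major, cudaDevAttrComputeCapabilityMajor, device);
        if (err != cudaSuccess) {
            throw new RuntimeException("cudaDeviceGetAttribute returned error " + err);
        }
        return major[0] >= 8;
    }
}
```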

> Also, how do I get the CUDA runtime version, e.g. via cudaRuntimeGetVersion()? torch.C10_CUDA_VERSION_MAJOR seems to be the compile-time version.

I'm not sure. Maybe there is a way using the CUDA presets.

If the final objective is the one in your top post, I think your best bet is to try to create a BF16 GPU tensor and catch the exception.

haifengl commented 3 months ago

Thanks. BTW, torch.C10_CUDA_VERSION_MAJOR and torch.C10_CUDA_VERSION are always 0, which is not correct.

Just creating a BF16 tensor is not a sufficient check. On pre-Ampere hardware BF16 works, but it provides no speed-up over FP32 matmul, and some matmul operations fail outright. So I would like to check the CUDA version and the device compute capability.

saudet commented 3 months ago

I think those are just wrappers for CUDA functions anyway, so I'd try to use them directly:
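(A sketch using the CUDA presets; the `int[]` overloads are assumed and may vary between preset versions.)

```java
import static org.bytedeco.cuda.global.cudart.*;

public class CudaInfo {
    public static void main(String[] args) {
        int[] version = new int[1];

        // Runtime version, encoded as 1000 * major + 10 * minor (e.g. 12040 for CUDA 12.4)
        cudaRuntimeGetVersion(version);
        System.out.println("CUDA runtime: " + version[0]);

        // Latest version supported by the installed driver, same encoding
        cudaDriverGetVersion(version);
        System.out.println("CUDA driver:  " + version[0]);
    }
}
```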

haifengl commented 3 months ago

Thanks!

haifengl commented 3 months ago

Although these methods work fine on a single-GPU box, they hang on a multi-GPU box.