NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Cannot compile/build cuda_ext on H100 #1778

Open GuanhuaWang opened 4 months ago

GuanhuaWang commented 4 months ago

Describe the Bug

When installing on HGX-H100 nodes, pip install does not build the CUDA extensions (amp_C, etc.).

Minimal Steps/Code to Reproduce the Bug

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

I also tried adding -e (editable install); it did not help.

Expected Behavior

The CUDA extensions compile and build successfully.

Environment

CUDA 12.2, torch 2.2.1

My temporary fix

As a temporary workaround, I commented out the call to check_cuda_torch_binary_vs_bare_metal in setup.py, which forces the CUDA extensions to build.
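
For context, here is a minimal sketch of the kind of comparison a check like check_cuda_torch_binary_vs_bare_metal appears to perform: it reads the toolkit version reported by `nvcc -V` and compares it with the CUDA version the torch binary was built against (`torch.version.cuda`). This is not the code from apex's setup.py. The helper names below (parse_nvcc_release, parse_torch_cuda, cuda_versions_match, report_local_versions) are hypothetical, and the assumption that the check compares both major and minor version is mine.

```python
import re
from typing import Tuple


def parse_nvcc_release(nvcc_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the text printed by `nvcc -V`.

    Hypothetical helper: looks for the 'release X.Y' fragment.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(torch_cuda: str) -> Tuple[int, int]:
    """Turn a string such as '12.1' (the format of torch.version.cuda) into (12, 1)."""
    major, minor = torch_cuda.split(".")[:2]
    return int(major), int(minor)


def cuda_versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """Assumed rule: system toolkit and torch build must agree on major.minor."""
    return parse_nvcc_release(nvcc_output) == parse_torch_cuda(torch_cuda)


def report_local_versions() -> None:
    """Print both versions on the current machine (needs nvcc on PATH and torch installed)."""
    import subprocess

    import torch

    nvcc_output = subprocess.check_output(["nvcc", "-V"], universal_newlines=True)
    torch_cuda = torch.version.cuda  # None for CPU-only torch builds
    print("nvcc release:", parse_nvcc_release(nvcc_output))
    print("torch built with CUDA:", torch_cuda)
    if torch_cuda is not None:
        print("match:", cuda_versions_match(nvcc_output, torch_cuda))
```

Calling report_local_versions() on the affected node shows whether the system CUDA toolkit and the torch build disagree. If they do, installing a torch build that matches the local toolkit (or pointing CUDA_HOME at a matching toolkit) may be a safer route than disabling the check, since skipping it builds the extensions against a toolkit that torch was not compiled with.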