🐛 Bug

Trying to follow along with:

https://github.com/facebookresearch/xformers#testing-the-installation

/tmp/tmpzfde7mdr/main.c:2:10: fatal error: cuda.h: No such file or directory
 #include "cuda.h"
          ^~~~~~~~
compilation terminated.
  0%|                                                    | 0/28 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 21, in layer_norm_fw
KeyError: ('2-.-0-.-0--7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'fp32'), (True, 256), (True, True, True, True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:
..snip..
  AttributeError: module 'triton' has no attribute 'code_gen'

Command

To Reproduce

Steps to reproduce the behavior:

!conda run -n dreambooth --live-stream python3 xformers/benchmarks/benchmark_encoder.py --activations relu --plot -emb 256 -bs 32 -heads 16

Expected behavior

The benchmark would run successfully.

Environment

⇒ python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.24.3
Libc version: glibc-2.27

Python version: 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] functorch==1.13.0
[pip3] mypy==0.812
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.4
[pip3] pytorch-lightning==1.8.0.post1
[pip3] torch==1.13.0
[pip3] torchmetrics==0.10.2
[pip3] torchvision==0.14.0
[conda] cudatoolkit               11.3.1               h2bc3f7f_2
[conda] functorch                 1.13.0                   pypi_0    pypi
[conda] numpy                     1.23.4                   pypi_0    pypi
[conda] pytorch-lightning         1.8.0.post1              pypi_0    pypi
[conda] torch                     1.13.0                   pypi_0    pypi
[conda] torchmetrics              0.10.2                   pypi_0    pypi
[conda] torchvision               0.14.0                   pypi_0    pypi

Additional context

Testing the following parameters: 
 {
    "activation": [
        "relu"
    ],
    "attention_name": [
        "favor",
        "blocksparse",
        "global",
        "linformer",
        "local",
        "nystrom",
        "orthoformer",
        "random",
        "scaled_dot_product",
        "compositional",
        "fourier_mix",
        "lambda",
        "pooling",
        "visual"
    ],
    "autocast": [
        true
    ],
    "batch_size": [
        32
    ],
    "causal": [
        false
    ],
    "embed_dim": [
        256
    ],
    "feedforward_name": [
        "MLP"
    ],
    "heads": [
        16
    ],
    "sequence_length": [
        576,
        1024
    ]
}
  0%|                                                    | 0/28 [00:00<?, ?it/s]Testing: xFormerEncoderBlock(
  (pose_encoding): SinePositionalEmbedding()
  (wrap_att): PostNorm(
    (norm): FusedLayerNorm()
    (sublayer): Residual(
      (layer): MultiHeadDispatch(
        (attention): FavorAttention(
          (attn_drop): Dropout(p=0.1, inplace=True)
          (feature_map): SMReg()
        )
        (in_proj_container): InputProjection(
          (q_proj): Linear(in_features=256, out_features=256, bias=True)
          (k_proj): Linear(in_features=256, out_features=256, bias=True)
          (v_proj): Linear(in_features=256, out_features=256, bias=True)
        )
        (resid_drop): Dropout(p=0.1, inplace=False)
        (proj): Linear(in_features=256, out_features=256, bias=True)
      )
    )
  )
  (wrap_ff): PostNorm(
    (norm): FusedLayerNorm()
    (sublayer): Residual(
      (layer): MLP(
        (mlp): Sequential(
          (0): Linear(in_features=256, out_features=1024, bias=True)
          (1): ReLU()
          (2): Dropout(p=0.1, inplace=False)
          (3): Linear(in_features=1024, out_features=256, bias=True)
          (4): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
) 32 576 256 True cuda favor
/tmp/tmpzfde7mdr/main.c:2:10: fatal error: cuda.h: No such file or directory
 #include "cuda.h"
          ^~~~~~~~
compilation terminated.
  0%|                                                    | 0/28 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 21, in layer_norm_fw
KeyError: ('2-.-0-.-0--7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'fp32'), (True, 256), (True, True, True, True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/triton/layer_norm.py", line 223, in layer_norm
    return _LayerNorm.apply(x, weight, bias, eps)
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/triton/layer_norm.py", line 73, in forward
    layer_norm_fw[(M,)](
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 41, in layer_norm_fw
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/triton/compiler.py", line 1239, in compile
    so = _build(fn.__name__, src_path, tmpdir)
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/triton/compiler.py", line 1169, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/opt/conda/envs/dreambooth/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpzfde7mdr/main.c', '-O3', '-I/usr/local/cuda/include', '-I/opt/conda/envs/dreambooth/include/python3.10', '-I/tmp/tmpzfde7mdr', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpzfde7mdr/layer_norm_fw.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/benchmarks/benchmark_encoder.py", line 379, in <module>
    outputs = test_xformer_encoder_block(**constants, **params)  # type: ignore
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/benchmarks/benchmark_encoder.py", line 181, in test_xformer_encoder_block
    return benchmark_model(
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/benchmarks/benchmark_encoder.py", line 133, in benchmark_model
    _train_for_several_steps(num_steps=num_warmup, **warm_up_args)
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/benchmarks/benchmark_encoder.py", line 99, in _train_for_several_steps
    output = block(inputs)
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/factory/block_factory.py", line 231, in forward
    x = self.wrap_att(inputs=[q, k, v], att_mask=att_mask)
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/components/residual.py", line 165, in forward
    return self.norm(x)
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/triton/layer_norm.py", line 193, in forward
    return layer_norm(x, self.weight, self.bias, self.epsilon)
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/triton/layer_norm.py", line 224, in layer_norm
    except (triton.code_gen.OutOfResources, RuntimeError) as e:
AttributeError: module 'triton' has no attribute 'code_gen'
ERROR conda.cli.main_run:execute(49): `conda run python3 xformers/benchmarks/benchmark_encoder.py --activations relu --plot -emb 256 -bs 32 -heads 16` failed. (See above for error)

⇒ find / -name cuda.h
/opt/conda/envs/dreambooth/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda.h
/opt/conda/envs/dreambooth/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/cuda.h
/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/cuda.h
/opt/conda/lib/python3.7/site-packages/nvidia/cuda_runtime/include/cuda.h
/opt/conda/pkgs/pytorch-1.12.0-py3.7_cuda11.3_cudnn8.3.2_0/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/cuda.h
find: '/proc/tty/driver': Permission denied
/usr/include/linux/cuda.h

One step closer it seems!

⇒ conda install -c nvidia cuda-libraries-dev

⇒ find / -name cuda.h
 /opt/conda/envs/dreambooth/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda.h
 /opt/conda/envs/dreambooth/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/cuda.h
+ /opt/conda/envs/dreambooth/include/cuda.h
 /opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/cuda.h
 /opt/conda/lib/python3.7/site-packages/nvidia/cuda_runtime/include/cuda.h
 /opt/conda/pkgs/pytorch-1.12.0-py3.7_cuda11.3_cudnn8.3.2_0/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/cuda.h
 /opt/conda/pkgs/cuda-cudart-dev-11.8.89-0/include/cuda.h
 find: '/proc/tty/driver': Permission denied
 /usr/include/linux/cuda.h

Yet still getting the error: fatal error: cuda.h: No such file or directory

I noticed that the call to gcc doesn't pass this include path in its -I flags. I haven't dug deeper into the relevant code to figure out why, or how to change that:

subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpzfde7mdr/main.c', '-O3', '-I/usr/local/cuda/include', '-I/opt/conda/envs/dreambooth/include/python3.10', '-I/tmp/tmpzfde7mdr', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpzfde7mdr/layer_norm_fw.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.

https://github.com/openai/triton/blob/master/python/triton/compiler.py#L1233-L1241
- generate_launcher seems to be generating the code that has the #include "cuda.h" line that's erroring, then calls the _build function.

https://github.com/openai/triton/blob/master/python/triton/compiler.py#L1154
- cuda_lib_dirs = libcuda_dirs()
- cu_include_dir = os.path.join(cuda_home_dirs(), "include")
- py_include_dir = get_paths()["include"]
- cc_cmd = [cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}", f"-I{srcdir}", "-shared", "-fPIC", "-lcuda", "-o", so]
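To make the effect of those lines concrete, here is a minimal sketch that rebuilds a command of the same shape. It is not Triton's code: `build_cc_cmd` is a hypothetical helper, it inlines the `CUDA_HOME` lookup from `cuda_home_dirs()`, and it leaves out the `-L` flags that come from `libcuda_dirs()`.

```python
import os
from sysconfig import get_paths


def build_cc_cmd(cc, src, srcdir, so, environ=None):
    """Assemble a compiler command shaped like the one in the traceback.

    Hypothetical helper for illustration only; the -L flags derived from
    libcuda_dirs() are intentionally omitted.
    """
    environ = os.environ if environ is None else environ
    # Same default as the cuda_home_dirs() lines quoted in this issue.
    cuda_home = environ.get("CUDA_HOME", "/usr/local/cuda")
    cu_include_dir = os.path.join(cuda_home, "include")
    py_include_dir = get_paths()["include"]
    return [cc, src, "-O3", f"-I{cu_include_dir}", f"-I{py_include_dir}",
            f"-I{srcdir}", "-shared", "-fPIC", "-lcuda", "-o", so]


# With CUDA_HOME unset, the CUDA include flag falls back to /usr/local/cuda.
cmd_default = build_cc_cmd("gcc", "main.c", "/tmp/x", "out.so", environ={})
# With CUDA_HOME pointing at the conda env, the flag follows it.
cmd_conda = build_cc_cmd(
    "gcc", "main.c", "/tmp/x", "out.so",
    environ={"CUDA_HOME": "/opt/conda/envs/dreambooth"},
)
print(cmd_default[3])
print(cmd_conda[3])
```

The only CUDA-related include flag comes from `CUDA_HOME`, so a `cuda.h` installed anywhere else is never seen by gcc unless that variable points at its parent directory.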

https://github.com/openai/triton/blob/master/python/triton/compiler.py#L1133-L1135
- def libcuda_dirs()
- locs = subprocess.check_output(["whereis", "libcuda.so"])

⇒ whereis libcuda.so
libcuda: /usr/lib/x86_64-linux-gnu/libcuda.so

⇒  find / -name libcuda.so
/opt/conda/envs/dreambooth/lib/stubs/libcuda.so
/opt/conda/pkgs/cuda-driver-dev-11.8.89-0/lib/stubs/libcuda.so
find: '/proc/tty/driver': Permission denied
/usr/lib/x86_64-linux-gnu/libcuda.so

https://github.com/openai/triton/blob/master/python/triton/compiler.py#L1139-L1141
- def cuda_home_dirs()
- default_dir = "/usr/local/cuda"
- return os.getenv("CUDA_HOME", default=default_dir)
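A quick way to check which root would actually work is to test each candidate for `include/cuda.h`. This is a rough diagnostic sketch, not part of Triton or xformers: `find_cuda_header` and `cuda_home` are hypothetical helper names, and using `CONDA_PREFIX` as a second candidate is my assumption about where conda puts the header.

```python
import os


def cuda_home(environ=None):
    """Mirror the lookup quoted above: CUDA_HOME, else /usr/local/cuda."""
    environ = os.environ if environ is None else environ
    return environ.get("CUDA_HOME", "/usr/local/cuda")


def find_cuda_header(candidates):
    """Return the first root that contains include/cuda.h, or None."""
    for root in candidates:
        if root and os.path.isfile(os.path.join(root, "include", "cuda.h")):
            return root
    return None


# Candidate roots: what Triton would use, then the active conda env (if any).
roots = [cuda_home(), os.environ.get("CONDA_PREFIX", "")]
found = find_cuda_header(roots)
if found is None:
    print("cuda.h not found under any of:", roots)
else:
    print("cuda.h found under:", found, "(candidate value for CUDA_HOME)")
```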

⇒  ls -la /usr/local/cuda
lrwxrwxrwx 1 root root 17 Nov 10 06:36 /usr/local/cuda -> /tmp/tmpgyc5dwz3/

⇒  echo $CUDA_HOME

https://docs.python.org/3/library/sysconfig.html
- The sysconfig module provides access to Python’s configuration information like the list of installation paths and the configuration variables relevant for the current platform.
- https://docs.python.org/3/library/sysconfig.html#sysconfig.get_paths

⇒  python -c "from sysconfig import get_paths; print(get_paths()['include'])"
/opt/conda/envs/dreambooth/include/python3.10

Setting CUDA_HOME let it get a bit further, but it then ran into a different error:

!CUDA_HOME=/opt/conda/envs/dreambooth conda run -n dreambooth --live-stream python3 xformers/benchmarks/benchmark_encoder.py --activations relu  --plot -emb 256 -bs 32 -heads 16

Traceback (most recent call last):
  File "<string>", line 21, in layer_norm_fw
KeyError: ('2-.-0-.-0--7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'fp32'), (True, 256), (True, True, True, True, True, True, (True, False), (True, False), (False,)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/triton/layer_norm.py", line 223, in layer_norm
    return _LayerNorm.apply(x, weight, bias, eps)
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 97, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/workspace/thelastben-diffusers/examples/dreambooth/xformers/xformers/triton/layer_norm.py", line 73, in forward
    layer_norm_fw[(M,)](
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 41, in layer_norm_fw
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/triton/compiler.py", line 1256, in compile
    asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
  File "/opt/conda/envs/dreambooth/lib/python3.10/site-packages/triton/compiler.py", line 901, in _compile
    name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: `ptxas` was searched in TRITON_PTXAS_PATH, /usr/local/cuda/bin/ or PATH but a working version could not be found.
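Going only by the wording of that message, the search order appears to be `TRITON_PTXAS_PATH`, then `/usr/local/cuda/bin/`, then `PATH`. Below is a small sketch to see what such a search would find on a given machine. It is a guess at the behaviour, not Triton's implementation: `find_ptxas` is a hypothetical name, I am assuming `TRITON_PTXAS_PATH` holds the full path to the binary, and it only checks that the file is executable, not that it is a "working version".

```python
import os
import shutil


def find_ptxas(environ=None):
    """Look for ptxas in the three places named by the error message."""
    environ = os.environ if environ is None else environ
    candidates = []
    override = environ.get("TRITON_PTXAS_PATH")  # assumed: full path to binary
    if override:
        candidates.append(override)
    candidates.append("/usr/local/cuda/bin/ptxas")
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Fall back to a PATH search; returns None when nothing is found.
    return shutil.which("ptxas", path=environ.get("PATH", ""))


print("ptxas:", find_ptxas())
```

If this prints `None`, installing a package that ships `ptxas` or pointing `TRITON_PTXAS_PATH` at an existing binary would be the next thing to try.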

Which may be related to this:

The "triton has no code_gen attribute" error is unrelated; it is tied to a recent triton update, sorry about that. Fixed in #528
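For context on why that `AttributeError` masks the real failure: the `except (triton.code_gen.OutOfResources, RuntimeError)` clause is evaluated while the original exception is being handled, so a missing attribute raises a new error on top of it. One generic way to make such a clause tolerant of moved attributes is sketched below. This is not the change made in #528, which I have not looked at; `resolve_exception` is a hypothetical helper, and the candidate paths are placeholders, not confirmed Triton locations.

```python
import types


def resolve_exception(module, dotted_candidates, fallback=RuntimeError):
    """Return the first exception class found at any dotted path on `module`.

    Falls back to `fallback` when none of the paths resolve, so the result
    is always safe to use in an `except` clause.
    """
    for dotted in dotted_candidates:
        obj = module
        try:
            for part in dotted.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue
        if isinstance(obj, type) and issubclass(obj, BaseException):
            return obj
    return fallback


# Stand-in modules for demonstration; no real triton import is needed.
class FakeOutOfResources(Exception):
    pass


old_layout = types.SimpleNamespace(
    code_gen=types.SimpleNamespace(OutOfResources=FakeOutOfResources)
)
no_layout = types.SimpleNamespace()

paths = ["code_gen.OutOfResources", "OutOfResources"]  # placeholder paths
print(resolve_exception(old_layout, paths).__name__)
print(resolve_exception(no_layout, paths).__name__)
```

Resolving the class once at import time and using the result in the `except` tuple would keep the original compile error visible instead of replacing it with an `AttributeError`.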
