facebookresearch / hanabi_SAD

Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

CUDA error: no kernel image is available for execution on the device when running sad_2player.sh #10

Open rocanaan opened 4 years ago

rocanaan commented 4 years ago

Hello,

I haven't been able to train the example provided by sad_2player.sh. I am getting the attached output:

SAD output.pdf

With the most relevant part seen below:

warming up replay buffer: 0
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA error: no kernel image is available for execution on the device
The above operation failed in interpreter, with the following stack trace:
at /home/jupyter/anaconda3/envs/HanabiSAD/lib/python3.7/site-packages/torch/nn/functional.py:1374:12
        - Bias: :math:`(out\_features)`
        - Output: :math:`(N, *, out\_features)`
    """
    if input.dim() == 2 and bias is not None:
        # fused op is marginally faster
        ret = torch.addmm(bias, input, weight.t())
    else:
        output = input.matmul(weight.t())
        if bias is not None:
            output += bias
            ~~~~~~~~~~~~~~ <--- HERE
        ret = output
    return ret

Aborted (core dumped)

It seems to be a CUDA compatibility problem. My cuda:0 and cuda:1 devices are a pair of GeForce GTX 1080 Ti cards, but cuda:2 is a GeForce GTX TITAN X. I am able to run dev.sh (although slowly; see #9), but judging by its parameters it only uses cuda:0 and cuda:1. Is this a compatibility issue with the Titan X? If so, are there any workarounds?

Thank you!

hengyuan-hu commented 4 years ago

Did you compile PyTorch from source, as indicated in the README? If so, you may need to change TORCH_CUDA_ARCH_LIST="6.0;7.0" to also include the compute capability of the Titan X. I don't know what that number is for your card, but you should definitely include it if it is not 6.0 or 7.0.
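
One way to find the missing number is to ask PyTorch for the compute capability of every visible device and build the list from that. The snippet below is only a sketch: the helper names `arch_list` and `detect_capabilities` are made up for illustration, and it assumes the existing PyTorch build can still be imported and can enumerate the GPUs. As far as I know, the GTX 1080 Ti reports 6.1 and the Maxwell-generation GTX TITAN X reports 5.2, which would explain why a build limited to "6.0;7.0" has no kernel image for cuda:2, but please check the values printed on your own machine.

```python
def arch_list(capabilities):
    """Build a TORCH_CUDA_ARCH_LIST value from (major, minor) pairs.

    Duplicates are dropped and entries are sorted, so two identical
    cards contribute a single entry.
    """
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def detect_capabilities():
    """Return the compute capability of every visible CUDA device.

    Requires a PyTorch install that can see the GPUs. The import is
    done lazily so that arch_list() can be used without PyTorch.
    """
    import torch

    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


# Usage on the training machine (hypothetical session):
#   print(arch_list(detect_capabilities()))
# Then export the printed value as TORCH_CUDA_ARCH_LIST before
# rebuilding PyTorch from source.
```

The arch list is only read at build time, so after changing it PyTorch has to be recompiled for the new value to take effect.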