CUDA kernel error for BaguaStrategy with algorithm="async"

awaelchli commented 2 years ago

🐛 Bug

To Reproduce

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.strategies import BaguaStrategy

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="gpu",
        devices=2,
        strategy=BaguaStrategy(algorithm="async")
    )
    trainer.fit(model, train_dataloaders=train_data)

if __name__ == "__main__":
    run()

algorithm="gradient_all_reduce": no error

algorithm="decentralized": no error

algorithm="async":

Failed: Cuda error kernels/bagua_kernels.cu:628 'no kernel image is available for execution on the device'

algorithm="bytegrad":

Failed: Cuda error kernels/bagua_kernels.cu:285 'invalid device function'

algorithm="low_precision_decentralized":

Failed: Cuda error kernels/bagua_kernels.cu:597 'no kernel image is available for execution on the device'

Expected behavior

No error.

Environment

* CUDA:
        - GPU:
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
                - NVIDIA GeForce RTX 3090
        - available:         True
        - version:           11.3
* Packages:
        - numpy:             1.21.2
        - pyTorch_debug:     False
        - pyTorch_version:   1.11.0
        - pytorch-lightning: 1.7.0dev
        - tqdm:              4.62.3
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.7
        - version:           #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020

Additional context

Installed bagua-cuda111

nvcc --version 

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

cc @awaelchli @wangraying @akihironitta

redleaf-kim commented 2 years ago

Any progress?

I encountered the same issue with bytegrad: Failed: Cuda error kernels/bagua_kernels.cu:285 'invalid device function'

When I use a single GPU it works fine; the issue appears only with multiple GPUs.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

quancs commented 2 years ago

Hello, I had the same problem with bagua-cuda113 in async mode. The error reported is AttributeError: 'BoringModel' object has no attribute 'bagua_algorithm'.

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

def run():
    from pytorch_lightning.strategies.bagua import BaguaStrategy

    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        strategy=BaguaStrategy(algorithm='async'),
        gpus='3,7',
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)

if __name__ == "__main__":
    run()

And my environment:

CUDA:
- GPU:
  - NVIDIA A100-SXM4-80GB
  - NVIDIA A100-SXM4-80GB
  - NVIDIA A100-SXM4-80GB
  - NVIDIA A100-SXM4-80GB
  - NVIDIA A100-SXM4-80GB
  - NVIDIA A100-SXM4-80GB
  - NVIDIA A100-SXM4-80GB
  - NVIDIA A100-SXM4-80GB
- available: True
- version: 11.6
Packages:
- numpy: 1.22.4
- pyTorch_debug: False
- pyTorch_version: 1.12.0+cu116
- pytorch-lightning: 1.6.4
- tqdm: 4.64.0
System:
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.9.7
- version: #43-Ubuntu SMP Wed Jun 15 12:54:21 UTC 2022

wangraying commented 2 years ago

The errors for async, low_precision_decentralized, and bytegrad (for example, Failed: Cuda error kernels/bagua_kernels.cu:628 'no kernel image is available for execution on the device') occur because these algorithms use custom CUDA kernels. The CUDA version on the working node must match the one the pre-compiled Bagua package was built with.

In the report above, the CUDA version on the working node is 11.3, while bagua-cuda111 is compiled with CUDA 11.1. You can try bagua-cuda113 instead.
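
A minimal sketch of that consistency check, assuming the pre-compiled packages follow the bagua-cuda<major><minor> naming seen in this thread (bagua-cuda111, bagua-cuda113). The helper names are hypothetical and not part of Bagua or Lightning:

```python
def bagua_package_for(cuda_version: str) -> str:
    """Map a CUDA version such as "11.3" to the assumed package name."""
    major, minor = cuda_version.split(".")[:2]
    return f"bagua-cuda{major}{minor}"


def package_matches(installed_package: str, cuda_version: str) -> bool:
    """True if the installed package name corresponds to the CUDA version."""
    return installed_package == bagua_package_for(cuda_version)


if __name__ == "__main__":
    try:
        import torch

        # torch.version.cuda is None on CPU-only builds of PyTorch.
        cuda_version = torch.version.cuda
    except ImportError:
        cuda_version = None

    if cuda_version is None:
        print("No CUDA-enabled PyTorch build found")
    else:
        print("PyTorch CUDA version:", cuda_version)
        print("Assumed matching package:", bagua_package_for(cuda_version))
```

For the environment in the original report, package_matches("bagua-cuda111", "11.3") is False, which is the mismatch described above.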

Also note that all Bagua pre-compiled packages are currently compiled on a 2080 Ti GPU, so I'm not sure they will work well on a 3090.
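
The "no kernel image is available" error typically means the binary contains no code for the GPU's compute capability. The sketch below illustrates the general CUDA rule for native binaries: code built for capability X.y runs on devices of the same major version X with minor version at least y. It assumes the package ships only native code for the build GPU and no PTX, which this thread does not confirm. An RTX 2080 Ti has compute capability 7.5 and an RTX 3090 has 8.6. The function name is hypothetical:

```python
from typing import Tuple

Capability = Tuple[int, int]


def cubin_runs_on(built_for: Capability, device: Capability) -> bool:
    """Native-binary rule only: same major version, device minor >= build minor.

    Ignores PTX, which the driver can JIT-compile for newer architectures.
    """
    return built_for[0] == device[0] and device[1] >= built_for[1]


if __name__ == "__main__":
    try:
        import torch

        if torch.cuda.is_available():
            for index in range(torch.cuda.device_count()):
                name = torch.cuda.get_device_name(index)
                capability = torch.cuda.get_device_capability(index)
                # (7, 5) is the RTX 2080 Ti, the assumed build target.
                ok = cubin_runs_on((7, 5), capability)
                print(index, name, capability, "compatible:", ok)
        else:
            print("No CUDA device visible")
    except ImportError:
        print("PyTorch is not installed")
```

Under that assumption, a binary built only for 7.5 would not load on an 8.6 device, which would be consistent with the reported error on the 3090.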

wangraying commented 2 years ago

@quancs It seems you are using CUDA 11.6 both on your working node and in PyTorch, while bagua-cuda113 is compiled with CUDA 11.3. We currently do not provide pre-compiled packages for CUDA 11.6.

You can install Bagua manually by following the tutorials here.

quancs commented 2 years ago

@wangraying Thank you for your advice ^_^. I failed to build Bagua on my local machine (Ubuntu 22.04), but it builds fine in Docker. My Dockerfile is posted below for anyone who needs it.

FROM pytorch/pytorch:1.12.0-cuda11.3-cudnn8-devel
RUN apt update && apt install gcc curl -y
# the requirements of my project
RUN pip install jsonargparse[signatures,urls] pesq torchmetrics[audio] omegaconf pytorch-lightning rich soundfile pandas torchdata mypy yapf

# config bagua
# if githubusercontent is unavailable (e.g. China), download it first. Then copy it to the dockerfile folder
# COPY ./install.sh /root
# RUN bash /root/install.sh
RUN curl -Ls https://raw.githubusercontent.com/BaguaSys/bagua/master/install.sh | bash
RUN python -c "import bagua_core;bagua_core.install_deps()"
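
To confirm which Bagua distribution ended up installed (for example inside the container built from the Dockerfile above), a small standard-library sketch can list every installed distribution whose name starts with "bagua". The function name is hypothetical, and the prefix match is an assumption based on the package names mentioned in this thread:

```python
from importlib import metadata
from typing import List, Tuple


def installed_bagua_distributions() -> List[Tuple[str, str]]:
    """Return (name, version) pairs for installed distributions named bagua*."""
    found = []
    for dist in metadata.distributions():
        # Name can be missing for broken installs, so fall back to "".
        name = dist.metadata["Name"] or ""
        if name.lower().startswith("bagua"):
            found.append((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    distributions = installed_bagua_distributions()
    if not distributions:
        print("No Bagua distribution found")
    for name, version in distributions:
        print(name, version)
```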

carmocca commented 2 years ago

Closing, as it's not something we can fix. Either the Bagua folks get access to this hardware or users need to compile manually as described above.

woqidaideshi commented 2 years ago

@awaelchli @quancs I just tested the Lightning example (examples/pl_basics/autoencoder.py) using the latest Bagua version (bagua-cuda116) on an RTX 3090 GPU. It runs successfully with BaguaStrategy using the gradient_allreduce/decentralized/async/bytegrad/low_precision_decentralized algorithms.

My environment:

System

OS: Ubuntu 20.04.3 LTS (5.11.0-41)
architecture: ELF 64-bit
processor: x86_64
python: 3.8.10
Rust: 1.64.0

CUDA

GPU: NVIDIA GeForce RTX 3090
version: CUDA 11.6

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

Packages

numpy                   1.23.3
torch                   1.12.1+cu116
torchvision             0.13.1+cu116
tqdm                    4.64.1
pytorch-lightning       1.7.7

Q1. Failed: Cuda error kernels/bagua_kernels.cu:628 'no kernel image is available for execution on the device'

I think you may need to install the appropriate version of torch and torchvision.

Q2. Failed: Cuda error kernels/bagua_kernels.cu:285 'invalid device function'

I think you may need to upgrade Bagua to the latest version.
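
For Q1, one quick sanity check is whether torch and torchvision were built against the same CUDA version. The sketch below parses the "+cuXXX" local version suffix that appears in the package list above (for example 1.12.1+cu116). It assumes the wheels carry such a suffix, which is not the case for every build, and the helper names are hypothetical:

```python
from typing import Optional


def cuda_tag(version: str) -> Optional[str]:
    """Extract the CUDA tag from a version like "1.12.1+cu116", else None."""
    _, separator, local = version.partition("+")
    if separator and local.startswith("cu"):
        return local
    return None


def same_cuda_build(torch_version: str, torchvision_version: str) -> bool:
    """True only if both versions carry the same CUDA tag."""
    tag = cuda_tag(torch_version)
    return tag is not None and tag == cuda_tag(torchvision_version)


if __name__ == "__main__":
    # Version strings taken from the environment listed above.
    print(same_cuda_build("1.12.1+cu116", "0.13.1+cu116"))
```

In a live environment the inputs would come from torch.__version__ and torchvision.__version__.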

woqidaideshi commented 2 years ago

@awaelchli @quancs I think this problem may actually be caused by the precompiled Bagua package installed via pip3 install bagua-cuda11X.

To solve it, download the Bagua source code, then compile and install it locally. You can then run PyTorch Lightning successfully with BaguaStrategy using the gradient_allreduce/decentralized/async/bytegrad/low_precision_decentralized algorithms.

carmocca commented 2 years ago

You are correct @woqidaideshi! Thank you

Lightning-AI / pytorch-lightning