Closed awaelchli closed 2 years ago
Any progress?
I encountered the same issue with bytegrad:
Failed: Cuda error kernels/bagua_kernels.cu:285 'invalid device function'
With a single GPU it works fine; the issue appears only with multiple GPUs.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Hello, I had the same problem with bagua-cuda113 in async mode. The error reported is AttributeError: 'BoringModel' object has no attribute 'bagua_algorithm'.
import os

from jsonargparse import lazy_instance
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer, LightningDataModule
from pytorch_lightning.utilities.cli import LightningCLI


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    from pytorch_lightning.strategies.bagua import BaguaStrategy

    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        strategy=BaguaStrategy(algorithm='async'),
        gpus='3,7',
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()
And my environment:
The error Failed: Cuda error kernels/bagua_kernels.cu:628 'no kernel image is available for execution on the device' occurs for async, low_precision_decentralized, and bytegrad because these algorithms use custom CUDA kernels. The CUDA version on the worker node should match the one the bagua pre-compiled package was built with.
In the errors above, the CUDA version on the worker node is 11.3, while bagua-cuda111 is compiled with CUDA 11.1. Try bagua-cuda113 instead.
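To make the version-matching advice concrete, here is a minimal sketch that derives the expected package name from a CUDA version string. The helper names are hypothetical, and the bagua-cuda<major><minor> pattern is inferred only from the names mentioned in this thread (bagua-cuda111, bagua-cuda113, bagua-cuda116); it does not check that such a package is actually published.

```python
import re


def bagua_package_for(cuda_version: str) -> str:
    """Map a CUDA version such as '11.3' to a name such as 'bagua-cuda113'.

    Hypothetical helper: the naming pattern is an assumption based on the
    package names quoted in this thread.
    """
    match = re.match(r"^(\d+)\.(\d+)", cuda_version)
    if match is None:
        raise ValueError(f"unrecognized CUDA version: {cuda_version!r}")
    major, minor = match.groups()
    return f"bagua-cuda{major}{minor}"


def matches_installed(cuda_version: str, installed_package: str) -> bool:
    """Return True if the installed package name fits the node's CUDA version."""
    return bagua_package_for(cuda_version) == installed_package


# The mismatch described above: CUDA 11.3 on the node, bagua-cuda111 installed.
print(bagua_package_for("11.3"))
print(matches_installed("11.3", "bagua-cuda111"))
```

On a machine with PyTorch installed, torch.version.cuda gives the CUDA version PyTorch was built against, which may differ from the toolkit reported by nvcc --version, so it is worth checking both.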
Also note that all bagua pre-compiled packages are currently built on a 2080 Ti GPU, so I'm not sure they support the 3090 well.
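The GPU-generation concern can be illustrated with a small sketch of CUDA's binary compatibility rule: machine code built for compute capability X.y runs on devices X.z with z >= y, but not across major versions. The function names are hypothetical, and the sketch assumes the package ships only machine code for the build GPU's architecture, with no PTX for just-in-time compilation; the thread does not confirm what the bagua packages actually contain.

```python
from typing import Iterable, Tuple

Capability = Tuple[int, int]  # (major, minor) compute capability

RTX_2080_TI: Capability = (7, 5)
RTX_3090: Capability = (8, 6)


def cubin_runs_on(compiled_for: Capability, device: Capability) -> bool:
    """Binary compatibility: same major version, device minor >= compiled minor."""
    return compiled_for[0] == device[0] and device[1] >= compiled_for[1]


def has_kernel_image(compiled_archs: Iterable[Capability], device: Capability) -> bool:
    """True if at least one compiled architecture can execute on the device.

    Assumption: only machine code is embedded (no PTX fallback).
    """
    return any(cubin_runs_on(arch, device) for arch in compiled_archs)


# A build targeting only the 2080 Ti has no usable image for a 3090.
print(has_kernel_image([RTX_2080_TI], RTX_3090))
```

With PyTorch, torch.cuda.get_device_capability() returns the (major, minor) pair for the current device and can be passed as the device argument.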
@wangraying Thank you for your advice ^_^. I failed to build bagua on my local machine (Ubuntu 22.04), but it works in Docker.
My Dockerfile is posted below (for anyone who needs it).
FROM pytorch/pytorch:1.12.0-cuda11.3-cudnn8-devel
RUN apt update && apt install gcc curl -y
# the requirements of my project
RUN pip install jsonargparse[signatures,urls] pesq torchmetrics[audio] omegaconf pytorch-lightning rich soundfile pandas torchdata mypy yapf
# config bagua
# if githubusercontent is unavailable (e.g. China), download it first. Then copy it to the dockerfile folder
# COPY ./install.sh /root
# RUN bash /root/install.sh
RUN curl -Ls https://raw.githubusercontent.com/BaguaSys/bagua/master/install.sh | bash
RUN python -c "import bagua_core;bagua_core.install_deps()"
Closing, as it's not something we can fix. Either the Bagua folks get access to this hardware or users need to compile manually as described above.
@awaelchli @quancs I just tested the Lightning example (examples/pl_basics/autoencoder.py) using the latest Bagua version (bagua-cuda116) on an RTX 3090 GPU. It runs successfully with BaguaStrategy using the gradient_allreduce/decentralized/async/bytegrad/low_precision_decentralized algorithms.
System
OS: Ubuntu 20.04.3 LTS (5.11.0-41)
architecture: ELF 64-bit
processor: x86_64
python: 3.8.10
Rust: 1.64.0
CUDA
GPU: NVIDIA GeForce RTX 3090
version: CUDA 11.6
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
Packages
numpy 1.23.3
torch 1.12.1+cu116
torchvision 0.13.1+cu116
tqdm 4.64.1
pytorch-lightning 1.7.7
Q1. Failed: Cuda error kernels/bagua_kernels.cu:628 'no kernel image is available for execution on the device'
I think you may need to install the appropriate versions of torch and torchvision.
Q2. Failed: Cuda error kernels/bagua_kernels.cu:285 'invalid device function'
I think you may need to upgrade Bagua to the latest version.
@awaelchli @quancs I think this problem may actually be caused by the precompiled Bagua package installed with pip3 install bagua-cuda11X.
To solve it, download the Bagua source code, then compile and install it locally. PyTorch Lightning then runs successfully with BaguaStrategy using the gradient_allreduce/decentralized/async/bytegrad/low_precision_decentralized algorithms.
You are correct @woqidaideshi! Thank you
🐛 Bug
To Reproduce
algorithm="gradient_allreduce": no error
algorithm="decentralized": no error
algorithm="async":
algorithm="bytegrad":
algorithm="low_precision_decentralized":
Expected behavior
No error.
Environment
Additional context
Installed bagua-cuda111
cc @awaelchli @wangraying @akihironitta