allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

clearml does not support pytorch-lightning with multi-gpus #635

Open manelabinyamin opened 2 years ago

manelabinyamin commented 2 years ago

Hi, I am trying to run clearml with pytorch-lightning on multiple GPUs, but the agent does not capture anything that happens inside the fit function (progress bar, TensorBoard scalars/plots, etc.). When using plain PyTorch on multi-GPU, or pytorch-lightning on CPU / a single GPU, everything works fine. To be sure, I also ran your example code on multi-GPU and it didn't work either (see the attached file for the corresponding adjustments).

Specs: Ubuntu 20.04, DGX (8x A100), Python 3.8.12, CUDA 11.4, torch 1.10.0, pytorch-lightning 1.6.0, clearml 1.3.0, clearml-agent 1.1.2

clearml_example.txt

erezalg commented 2 years ago

Hi @manelabinyamin,

I'll take a look at it. I (unfortunately) still do not have a DGX, but I'll hunt down a machine with multiple GPUs :)

manelabinyamin commented 2 years ago

Thanks :) Please keep me updated

erezalg commented 2 years ago

Hi @manelabinyamin ,

I was able to reproduce this issue. I just want to make sure we're seeing the same thing. The only difference I see between multi-GPU and single-GPU is that some metrics are not reported, namely "epoch", "test_loss" and "valid_loss". On multi-GPU I do see "hp_metric", and I also see the progress bar (with fewer reports, I guess because of the extra processing power?). I don't see any plots on either of them, so I could not compare those.

Let me know if this is what you also see and I can move forward with fixing this :)

manelabinyamin commented 2 years ago

Hi @erezalg, I know the example script doesn't use plots, but from my experience they won't work either. In general, the agent won't capture anything from within the 'fit' function (the training loop), which is why you still see hp_metric, etc. A simple test you can run is plotting an empty image inside the training step by replacing it with the following code.

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.logger.experiment.add_image('check_plot', np.zeros((10, 10, 1)), self.global_step, dataformats='HWC')  # needs: import numpy as np
        return loss

My best guess is that there is some problem with the devices' ranks...
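
One quick way to test that hypothesis (a sketch, not part of the original script, relying on Lightning's built-in global_rank attribute) is to print which rank actually reaches the training step and which logger it sees:

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        if batch_idx == 0:
            # In DDP, logging is expected to happen only from rank 0; printing the
            # rank shows which processes reach this point and what logger they hold.
            print(f"global_rank={self.global_rank}, logger={type(self.logger).__name__}")
        return loss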

thanks a lot!

erezalg commented 2 years ago

Hi @manelabinyamin ,

Yeah that makes sense :) We're looking into it and hopefully will come up with a solution soon!

Rizwan-Hasan commented 2 years ago

Hi @manelabinyamin ,

Sorry for the late reply. We've looked into it, and the good news is that we've found a solution: you always have to initialize the Task before any model code runs.

Here I've edited the code you posted in clearml_example.txt, and it should work now:

from argparse import ArgumentParser
import torch
import pytorch_lightning as pl
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from clearml import Task

from torchvision.datasets.mnist import MNIST
from torchvision import transforms

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name="examples", task_name="PyTorch lightning MNIST example")

class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()

        self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
        self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        return x

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('valid_loss', loss)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('test_loss', loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    @staticmethod
    def add_model_specific_args(parent_parser):
        parser = ArgumentParser(parents=[parent_parser], add_help=False)
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=0.0001)
        return parser

if __name__ == '__main__':
    pl.seed_everything(0)

    parser = ArgumentParser()
    parser.add_argument('--batch_size', default=32, type=int)
    parser = pl.Trainer.add_argparse_args(parser)
    parser.set_defaults(max_epochs=3, gpus=8)
    parser = LitClassifier.add_model_specific_args(parser)
    args = parser.parse_args()

    # ------------
    # data
    # ------------
    dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
    mnist_test = MNIST('', train=False, download=True, transform=transforms.ToTensor())
    mnist_train, mnist_val = random_split(dataset, [55000, 5000])

    train_loader = DataLoader(mnist_train, batch_size=args.batch_size)
    val_loader = DataLoader(mnist_val, batch_size=args.batch_size)
    test_loader = DataLoader(mnist_test, batch_size=args.batch_size)

    # ------------
    # model
    # ------------
    model = LitClassifier(args.hidden_dim, args.learning_rate)

    # ------------
    # training
    # ------------
    trainer = pl.Trainer.from_argparse_args(args)
    trainer.fit(model, train_loader, val_loader)

    # ------------
    # testing
    # ------------
    trainer.test(test_dataloaders=test_loader)

Rizwan-Hasan commented 2 years ago

Hi @manelabinyamin ,

Are you still facing this issue? Have you applied our solution? Please let us know.

ssetu commented 1 year ago

Hi @Rizwan-Hasan, I am using clearml-agent version 1.4.1 and clearml version 1.8.0, and this is not working for multiple GPUs. I am using the example script at https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch-lightning/pytorch_lightning_example.py, with a small modification so it also runs on CPU-only machines: I replaced parser.set_defaults(max_epochs=3) with

if torch.cuda.is_available():
    parser.set_defaults(max_epochs=3, accelerator="gpu", devices=-1)
else:
    parser.set_defaults(max_epochs=3)

Here are the results:

  1. Doesn't work with devices=-1 on an 8-GPU machine
  2. Works with devices=-1 on a single-GPU machine
  3. Works with devices=1 on an 8-GPU machine

So I am only able to use one GPU at a time. The tail of the execution log is below:
 Environment setup completed successfully
 Starting Task Execution:
 2022-11-22 13:20:31
 ClearML results page: https://app.clearml.dev.xxx.net/projects/1711e7e1538f454186422bc88362ad4b/experiments/9103459c7f70447b9ce08eaef21f4659/output/log
 Global seed set to 0
 Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
 Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to MNIST/raw/train-images-idx3-ubyte.gz
 100% 9912422/9912422 [00:00<00:00, 55682849.90it/s]
 Extracting MNIST/raw/train-images-idx3-ubyte.gz to MNIST/raw
 2022-11-22 13:20:36
 Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
 Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to MNIST/raw/train-labels-idx1-ubyte.gz
 100% 28881/28881 [00:00<00:00, 5670084.90it/s]
 Extracting MNIST/raw/train-labels-idx1-ubyte.gz to MNIST/raw
 Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
 Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to MNIST/raw/t10k-images-idx3-ubyte.gz
 100% 1648877/1648877 [00:00<00:00, 13600305.59it/s]
 Extracting MNIST/raw/t10k-images-idx3-ubyte.gz to MNIST/raw
 Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
 Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to MNIST/raw/t10k-labels-idx1-ubyte.gz
 100% 4542/4542 [00:00<00:00, 18006170.86it/s]
 Extracting MNIST/raw/t10k-labels-idx1-ubyte.gz to MNIST/raw
 GPU available: True (cuda), used: True
 TPU available: False, using: 0 TPU cores
 IPU available: False, using: 0 IPUs
 HPU available: False, using: 0 HPUs
 Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
 2022-11-22 13:20:41
 Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
 2022-11-22 13:20:47
 Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
 2022-11-22 13:20:52
 Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
 Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
 2022-11-22 13:20:57
 Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
 2022-11-22 13:21:02
 Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
 2022-11-22 13:21:07
 Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
 ----------------------------------------------------------------------------------------------------
 distributed_backend=nccl
 All distributed processes registered. Starting with 8 processes
 ----------------------------------------------------------------------------------------------------
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 Missing logger folder: /root/.clearml/venvs-builds/3.9/task_repository/ai-clearml-train-boilerplate.git/lightning_logs
 2022-11-22 13:23:32
 ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start

The task execution is not logged after this point; there seems to be no progress even after a long time, and CPU usage is stuck at around 25% while GPU usage is 0.

I am using the following command to start the task: clearml-task --project ClearMLpractice --name hello_ptl --repo git@github.com:xx/xx.git --branch master --script pytorch_lightning/ptl_mnist.py --args batch_size=64 max_epochs=30 --docker pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime --docker_args "-v /home/xxx/.ssh:/root/.ssh:ro" --queue default

ssetu commented 1 year ago

Running the same example directly on an 8-GPU machine leads to the following issue:

Traceback (most recent call last):
  File "test.py", line 93, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 582, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS

Rizwan-Hasan commented 1 year ago

Hi @ssetu,

I'll take a look at your issue and update you soon.

ssetu commented 1 year ago

@Rizwan-Hasan I've found a solution. Multi-GPU training requires interprocess communication through shared memory, so either the --ipc=host flag should be used or a larger shared memory segment needs to be allocated using the --shm-size flag.

Rizwan-Hasan commented 1 year ago

@ssetu That's good to hear. Can you please post the solution here?

ssetu commented 1 year ago

One solution is to add extra_docker_arguments: ["--ipc=host", ] to the agent's clearml.conf file. Alternatively, we can allocate a larger shared memory segment with extra_docker_arguments: ["--shm-size=8g", ]. Be careful not to exceed your RAM size when using the latter.
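
For reference, a sketch of where such a setting might live in clearml.conf, assuming the standard agent section layout (adjust to your own file):

    agent {
        # Extra arguments appended to the docker run command that launches the task container.
        # Pick one of the two; keep --shm-size within the host's available RAM.
        extra_docker_arguments: ["--ipc=host", ]
        # extra_docker_arguments: ["--shm-size=8g", ]
    }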

kampelmuehler commented 1 year ago

For me, the only solution I found that makes clearml log scalars when using multiple GPUs is to make the Task part of the LightningModule, i.e.:

class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        self.task = Task.init(...)
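
Expanded into a fuller sketch of that workaround (the project/task names below are placeholders, not taken from this thread):

from clearml import Task
import pytorch_lightning as pl
import torch

class LitClassifier(pl.LightningModule):
    def __init__(self, hidden_dim=128, learning_rate=1e-3):
        super().__init__()
        # Workaround described above: create the ClearML Task inside the module
        # rather than at script top level. Names here are placeholders.
        self.task = Task.init(project_name="examples",
                              task_name="pytorch lightning MNIST multi-GPU")
        self.save_hyperparameters()
        self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
        self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)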