Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

"TORCH_USE_CUDA_DSA" error when using SyncBatchNorm and DDP #17231

Closed · rafathasan closed this 8 months ago

rafathasan commented 1 year ago

Bug description

Title: "TORCH_USE_CUDA_DSA" error when using SyncBatchNorm and DDP with ModelCheckpoint in multi-GPU semantic segmentation training

Description:

I am training a two-branch model for semantic segmentation using SyncBatchNorm and DDP with 1 node and 8 GPUs. Training runs fine without any errors, but when I add ModelCheckpoint to the trainer's callbacks, it raises a "TORCH_USE_CUDA_DSA" error at an arbitrary epoch.

Steps to reproduce:

Todo
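
(Exact steps are still marked Todo; the sketch below is only a minimal configuration matching the description and the yaml config further down, with Lightning's BoringModel standing in for the real two-branch model.)

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.demos.boring_classes import BoringModel  # stand-in model

trainer = pl.Trainer(
    accelerator="cuda",
    devices=8,
    num_nodes=1,
    strategy="ddp",
    sync_batchnorm=True,  # wraps every BatchNorm layer in SyncBatchNorm
    max_epochs=100,
    limit_train_batches=1,
    callbacks=[ModelCheckpoint(dirpath="checkpoints/", every_n_epochs=0, save_last=True)],
)
trainer.fit(BoringModel())
```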

Expected behavior:

ModelCheckpoint should save the model's state without any errors.

Actual behavior:

ModelCheckpoint saves the last model's state, and a "TORCH_USE_CUDA_DSA" error is raised at an arbitrary epoch whenever ModelCheckpoint is used.

Reproducibility:

Always

This is what I tried in order to avoid the error, without success:

yaml config

```yaml
datasets:
  train_batch_size: 3
  val_batch_size: 16
  test_batch_size: 16
trainer:
  train:
    devices: 8
    accelerator: cuda
    num_nodes: 1
    strategy: ddp
    sync_batchnorm: True
    max_epochs: 100
    limit_train_batches: 1
    log_every_n_steps: 1
  logger:
    CSVLogger:
      save_dir: logs/
      name: csv_cps
  callbacks:
    ModelCheckpoint: 
      dirpath: checkpoints/
      every_n_epochs: 0
      save_last: True
```

train.py

```python
import argparse
import pytorch_lightning as pl
from pytorch_lightning.loggers import CSVLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from models.TSS import TSS
from torchsummary import summary
import torch
import os
from utils import DatasetDownloader, Config
from datasets import BingRGB
from pytorch_lightning.plugins import TorchSyncBatchNorm

config = Config("./config/config.yaml")
parser = argparse.ArgumentParser()
parser.add_argument('-c','--ckpt_path', type=str, help='Path to checkpoint file', default=None)
parser.add_argument('--init_weight_lr', type=float, help='The value for init_weight_lr.', default=1e-5)
parser.add_argument('--init_weight_momentum', type=float, help='The value for init_weight_momentum.', default=0.1)
parser.add_argument('--lr', type=float, help='The value for lr.', default=1e-3)
args = parser.parse_args()

if not args.ckpt_path or not os.path.exists(args.ckpt_path):
    args.ckpt_path = None

data_module = BingRGB(**config.datasets_config)

model = TSS(lr=args.lr, init_weight_lr=args.init_weight_lr, init_weight_momentum=args.init_weight_momentum, download_config=config.config_dict.datasets.download.weight)

trainer = pl.Trainer(
    **config.train_config,
    callbacks=[ModelCheckpoint(**config.config_dict.trainer.callbacks.ModelCheckpoint)],
    logger=[CSVLogger(**config.config_dict.trainer.logger.CSVLogger)],
)

trainer.fit(model, data_module, ckpt_path=args.ckpt_path)
```
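
For reference, the script is launched as in the log below; resuming is optional via the `-c/--ckpt_path` flag defined above (`checkpoints/last.ckpt` is the file that `save_last: True` writes):

```bash
python train.py                            # fresh run
python train.py -c checkpoints/last.ckpt   # resume from the last checkpoint
```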

LightningModule (models/TSS.py)

```python
import torch
import pytorch_lightning as pl
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR, StepLR, ReduceLROnPlateau
from utils.init_func import init_weight, group_weight
from torch.nn import SyncBatchNorm
import torchvision
from utils import DatasetDownloader, Config
from models.components import SingleNetwork, Head

class TSS(pl.LightningModule):
    def __init__(self, lr, init_weight_lr, init_weight_momentum, download_config):
        super(TSS, self).__init__()

        ## hyperparameters ##

        self.lr = lr
        self.num_classes = 6
        self.pretrained_model = "data/pytorch-weight/resnet50_v1c.pth"
        self.automatic_optimization = False
        self.criterion = nn.CrossEntropyLoss(reduction='mean', ignore_index=255)

        self.BatchNorm = SyncBatchNorm

        DatasetDownloader(**download_config).download()

        ## hyperparameters ##

        ## network ##

        self.branch1 = SingleNetwork(self.num_classes, self.BatchNorm, self.pretrained_model)
        self.branch2 = SingleNetwork(self.num_classes, self.BatchNorm, self.pretrained_model)

        ## network ##

        init_weight(self.branch1.business_layer, nn.init.kaiming_normal_,
                    self.BatchNorm, init_weight_lr, init_weight_momentum,
                    mode='fan_in', nonlinearity='relu')
        init_weight(self.branch2.business_layer, nn.init.kaiming_normal_,
                    self.BatchNorm, init_weight_lr, init_weight_momentum,
                    mode='fan_in', nonlinearity='relu')

        self.save_hyperparameters()

    def forward(self, x, branch=1):
        if not self.training:
            return self.branch1(x)

        if branch == 1:
            return self.branch1(x)
        elif branch == 2:
            return self.branch2(x)

    def training_step(self, batch, batch_idx):

        unsup_imgs, imgs, gts = batch
        gts = gts.squeeze()

        pred_sup_l = self(imgs, branch=1)
        pred_unsup_l = self(unsup_imgs, branch=1)
        pred_sup_r = self(imgs, branch=2)
        pred_unsup_r = self(unsup_imgs, branch=2)

        ### cps loss ###
        pred_l = torch.cat([pred_sup_l, pred_unsup_l], dim=0)
        pred_r = torch.cat([pred_sup_r, pred_unsup_r], dim=0)
        _, max_l = torch.max(pred_l, dim=1)
        _, max_r = torch.max(pred_r, dim=1)
        max_l = max_l.long()
        max_r = max_r.long()
        cps_loss = self.criterion(pred_l, max_r) + self.criterion(pred_r, max_l)

        ### standard cross entropy loss ###
        loss_sup = self.criterion(pred_sup_l, gts)

        loss_sup_r = self.criterion(pred_sup_r, gts)

        loss = loss_sup + loss_sup_r + cps_loss * 1.5

        loss.backward()

        self.optimizers()[0].step()
        self.optimizers()[1].step()

        self.optimizers()[0].zero_grad()
        self.optimizers()[1].zero_grad()

        self.log("lr", self.optimizers()[0].param_groups[0]['lr'], on_step=False,
                 on_epoch=True, prog_bar=True, logger=True, sync_dist=True)
        self.log("loss", loss, on_step=False,
                 on_epoch=True, prog_bar=True, logger=True, sync_dist=True)

    def configure_optimizers(self):
        params_list_l = []

        params_list_l = group_weight(params_list_l, self.branch1.backbone,
                                     self.BatchNorm, self.lr)
        for module in self.branch1.business_layer:
            params_list_l = group_weight(
                params_list_l, module, self.BatchNorm, self.lr)

        optimizer_l = torch.optim.SGD(
            params_list_l,
            lr=self.lr,
            momentum=0.1,
            weight_decay=1e-4)

        params_list_r = []
        params_list_r = group_weight(params_list_r, self.branch2.backbone,
                                     self.BatchNorm, self.lr)
        for module in self.branch2.business_layer:
            params_list_r = group_weight(
                params_list_r, module, self.BatchNorm, self.lr)

        optimizer_r = torch.optim.SGD(
            params_list_r,
            lr=self.lr,
            momentum=0.1,
            weight_decay=1e-4)
        max_steps = self.trainer.datamodule.train_dataloader().batch_size
        max_iters = self.trainer.max_epochs * max_steps
        scheduler = [
            StepLR(optimizer_l, step_size=1,
                                gamma=0.9),
            StepLR(optimizer_r, step_size=1,
                                gamma=0.9),
        ]
        return [optimizer_l, optimizer_r], scheduler
```
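
A side note on the module above: with `self.automatic_optimization = False`, the Lightning docs recommend `self.manual_backward(loss)` over a raw `loss.backward()` so that strategy- and precision-specific hooks (e.g. for DDP) are applied. A minimal sketch of that pattern, where `compute_loss` is a hypothetical stand-in for the loss computation in `training_step`:

```python
# Sketch of Lightning's documented manual-optimization pattern (two optimizers).
def training_step(self, batch, batch_idx):
    opt_l, opt_r = self.optimizers()
    opt_l.zero_grad()
    opt_r.zero_grad()
    loss = self.compute_loss(batch)  # hypothetical helper, not part of the code above
    self.manual_backward(loss)       # lets Lightning apply strategy/precision hooks
    opt_l.step()
    opt_r.step()
```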

Error messages and logs

```
(base) root@d92d2c8bfb13:/src# python train.py 
Unzipping file...
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13842.59it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Unzipping file...
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 14979.66it/s]
Unzipping file...
Unzipping file...
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13066.37it/s]
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 12787.51it/s]
Unzipping file...
Unzipping file...
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 27776.85it/s]
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 27413.75it/s]
Unzipping file...
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 15947.92it/s]
Unzipping file...
Extracting files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 24314.81it/s]
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

/opt/conda/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:612: UserWarning: Checkpoint directory /src/checkpoints exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | branch1   | SingleNetwork    | 40.5 M
2 | branch2   | SingleNetwork    | 40.5 M
-----------------------------------------------
80.9 M    Trainable params
0         Non-trainable params
80.9 M    Total params
323.777   Total estimated model params size (MB)
/opt/conda/lib/python3.10/site-packages/lightning_fabric/loggers/csv_logs.py:188: UserWarning: Experiment logs directory logs/csv_cps/version_0 exists and is not empty. Previous log files in this directory will be deleted when the new ones are saved!
  rank_zero_warn(
Epoch 3:   0%|                                                                                                             | 0/1 [00:00<?, ?it/s, v_num=0, miou=0.0754, lr=0.00081, loss=12.10]Traceback (most recent call last):                                                                                                                                                             
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 92, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 978, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 220, in advance
    batch_output = self.manual_optimization.run(kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/manual.py", line 90, in run
    self.advance(kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/manual.py", line 109, in advance
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 288, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 329, in training_step
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/overrides/base.py", line 90, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/src/models/TSS.py", line 74, in training_step
    pred_sup_l = self(imgs, branch=1)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/src/models/TSS.py", line 65, in forward
    return self.branch1(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/src/models/components.py", line 31, in forward
    feature_maps = self.backbone(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/src/models/base_model/resnet.py", line 170, in forward
    x = self.conv1(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 753, in forward
    return sync_batch_norm.apply(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/_functions.py", line 83, in forward
    count_all = count_all[mask]
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Environment

* CUDA:
        - GPU:
                - Tesla K80
                - Tesla K80
                - Tesla K80
                - Tesla K80
                - Tesla K80
                - Tesla K80
                - Tesla K80
                - Tesla K80
        - available:         True
        - version:           11.7
* Lightning:
        - lightning-utilities: 0.8.0
        - pytorch-lightning: 2.0.0
        - torch:             2.0.0
        - torchelastic:      0.2.2
        - torchmetrics:      0.11.4
        - torchsummary:      1.5.1
        - torchtext:         0.14.1
        - torchvision:       0.15.1
* Packages:
        - aiohttp:           3.8.4
        - aiosignal:         1.3.1
        - antlr4-python3-runtime: 4.9.3
        - anyio:             3.6.2
        - appdirs:           1.4.4
        - argon2-cffi:       21.3.0
        - argon2-cffi-bindings: 21.2.0
        - arrow:             1.2.3
        - asttokens:         2.0.5
        - astunparse:        1.6.3
        - async-timeout:     4.0.2
        - attrs:             22.1.0
        - autopep8:          2.0.2
        - backcall:          0.2.0
        - beautifulsoup4:    4.11.1
        - bleach:            6.0.0
        - brotlipy:          0.7.0
        - certifi:           2022.9.24
        - cffi:              1.15.1
        - chardet:           4.0.0
        - charset-normalizer: 2.0.4
        - click:             8.1.3
        - cmake:             3.26.1
        - comm:              0.1.3
        - conda:             22.11.1
        - conda-build:       3.23.3
        - conda-package-handling: 1.9.0
        - contourpy:         1.0.7
        - cryptography:      38.0.1
        - cycler:            0.11.0
        - debugpy:           1.6.6
        - decorator:         5.1.1
        - defusedxml:        0.7.1
        - dnspython:         2.2.1
        - docker-pycreds:    0.4.0
        - docopt:            0.6.2
        - exceptiongroup:    1.0.4
        - executing:         0.8.3
        - expecttest:        0.1.4
        - fastjsonschema:    2.16.3
        - filelock:          3.6.0
        - flit-core:         3.6.0
        - fonttools:         4.39.2
        - fqdn:              1.5.1
        - frozenlist:        1.3.3
        - fsspec:            2023.3.0
        - future:            0.18.2
        - gitdb:             4.0.10
        - gitpython:         3.1.31
        - glob2:             0.7
        - hypothesis:        6.61.0
        - idna:              3.4
        - ipykernel:         6.22.0
        - ipython:           8.11.0
        - ipython-genutils:  0.2.0
        - ipywidgets:        8.0.5
        - isoduration:       20.11.0
        - jedi:              0.18.1
        - jinja2:            3.1.2
        - joblib:            1.2.0
        - jsonpointer:       2.3
        - jsonschema:        4.17.3
        - jupyter:           1.0.0
        - jupyter-client:    8.1.0
        - jupyter-console:   6.6.3
        - jupyter-core:      5.3.0
        - jupyter-events:    0.6.3
        - jupyter-server:    2.5.0
        - jupyter-server-terminals: 0.4.4
        - jupyterlab-pygments: 0.2.2
        - jupyterlab-widgets: 3.0.6
        - kiwisolver:        1.4.4
        - libarchive-c:      2.9
        - lightning-utilities: 0.8.0
        - lit:               16.0.0
        - markupsafe:        2.0.1
        - matplotlib:        3.7.1
        - matplotlib-inline: 0.1.6
        - mistune:           2.0.5
        - mkl-fft:           1.3.1
        - mkl-random:        1.2.2
        - mkl-service:       2.4.0
        - mpmath:            1.2.1
        - multidict:         6.0.4
        - nbclassic:         0.5.3
        - nbclient:          0.7.2
        - nbconvert:         7.2.10
        - nbformat:          5.8.0
        - nest-asyncio:      1.5.6
        - networkx:          3.0
        - notebook:          6.5.3
        - notebook-shim:     0.2.2
        - numpy:             1.24.2
        - nvidia-cublas-cu11: 11.10.3.66
        - nvidia-cuda-cupti-cu11: 11.7.101
        - nvidia-cuda-nvrtc-cu11: 11.7.99
        - nvidia-cuda-runtime-cu11: 11.7.99
        - nvidia-cudnn-cu11: 8.5.0.96
        - nvidia-cufft-cu11: 10.9.0.58
        - nvidia-curand-cu11: 10.2.10.91
        - nvidia-cusolver-cu11: 11.4.0.1
        - nvidia-cusparse-cu11: 11.7.4.91
        - nvidia-nccl-cu11:  2.14.3
        - nvidia-nvtx-cu11:  11.7.91
        - omegaconf:         2.3.0
        - onedrivedownloader: 1.1.3
        - packaging:         23.0
        - pandocfilters:     1.5.0
        - parso:             0.8.3
        - pathtools:         0.1.2
        - pexpect:           4.8.0
        - pickleshare:       0.7.5
        - pillow:            9.4.0
        - pip:               22.3.1
        - pipreqs:           0.4.11
        - pkginfo:           1.8.3
        - platformdirs:      3.2.0
        - pluggy:            1.0.0
        - prometheus-client: 0.16.0
        - prompt-toolkit:    3.0.38
        - protobuf:          4.22.1
        - psutil:            5.9.0
        - ptyprocess:        0.7.0
        - pure-eval:         0.2.2
        - pycodestyle:       2.10.0
        - pycosat:           0.6.4
        - pycparser:         2.21
        - pygments:          2.11.2
        - pyopenssl:         22.0.0
        - pyparsing:         3.0.9
        - pyrsistent:        0.19.3
        - pysocks:           1.7.1
        - python-dateutil:   2.8.2
        - python-etcd:       0.4.5
        - python-json-logger: 2.0.7
        - pytorch-lightning: 2.0.0
        - pytz:              2022.1
        - pyyaml:            6.0
        - pyzmq:             25.0.2
        - qtconsole:         5.4.1
        - qtpy:              2.3.0
        - requests:          2.28.1
        - rfc3339-validator: 0.1.4
        - rfc3986-validator: 0.1.1
        - ruamel.yaml:       0.17.21
        - ruamel.yaml.clib:  0.2.6
        - scikit-learn:      1.2.2
        - scipy:             1.10.1
        - send2trash:        1.8.0
        - sentry-sdk:        1.17.0
        - setproctitle:      1.3.2
        - setuptools:        65.5.0
        - six:               1.16.0
        - smmap:             5.0.0
        - sniffio:           1.3.0
        - sortedcontainers:  2.4.0
        - soupsieve:         2.3.2.post1
        - stack-data:        0.2.0
        - sympy:             1.11.1
        - terminado:         0.17.1
        - thop:              0.1.1.post2209072238
        - threadpoolctl:     3.1.0
        - tinycss2:          1.2.1
        - toml:              0.10.2
        - tomli:             2.0.1
        - toolz:             0.12.0
        - torch:             2.0.0
        - torchelastic:      0.2.2
        - torchmetrics:      0.11.4
        - torchsummary:      1.5.1
        - torchtext:         0.14.1
        - torchvision:       0.15.1
        - tornado:           6.2
        - tqdm:              4.65.0
        - traitlets:         5.7.1
        - triton:            2.0.0
        - types-dataclasses: 0.6.6
        - typing-extensions: 4.4.0
        - uri-template:      1.2.0
        - urllib3:           1.26.13
        - wandb:             0.14.0
        - wcwidth:           0.2.5
        - webcolors:         1.12
        - webencodings:      0.5.1
        - websocket-client:  1.5.1
        - wheel:             0.37.1
        - widgetsnbextension: 4.0.6
        - yarg:              0.1.9
        - yarl:              1.8.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.10.8
        - version:           #74~20.04.1-Ubuntu SMP Wed Feb 22 14:52:34 UTC 2023

cc @justusschock @awaelchli

awaelchli commented 1 year ago

@rafathasan Sorry for the late response. Have you tried setting CUDA_LAUNCH_BLOCKING=1 as suggested in the error message? The error could be misleading and hiding the actual error message. It is also possible that this could have been OOM (out of memory).
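
For anyone who lands here: `CUDA_LAUNCH_BLOCKING` is a standard CUDA environment variable that makes kernel launches synchronous, so the Python traceback points at the kernel that actually failed. It can be set for a single run from the shell:

```bash
# run synchronously so the traceback blames the real failing kernel
CUDA_LAUNCH_BLOCKING=1 python train.py
```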

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

abumafrim commented 5 months ago

> @rafathasan Sorry for the late response. Have you tried setting CUDA_LAUNCH_BLOCKING=1 as suggested in the error message? The error could be misleading and hiding the actual error message. It is also possible that this could have been OOM (out of memory).

Thank you, reducing the eval batch size solved my problem.
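
In terms of the yaml config at the top of this issue, that amounts to lowering the eval entries, e.g. (the halved values are only an illustration):

```yaml
datasets:
  train_batch_size: 3
  val_batch_size: 8    # reduced from 16
  test_batch_size: 8   # reduced from 16
```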