Lightning-Universe / lightning-flash

Your PyTorch AI Factory - Flash enables you to easily configure and run complex AI recipes for over 15 tasks across 7 data domains
https://lightning-flash.readthedocs.io
Apache License 2.0

RuntimeError: DataLoader worker (pid(s) 18359) exited unexpectedly #896

imneonizer closed 2 years ago

imneonizer commented 2 years ago

🐛 Bug

I am trying to run this minimal training code:

import os
from torch.utils.data.sampler import RandomSampler
import flash
from flash.core.data.utils import download_data
from flash.video import VideoClassificationData, VideoClassifier

# 1. Download a video clip dataset. Find more datasets at https://pytorchvideo.readthedocs.io/en/latest/data.html
# download_data("https://pl-flash-data.s3.amazonaws.com/kinetics.zip")

# 2. Load the Data
datamodule = VideoClassificationData.from_folders(
    train_folder="kinetics/train",
    val_folder="kinetics/val",
    predict_folder="kinetics/predict",
    batch_size=8,
    clip_sampler="uniform",
    clip_duration=1,
    video_sampler=RandomSampler,
    decode_audio=False,
    num_workers=8,
)

# 3. Build the model
model = VideoClassifier(backbone="x3d_xs", num_classes=datamodule.num_classes, pretrained=False)

# 4. Create the trainer
trainer = flash.Trainer(gpus=1, max_epochs=3)

# 5. Finetune the model
trainer.finetune(model, datamodule=datamodule)

# 6. Save it!
trainer.save_checkpoint("video_classification.pt")

But it exits unexpectedly, and Chrome also crashes afterwards:

2021-10-28 11:16:12.913324: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using 'x3d_xs' provided by Facebook Research/PyTorchVideo (https://github.com/facebookresearch/pytorchvideo).
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type       | Params
---------------------------------------------
0 | train_metrics | ModuleDict | 0     
1 | val_metrics   | ModuleDict | 0     
2 | backbone      | Net        | 3.8 M 
3 | head          | Sequential | 2.0 K 
---------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.185    Total estimated model params size (MB)
Validation sanity check:  50%|██████████████████████████████████████████████████████████████▌                                                              | 1/2 [00:01<00:01,  1.58s/it]
Traceback (most recent call last):
  File "train.py", line 30, in <module>
    trainer.finetune(model, datamodule=datamodule)
  File "/anaconda3/lib/python3.8/site-packages/flash/core/trainer.py", line 188, in finetune
    return super().fit(model, train_dataloader, val_dataloaders, datamodule)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
    self.accelerator.start_training(self)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1000, in run_stage
    return self._run_train()
  File "/anaconda3/lib/python3.8/site-packages/flash/core/trainer.py", line 112, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/anaconda3/lib/python3.8/site-packages/flash/core/trainer.py", line 94, in _run_sanity_check
    super()._run_sanity_check(ref_model)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1122, in _run_sanity_check
    self._evaluation_loop.run()
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 111, in advance
    output = self.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 158, in evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 211, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/anaconda3/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 178, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/flash/core/model.py", line 448, in validation_step
    output = self.step(batch, batch_idx, self.val_metrics)
  File "/anaconda3/lib/python3.8/site-packages/flash/video/classification/model.py", line 161, in step
    return super().step((batch["video"], batch["label"]), batch_idx, metrics)
  File "/anaconda3/lib/python3.8/site-packages/flash/core/model.py", line 388, in step
    y_hat = self(x)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/flash/video/classification/model.py", line 164, in forward
    x = self.backbone(x)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/pytorchvideo/models/net.py", line 43, in forward
    x = self.blocks[idx](x)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/pytorchvideo/models/resnet.py", line 1386, in forward
    x = res_block(x)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/pytorchvideo/models/resnet.py", line 1173, in forward
    x = self.branch_fusion(shortcut, self.branch2(x))
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/pytorchvideo/models/resnet.py", line 1344, in forward
    x = self.norm_b(x)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/fvcore/nn/squeeze_excitation.py", line 80, in forward
    output_tensor = torch.mul(input_tensor, self.block(mean_tensor))
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 587, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 582, in _conv_forward
    return F.conv3d(
  File "/anaconda3/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 19960) is killed by signal: Killed.

Environment

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sudoandros commented 1 year ago

Hey @imneonizer! I'm having the same bug. Were you able to figure out why this is happening?
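
A note on the traceback above: the final line, `DataLoader worker (pid 19960) is killed by signal: Killed`, means a worker process received SIGKILL. On Linux this is often the kernel's out-of-memory killer, which would also fit with Chrome crashing at the same moment, though nothing in this thread confirms that. A cheap experiment is to rerun the script with a smaller `num_workers` (or `num_workers=0`) while watching memory usage. The sketch below is a hypothetical helper, not part of Flash or PyTorch, that caps the worker count from a memory budget. The names `available_memory_bytes` and `pick_num_workers`, the 2 GiB per-worker figure, and the 50% reserve are all assumptions to be replaced with measured values.

```python
import os


def available_memory_bytes():
    """Best-effort free physical memory in bytes; None where unsupported."""
    try:
        return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError, AttributeError):
        # sysconf or these names are missing on some platforms.
        return None


def pick_num_workers(requested, available_bytes, per_worker_bytes,
                     reserve_fraction=0.5):
    """Cap `requested` so the workers fit in the non-reserved memory share.

    Falls back to 0 (load data in the main process) when the amount of
    free memory is unknown.
    """
    if available_bytes is None:
        return 0
    budget = available_bytes * (1.0 - reserve_fraction)
    affordable = int(budget // per_worker_bytes)
    return max(0, min(requested, affordable))


if __name__ == "__main__":
    GIB = 1024 ** 3
    workers = pick_num_workers(
        requested=8,                      # value used in the report above
        available_bytes=available_memory_bytes(),
        per_worker_bytes=2 * GIB,         # assumption: measure this yourself
    )
    print(f"num_workers={workers}")
```

The result would be passed as `num_workers=` to `VideoClassificationData.from_folders(...)` in place of the hard-coded `8`. If the crash disappears at a lower worker count, memory pressure from the video-decoding workers is the likely cause.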