Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
27.54k stars 3.3k forks

Dataloader with >0 workers when using DDP causes a crash #20054

Open alexanderswerdlow opened 1 week ago

alexanderswerdlow commented 1 week ago

Bug description

Having a dataloader with >0 workers causes a crash. This happens with custom datasets as well as with standard Hugging Face and torchvision datasets.

The dataloaders work fine standalone with many workers, and also work with accelerate just fine.

The run generally works until the first validation step, at which point it crashes. Interestingly, num_sanity_val_steps works fine [e.g., num_sanity_val_steps=10].

Working version:

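from tqdm import tqdm

# _print_config, get_tokenizer and get_dataloaders are project-specific helpers (not shown).


def main(config):
    """Main entry point for training."""
    _print_config(config, resolve=True, save_cfg=True)
    tokenizer = get_tokenizer(config)
    train_dataloader, val_dataloader = get_dataloaders(config, tokenizer=tokenizer, valid_seed=config.seed)

    # Iterate both loaders directly, without a Trainer, for 10 passes.
    for i in range(10):
        for batch in tqdm(train_dataloader):
            pass

        for batch in tqdm(val_dataloader):
            pass


if __name__ == "__main__":
    # config is presumably injected by a decorator (e.g. Hydra's) omitted from this snippet.
    main()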

Not working:

trainer.fit(model, train_ds, valid_ds)

What version are you seeing the problem on?

v2.2, master

How to reproduce the bug

No response

Error messages and logs

Traceback:

terminate called after throwing an instance of 'c10::Error'                                                                                                                                                                                                      | 0/? [00:00<?, ?it/s]
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905979055/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7f99b0d897 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7f99abdb25 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7f99be7718 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d6f6 (0x7f7f99bb26f6 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7f7f99bb45e3 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7f7f99bb4922 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a5950 (0x7f7fe82d8950 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7f7f99af236f in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f7f99aeb1cb in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7f99aeb379 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x851088 (0x7f7fe8584088 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f7fe8584406 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905979055/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7c3193d897 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7c318edb25 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7c31a17718 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d6f6 (0x7f7c319e26f6 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7f7c319e45e3 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7f7c319e4922 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a5950 (0x7f7c80108950 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7f7c3192236f in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f7c3191b1cb in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f7c3191b379 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x851088 (0x7f7c803b4088 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f7c803b4406 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x124633 (0x55d22c939633 in /homedir/envs/envname/bin/python)
frame #13: <unknown function> + 0x13d697 (0x55d22c952697 in /homedir/envs/envname/bin/python)
frame #14: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #15: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #16: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #17: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #18: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #19: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #20: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #21: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #22: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #23: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #24: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #25: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #26: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #27: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #28: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #29: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #30: <unknown function> + 0x13d77b (0x55d22c95277b in /homedir/envs/envname/bin/python)
frame #31: <unknown function> + 0x14dcf6 (0x55d22c962cf6 in /homedir/envs/envname/bin/python)
frame #32: <unknown function> + 0x129739 (0x55d22c93e739 in /homedir/envs/envname/bin/python)
frame #33: <unknown function> + 0x12763d (0x55d22c93c63d in /homedir/envs/envname/bin/python)
frame #34: <unknown function> + 0x1d418b (0x55d22c9e918b in /homedir/envs/envname/bin/python)
frame #35: _PyObject_GC_NewVar + 0x23f (0x55d22c93147f in /homedir/envs/envname/bin/python)
frame #36: PyTuple_New + 0x117 (0x55d22c938aa7 in /homedir/envs/envname/bin/python)
frame #37: <unknown function> + 0x1320b5 (0x55d22c9470b5 in /homedir/envs/envname/bin/python)
frame #38: <unknown function> + 0x1321d1 (0x55d22c9471d1 in /homedir/envs/envname/bin/python)
frame #39: <unknown function> + 0x131e4e (0x55d22c946e4e in /homedir/envs/envname/bin/python)
frame #40: <unknown function> + 0x1d7844 (0x55d22c9ec844 in /homedir/envs/envname/bin/python)
frame #41: <unknown function> + 0x1ea6eb (0x55d22c9ff6eb in /homedir/envs/envname/bin/python)
frame #42: <unknown function> + 0x143e8a (0x55d22c958e8a in /homedir/envs/envname/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x4c12 (0x55d22c94e142 in /homedir/envs/envname/bin/python)
frame #44: _PyFunction_Vectorcall + 0x6c (0x55d22c959a2c in /homedir/envs/envname/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x13ca (0x55d22c94a8fa in /homedir/envs/envname/bin/python)
frame #46: _PyFunction_Vectorcall + 0x6c (0x55d22c959a2c in /homedir/envs/envname/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x72c (0x55d22c949c5c in /homedir/envs/envname/bin/python)
frame #48: _PyFunction_Vectorcall + 0x6c (0x55d22c959a2c in /homedir/envs/envname/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x72c (0x55d22c949c5c in /homedir/envs/envname/bin/python)
frame #50: _PyFunction_Vectorcall + 0x6c (0x55d22c959a2c in /homedir/envs/envname/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x320 (0x55d22c949850 in /homedir/envs/envname/bin/python)
frame #52: _PyFunction_Vectorcall + 0x6c (0x55d22c959a2c in /homedir/envs/envname/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x320 (0x55d22c949850 in /homedir/envs/envname/bin/python)
frame #54: _PyFunction_Vectorcall + 0x6c (0x55d22c959a2c in /homedir/envs/envname/bin/python)
frame #55: <unknown function> + 0x144208 (0x55d22c959208 in /homedir/envs/envname/bin/python)
frame #56: _PyObject_CallMethodIdObjArgs + 0x169 (0x55d22c967419 in /homedir/envs/envname/bin/python)
frame #57: <unknown function> + 0x75187 (0x55d22c88a187 in /homedir/envs/envname/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x3e3b (0x55d22c94d36b in /homedir/envs/envname/bin/python)
frame #59: <unknown function> + 0x1d7c60 (0x55d22c9ecc60 in /homedir/envs/envname/bin/python)
frame #60: PyEval_EvalCode + 0x87 (0x55d22c9ecba7 in /homedir/envs/envname/bin/python)
frame #61: <unknown function> + 0x1dedaa (0x55d22c9f3daa in /homedir/envs/envname/bin/python)
frame #62: <unknown function> + 0x144bf3 (0x55d22c959bf3 in /homedir/envs/envname/bin/python)
frame #63: _PyEval_EvalFrameDefault + 0x5cd5 (0x55d22c94f205 in /homedir/envs/envname/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905979055/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe326374897 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fe326324b25 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fe32644e718 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d6f6 (0x7fe3264196f6 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7fe32641b5e3 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7fe32641b922 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a5950 (0x7fe374b3f950 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7fe32635936f in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fe3263521cb in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fe326352379 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x851088 (0x7fe374deb088 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7fe374deb406 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x124633 (0x563677bbe633 in /homedir/envs/envname/bin/python)
frame #13: <unknown function> + 0x13d697 (0x563677bd7697 in /homedir/envs/envname/bin/python)
frame #14: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #15: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #16: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #17: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #18: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #19: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #20: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #21: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #22: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #23: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #24: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #25: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #26: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #27: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #28: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #29: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #30: <unknown function> + 0x13d77b (0x563677bd777b in /homedir/envs/envname/bin/python)
frame #31: <unknown function> + 0x14dcf6 (0x563677be7cf6 in /homedir/envs/envname/bin/python)
frame #32: <unknown function> + 0x129739 (0x563677bc3739 in /homedir/envs/envname/bin/python)
frame #33: <unknown function> + 0x12763d (0x563677bc163d in /homedir/envs/envname/bin/python)
frame #34: <unknown function> + 0x1d418b (0x563677c6e18b in /homedir/envs/envname/bin/python)
frame #35: _PyObject_GC_NewVar + 0x23f (0x563677bb647f in /homedir/envs/envname/bin/python)
frame #36: PyTuple_New + 0x117 (0x563677bbdaa7 in /homedir/envs/envname/bin/python)
frame #37: <unknown function> + 0x1320b5 (0x563677bcc0b5 in /homedir/envs/envname/bin/python)
frame #38: <unknown function> + 0x1321d1 (0x563677bcc1d1 in /homedir/envs/envname/bin/python)
frame #39: <unknown function> + 0x131e4e (0x563677bcbe4e in /homedir/envs/envname/bin/python)
frame #40: <unknown function> + 0x1d7844 (0x563677c71844 in /homedir/envs/envname/bin/python)
frame #41: <unknown function> + 0x1ea6eb (0x563677c846eb in /homedir/envs/envname/bin/python)
frame #42: <unknown function> + 0x143e8a (0x563677bdde8a in /homedir/envs/envname/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x4c12 (0x563677bd3142 in /homedir/envs/envname/bin/python)
frame #44: _PyFunction_Vectorcall + 0x6c (0x563677bdea2c in /homedir/envs/envname/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x13ca (0x563677bcf8fa in /homedir/envs/envname/bin/python)
frame #46: _PyFunction_Vectorcall + 0x6c (0x563677bdea2c in /homedir/envs/envname/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x72c (0x563677bcec5c in /homedir/envs/envname/bin/python)
frame #48: _PyFunction_Vectorcall + 0x6c (0x563677bdea2c in /homedir/envs/envname/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x72c (0x563677bcec5c in /homedir/envs/envname/bin/python)
frame #50: _PyFunction_Vectorcall + 0x6c (0x563677bdea2c in /homedir/envs/envname/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x320 (0x563677bce850 in /homedir/envs/envname/bin/python)
frame #52: _PyFunction_Vectorcall + 0x6c (0x563677bdea2c in /homedir/envs/envname/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x320 (0x563677bce850 in /homedir/envs/envname/bin/python)
frame #54: _PyFunction_Vectorcall + 0x6c (0x563677bdea2c in /homedir/envs/envname/bin/python)
frame #55: <unknown function> + 0x144208 (0x563677bde208 in /homedir/envs/envname/bin/python)
frame #56: _PyObject_CallMethodIdObjArgs + 0x169 (0x563677bec419 in /homedir/envs/envname/bin/python)
frame #57: <unknown function> + 0x75187 (0x563677b0f187 in /homedir/envs/envname/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x3e3b (0x563677bd236b in /homedir/envs/envname/bin/python)
frame #59: <unknown function> + 0x1d7c60 (0x563677c71c60 in /homedir/envs/envname/bin/python)
frame #60: PyEval_EvalCode + 0x87 (0x563677c71ba7 in /homedir/envs/envname/bin/python)
frame #61: <unknown function> + 0x1dedaa (0x563677c78daa in /homedir/envs/envname/bin/python)
frame #62: <unknown function> + 0x144bf3 (0x563677bdebf3 in /homedir/envs/envname/bin/python)
frame #63: _PyEval_EvalFrameDefault + 0x5cd5 (0x563677bd4205 in /homedir/envs/envname/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1716905979055/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f549635d897 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f549630db25 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5496437718 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d6f6 (0x7f54964026f6 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1f5e3 (0x7f54964045e3 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1f922 (0x7f5496404922 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x5a5950 (0x7f54e4b28950 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x6a36f (0x7f549634236f in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f549633b1cb in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f549633b379 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x851088 (0x7f54e4dd4088 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7f54e4dd4406 in /homedir/envs/envname/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x124633 (0x5557111ab633 in /homedir/envs/envname/bin/python)
frame #13: <unknown function> + 0x13d697 (0x5557111c4697 in /homedir/envs/envname/bin/python)
frame #14: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #15: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #16: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #17: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #18: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #19: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #20: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #21: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #22: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #23: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #24: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #25: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #26: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #27: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #28: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #29: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #30: <unknown function> + 0x13d77b (0x5557111c477b in /homedir/envs/envname/bin/python)
frame #31: <unknown function> + 0x14dcf6 (0x5557111d4cf6 in /homedir/envs/envname/bin/python)
frame #32: <unknown function> + 0x129739 (0x5557111b0739 in /homedir/envs/envname/bin/python)
frame #33: <unknown function> + 0x12763d (0x5557111ae63d in /homedir/envs/envname/bin/python)
frame #34: <unknown function> + 0x1d418b (0x55571125b18b in /homedir/envs/envname/bin/python)
frame #35: _PyObject_GC_NewVar + 0x23f (0x5557111a347f in /homedir/envs/envname/bin/python)
frame #36: PyTuple_New + 0x117 (0x5557111aaaa7 in /homedir/envs/envname/bin/python)
frame #37: <unknown function> + 0x1320b5 (0x5557111b90b5 in /homedir/envs/envname/bin/python)
frame #38: <unknown function> + 0x1321d1 (0x5557111b91d1 in /homedir/envs/envname/bin/python)
frame #39: <unknown function> + 0x131e4e (0x5557111b8e4e in /homedir/envs/envname/bin/python)
frame #40: <unknown function> + 0x1d7844 (0x55571125e844 in /homedir/envs/envname/bin/python)
frame #41: <unknown function> + 0x1ea6eb (0x5557112716eb in /homedir/envs/envname/bin/python)
frame #42: <unknown function> + 0x143e8a (0x5557111cae8a in /homedir/envs/envname/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x4c12 (0x5557111c0142 in /homedir/envs/envname/bin/python)
frame #44: _PyFunction_Vectorcall + 0x6c (0x5557111cba2c in /homedir/envs/envname/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x13ca (0x5557111bc8fa in /homedir/envs/envname/bin/python)
frame #46: _PyFunction_Vectorcall + 0x6c (0x5557111cba2c in /homedir/envs/envname/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x72c (0x5557111bbc5c in /homedir/envs/envname/bin/python)
frame #48: _PyFunction_Vectorcall + 0x6c (0x5557111cba2c in /homedir/envs/envname/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x72c (0x5557111bbc5c in /homedir/envs/envname/bin/python)
frame #50: _PyFunction_Vectorcall + 0x6c (0x5557111cba2c in /homedir/envs/envname/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x320 (0x5557111bb850 in /homedir/envs/envname/bin/python)
frame #52: _PyFunction_Vectorcall + 0x6c (0x5557111cba2c in /homedir/envs/envname/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x320 (0x5557111bb850 in /homedir/envs/envname/bin/python)
frame #54: _PyFunction_Vectorcall + 0x6c (0x5557111cba2c in /homedir/envs/envname/bin/python)
frame #55: <unknown function> + 0x144208 (0x5557111cb208 in /homedir/envs/envname/bin/python)
frame #56: _PyObject_CallMethodIdObjArgs + 0x169 (0x5557111d9419 in /homedir/envs/envname/bin/python)
frame #57: <unknown function> + 0x75187 (0x5557110fc187 in /homedir/envs/envname/bin/python)
frame #58: _PyEval_EvalFrameDefault + 0x3e3b (0x5557111bf36b in /homedir/envs/envname/bin/python)
frame #59: <unknown function> + 0x1d7c60 (0x55571125ec60 in /homedir/envs/envname/bin/python)
frame #60: PyEval_EvalCode + 0x87 (0x55571125eba7 in /homedir/envs/envname/bin/python)
frame #61: <unknown function> + 0x1dedaa (0x555711265daa in /homedir/envs/envname/bin/python)
frame #62: <unknown function> + 0x144bf3 (0x5557111cbbf3 in /homedir/envs/envname/bin/python)
frame #63: _PyEval_EvalFrameDefault + 0x5cd5 (0x5557111c1205 in /homedir/envs/envname/bin/python)

Error executing job with overrides: ['loader.num_workers=4', 'trainer.val_check_interval=2']
Traceback (most recent call last):
  File "/homedir/envs/envname/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/homedir/envs/envname/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/homedir/envs/envname/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/homedir/envs/envname/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1076120) is killed by signal: Aborted. 

Environment

Current environment
* CUDA:
    - GPU:
        - NVIDIA RTX A6000
        - NVIDIA RTX A6000
    - available: True
    - version: 12.1
* Lightning:
    - lightning: 2.3.2
    - lightning-utilities: 0.11.2
    - pytorch-lightning: 2.3.1
    - torch: 2.3.1
    - torch-fidelity: 0.3.0
    - torch-tb-profiler: 0.4.3
    - torchaudio: 2.3.1
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.18.1
    - torchx: 0.6.0
* System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.10.14
    - release: 4.18.0-372.32.1.el8_6.x86_64

More info

No response

cc @justusschock @awaelchli

awaelchli commented 1 week ago

Hi @alexanderswerdlow

Can you show an example that produces this error? I am not aware of any serious issue regarding dataloading in Lightning. From the error message, we can see that the dataloader worker failed, and that it had a cuda initialization error. Using any CUDA operations in your dataloading workers is not supported/recommended by PyTorch, so naturally we would expect to see issues with or without Lightning involved.

To be able to help you, I would need to see some evidence that the issue is caused by Lightning, and some code to work with to isolate the cause of it.

alexanderswerdlow commented 1 week ago

Thanks for responding! I don't have the time to continue to debug it at the moment and provide a full repro, but switching to the following works [only needed for the val dataloader]. There are no specific cuda operations in my dataloaders [and this bug happens with and without pin_memory=True]. I can confidently say it has happened with a simple torchvision imagenet dataset.

I should note it also happens on two different machines and on a newly installed conda env.

This strongly suggests it's a lightning issue. I spent a while digging into how lightning wraps dataloaders and it errors out around here, but again, not during the sanity check.

These issues [#19763, #17378, #19598] also seem to be discussing the same problem, especially the last one [#19598], which mentions that it only occurs when passing a val dataloader. I should note that I am not using torch.compile, and that this behavior occurs even after I removed all usages of torchmetrics.

Working [no worker] dataloader:

from torch.utils.data import default_collate


class SimpleDataLoader:
    """Minimal single-process loader; extra DataLoader kwargs are accepted and ignored."""

    def __init__(self, dataset, batch_size=1, collate_fn=default_collate, **kwargs):
        self.dataset = dataset
        self.batch_size = batch_size
        self.collate_fn = collate_fn
        self.idx = 0

    def __iter__(self):
        # Reset the cursor so the loader can be iterated again on later epochs.
        self.idx = 0
        return self

    def __next__(self):
        if self.idx < len(self.dataset):
            batch = []
            # Collect up to batch_size samples; the last batch may be smaller.
            for _ in range(self.batch_size):
                if self.idx >= len(self.dataset):
                    break
                batch.append(self.dataset[self.idx])
                self.idx += 1
            return self.collate_fn(batch)
        else:
            raise StopIteration

    def __len__(self):
        # Number of batches, rounding up for a partial final batch.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size
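
The batching logic itself can be checked without torch. The following is only a sketch: `ListBatcher` is a hypothetical name, and plain `list` stands in for the collate function. It writes `__iter__` as a generator so that every pass starts from the first sample. An iterator that returns `self` from `__iter__` has to reset its index there to get the same behavior.

```python
class ListBatcher:
    """Torch-free stand-in for a single-process batch loader (hypothetical)."""

    def __init__(self, dataset, batch_size=1, collate_fn=list):
        self.dataset = dataset
        self.batch_size = batch_size
        self.collate_fn = collate_fn

    def __iter__(self):
        # A generator gets a fresh cursor on every call, so each epoch
        # starts again from index 0 without any manual reset.
        n = len(self.dataset)
        for start in range(0, n, self.batch_size):
            end = min(start + self.batch_size, n)
            yield self.collate_fn([self.dataset[i] for i in range(start, end)])

    def __len__(self):
        # Number of batches, rounding up for a partial final batch.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size


loader = ListBatcher(list(range(10)), batch_size=4)
first_epoch = list(loader)
second_epoch = list(loader)  # same batches again; the loader is reusable
```

Here `len(loader)` equals the number of batches actually produced, and the final batch holds the two leftover samples.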

awaelchli commented 1 week ago

Lightning does not wrap the dataloaders. It only injects a distributed sampler when you are using a torch DataLoader, because that sampler is needed for distributed training. For dataloaders with iterable datasets, Lightning doesn't do anything either, because the user has to take care of the implementation.
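
For readers unfamiliar with what a distributed sampler does, here is a simplified sketch of the index partitioning. It is modeled on the unshuffled, non-dropping behavior of `torch.utils.data.DistributedSampler` and is not Lightning's or PyTorch's actual code. The function name `shard_indices` is hypothetical, and a non-empty dataset is assumed.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the sample indices one rank would read (no shuffling).

    Assumes dataset_len > 0. Hypothetical helper for illustration only.
    """
    indices = list(range(dataset_len))
    # Every rank must see the same number of samples, so round up.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    # Pad by repeating indices from the start of the dataset.
    padding = total_size - dataset_len
    indices += (indices * math.ceil(padding / dataset_len))[:padding]
    # Each rank takes every num_replicas-th index, offset by its rank.
    return indices[rank:total_size:num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r) for r in range(4)]
```

With 10 samples and 4 ranks, each rank reads 3 indices, and the first two samples are repeated as padding on the last two ranks.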

The issues you linked are open for the same reason: users have not been able to provide code that reproduces the problem, so it's not possible to investigate based on the error message alone. With a reproduction, I or someone from the community can determine the root cause, which is the first step.