donydchen / mvsplat

🌊 [ECCV'24] MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
https://donydchen.github.io/mvsplat

RuntimeError: DataLoader worker (pid 1201923) is killed by signal: Floating point exception. #12

Closed: Liangym1225 closed this issue 2 months ago

Liangym1225 commented 3 months ago

Thanks for your great work! @donydchen I tried to train MVSplat using the processed RealEstate10K dataset provided by pixelSplat's authors, but the following error occurred. The training loop ran successfully for about 10K steps before it crashed. I have no idea what causes this. Maybe a zero division? Have you faced this error before?

```
Error executing job with overrides: ['+experiment=re10k', 'data_loader.train.batch_size=8']
Traceback (most recent call last):
  File "/home/liang/mvsplat/src/main.py", line 141, in train
    trainer.fit(model_wrapper, datamodule=data_module, ckpt_path=checkpoint_path)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 223, in advance
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=0)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 278, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 347, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 336, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/pytorch_lightning/core/hooks.py", line 613, in transfer_batch_to_device
    return move_data_to_device(batch, device)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/lightning_fabric/utilities/apply_func.py", line 103, in move_data_to_device
    return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 72, in apply_to_collection
    return _apply_to_collection_slow(
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
    v = _apply_to_collection_slow(
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
    v = _apply_to_collection_slow(
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 96, in _apply_to_collection_slow
    return function(data, *args, **kwargs)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/lightning_fabric/utilities/apply_func.py", line 97, in batch_to
    data_output = data.to(device, **kwargs)
  File "/home/liang/anaconda3/envs/mvsplat/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1201923) is killed by signal: Floating point exception.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```

donydchen commented 3 months ago

Hi @Liangym1225, thanks for your interest in our work.

I have not encountered any errors like this before. The error seems to come from the DataLoader, so I don't think it is related to a zero division; to me, it looks more like a device or data issue. May I know what batch_size you used and on which GPU you are training? I have trained the model on both A100 and V100 GPUs, mainly with batch_size=14x1 or batch_size=2x7, and the training has always been very stable.

Liangym1225 commented 3 months ago

I set batch_size=8 and my GPU is an RTX 6000 Ada (48GB). I am now running an experiment with num_workers=0 to see whether the error is related to multiprocessing.

Liangym1225 commented 3 months ago

With num_workers=0, the error seems to be gone. The run has passed 40K steps and is still going, but much more slowly. This could be a multiprocessing issue caused by pytorch-lightning; I am not sure, since I am not familiar with it. I am using pytorch-lightning==2.2.1. Which version did you use?

Something else I noticed is that the number of steps per epoch became smaller (purple curve), and the visualization of the point cloud looks strange (left), when training with num_workers=0. Is this expected? (Screenshots attached.)

donydchen commented 3 months ago

Hi @Liangym1225, we used pytorch-lightning==2.2.0 in our experiments. Normally, reducing the batch_size makes the number of steps per epoch larger. Ideally, though, changing only num_workers should not affect the steps per epoch; you may want to double-check how pytorch-lightning computes this value. In any case, it should not affect performance: as long as the total training steps are set to 300k, it should be fine.
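
As a rough illustration of that reasoning (this is only an assumption about how the count behaves, not a quote of Lightning's internals, and the dataset size below is hypothetical), the steps per epoch are essentially the number of batches the train DataLoader yields, which does not involve num_workers at all:

```python
import math

num_samples = 67_000            # hypothetical size of the training split
batch_size = 8                  # per-device batch size
num_devices = 1
accumulate_grad_batches = 1

# num_workers appears nowhere in this formula, so changing it alone
# should leave the steps per epoch unchanged.
steps_per_epoch = math.ceil(
    num_samples / (batch_size * num_devices * accumulate_grad_batches)
)
print(steps_per_epoch)  # 8375 for the hypothetical numbers above
```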

The point cloud projection looks normal to me. Some scenes do look weird; not every case is perfect.

However, training with num_workers=0 will be too slow. I would recommend synchronising with the recent updates in this repo and training with a somewhat smaller batch size (if batch_size=8 was already occupying nearly all of your GPU memory), while keeping a larger num_workers. A smaller batch size might lead to slightly worse performance, but at least the training speed remains acceptable with more workers.

Liangym1225 commented 3 months ago

Thank you for your advice. I downgraded pytorch-lightning, but it still didn't work. I also ran a simple pytorch-lightning image-classification script with num_workers>0, and no errors occurred, so the error might not be related to multiprocessing itself. I am now trying to identify the specific line where the error occurs.
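
A sanity script of the kind mentioned above might look roughly like the following (a minimal, self-contained sketch using random tensors, not the exact script that was run): if this trains cleanly with num_workers > 0 while the real training crashes, the problem is more likely in the dataset pipeline or the environment than in Lightning's multiprocessing itself.

```python
# Minimal pytorch-lightning sanity check with multi-worker data loading.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyClassifier(pl.LightningModule):
    def __init__(self) -> None:
        super().__init__()
        self.net = torch.nn.Linear(32, 4)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main() -> None:
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 4, (1024,)))
    loader = DataLoader(dataset, batch_size=8, num_workers=4, shuffle=True)
    trainer = pl.Trainer(max_steps=500, accelerator="auto", devices=1, logger=False)
    trainer.fit(TinyClassifier(), train_dataloaders=loader)


if __name__ == "__main__":
    main()
```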

donydchen commented 3 months ago

I still suspect this might be related to an out-of-memory issue, either in GPU memory or in system RAM. Have you tried reducing the batch size, e.g., batch_size=4? Although this might significantly impact performance, it should be helpful for debugging.
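
If it helps with checking the out-of-memory hypothesis, here is a small optional sketch (not part of mvsplat; the callback name and logging interval are arbitrary) that logs the peak GPU memory every few hundred steps using only standard torch and Lightning APIs:

```python
import torch
import pytorch_lightning as pl


class PeakMemoryLogger(pl.Callback):
    """Print peak allocated GPU memory every `every_n_steps` training batches."""

    def __init__(self, every_n_steps: int = 500) -> None:
        super().__init__()
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx) -> None:
        if torch.cuda.is_available() and batch_idx % self.every_n_steps == 0:
            peak_gib = torch.cuda.max_memory_allocated() / 1024**3
            print(f"step {trainer.global_step}: peak GPU memory {peak_gib:.2f} GiB")


# usage: pass callbacks=[PeakMemoryLogger()] when constructing the Trainer.
```

If system RAM is the suspect instead, watching the training process with a tool such as htop serves the same purpose.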

Liangym1225 commented 3 months ago

Not yet. I will try it later.

Liangym1225 commented 2 months ago

I set batch_size=4, but it still didn't work. I found that the floating point exception always happened at this line, so I changed the interpolation algorithm to bicubic. After that, the floating point exception was gone, but I started encountering Segmentation fault or multiprocessing.context.AuthenticationError: digest sent was rejected. There is an issue reporting that segmentation faults can happen in the DataLoader on Ubuntu 22.04.4 LTS. I am also using Ubuntu 22.04.4 LTS, so I wonder if this is an OS-related problem. I am planning to upgrade my OS to Ubuntu 23.10.
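
For illustration only (the actual resize call in the repo is not quoted here, and the original interpolation mode is an assumption), the kind of change described above, switching to bicubic interpolation, looks like this with torch.nn.functional.interpolate:

```python
import torch
import torch.nn.functional as F

image = torch.rand(1, 3, 360, 640)  # dummy (B, C, H, W) image tensor

# before: e.g. bilinear (assumed; not confirmed against the repo's code)
resized_bilinear = F.interpolate(image, size=(256, 256), mode="bilinear", align_corners=False)

# after: bicubic, as described above
resized_bicubic = F.interpolate(image, size=(256, 256), mode="bicubic", align_corners=False)

print(resized_bilinear.shape, resized_bicubic.shape)
```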

Liangym1225 commented 2 months ago

The script ran successfully on another GPU server, so this appears to be a device issue.