Nixtla / neuralforecast

Scalable and user friendly neural :brain: forecasting algorithms.
https://nixtlaverse.nixtla.io/neuralforecast
Apache License 2.0

ProcessExitedException: process 1 terminated with signal SIGSEGV #482

Closed · rumeysskara closed 1 year ago

rumeysskara commented 1 year ago

I'm working with NBEATSx, and my dataset has 160 feature columns. What could cause this error?

ProcessExitedException                    Traceback (most recent call last)
Cell In [22], line 1
----> 1 model.fit(dataset=dataset)

File /opt/envs/venv-mlops/lib/python3.8/site-packages/neuralforecast/common/_base_windows.py:569, in BaseWindows.fit(self, dataset, val_size, test_size)
    566 self.trainer_kwargs["check_val_every_n_epoch"] = check_val_every_n_epoch
    568 trainer = pl.Trainer(**self.trainer_kwargs)
--> 569 trainer.fit(self, datamodule=datamodule)

File /opt/envs/venv-mlops/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:608, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    606 model = self._maybe_unwrap_optimized(model)
    607 self.strategy._lightning_module = model
--> 608 call._call_and_handle_interrupt(
    609     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    610 )

File /opt/envs/venv-mlops/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py:36, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     34 try:
     35     if trainer.strategy.launcher is not None:
---> 36         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     37     else:
     38         return trainer_fn(*args, **kwargs)

File /opt/envs/venv-mlops/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:113, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
    110 else:
    111     process_args = [trainer, function, args, kwargs, return_queue]
--> 113 mp.start_processes(
    114     self._wrapping_function,
    115     args=process_args,
    116     nprocs=self._strategy.num_processes,
    117     start_method=self._start_method,
    118 )
    119 worker_output = return_queue.get()
    120 if trainer is None:

File /opt/envs/venv-mlops/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:197, in start_processes(fn, args, nprocs, join, daemon, start_method)
    194     return context
    196 # Loop on join until it returns True or raises an exception.
--> 197 while not context.join():
    198     pass

File /opt/envs/venv-mlops/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:140, in ProcessContext.join(self, timeout)
    138 if exitcode < 0:
    139     name = signal.Signals(-exitcode).name
--> 140     raise ProcessExitedException(
    141         "process %d terminated with signal %s" %
    142         (error_index, name),
    143         error_index=error_index,
    144         error_pid=failed_process.pid,
    145         exit_code=exitcode,
    146         signal_name=name
    147     )
    148 else:
    149     raise ProcessExitedException(
    150         "process %d terminated with exit code %d" %
    151         (error_index, exitcode),
    (...)
    154         exit_code=exitcode
    155     )

ProcessExitedException: process 1 terminated with signal SIGSEGV
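
For context, the call in the first frame typically comes from a setup along these lines. This is a minimal sketch using the high-level NeuralForecast API rather than the reporter's actual code (the traceback goes through the lower-level model.fit(dataset=...) path); the horizon, input size, frequency, and column handling are illustrative placeholders.

from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATSx

# df is a long-format frame with unique_id, ds, y plus ~160 exogenous columns.
hist_exog = [c for c in df.columns if c not in ("unique_id", "ds", "y")]

model = NBEATSx(
    h=24,                      # forecast horizon (placeholder)
    input_size=48,             # lookback window (placeholder)
    hist_exog_list=hist_exog,  # the ~160 historical exogenous features
    max_steps=500,
)

nf = NeuralForecast(models=[model], freq="H")  # freq is a placeholder
nf.fit(df=df)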

kdgutier commented 1 year ago

Hi @rumeysskara,

It seems to me that your problem is related to memory issues in distributed training. Here is a PyTorch forum post describing a similar issue: https://discuss.pytorch.org/t/how-to-fix-a-sigsegv-in-pytorch-when-using-distributed-training-e-g-ddp/113518/2

Would you be able to confirm whether training on a single GPU works fine?
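
A minimal way to test that, assuming (as the traceback suggests via pl.Trainer(**self.trainer_kwargs)) that extra model keyword arguments are forwarded to the Lightning Trainer, is to pin training to a single device so the multiprocessing launcher is never involved. The keywords below are Lightning's own, and the hyperparameters are placeholders.

from neuralforecast.models import NBEATSx

# Illustrative: restrict training to one device so Lightning does not go
# through the multiprocessing launcher seen in the traceback. If the
# segfault disappears, the problem is in the DDP/spawn path.
model = NBEATSx(
    h=24,                 # placeholder horizon
    input_size=48,        # placeholder lookback window
    accelerator="gpu",    # forwarded to pl.Trainer
    devices=1,            # one device -> no DDP spawn, no child processes
)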

The post suggests resolving it by managing the PyTorch/CUDA version combination.
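
To check which version combination is in play, the standard attributes below print the PyTorch build, the CUDA version it was compiled against, and the Lightning version; nothing here is neuralforecast-specific.

import torch
import pytorch_lightning as pl

print("torch:", torch.__version__)             # PyTorch build
print("cuda (compiled):", torch.version.cuda)  # CUDA version of the wheel
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
print("lightning:", pl.__version__)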