ray.exceptions.RayTaskError(DistStoreError): ray::ImplicitFunc.train()

koushik-rout-samsung commented 4 months ago

What happened + What you expected to happen

We tried to experiment with auto_nhits on our custom dataset and ran into the following error while training the model.

Stacktrace

```python
(_train_tune pid=386862) Seed set to 3
(_train_tune pid=386862) [rank: 0] Seed set to 3
(_train_tune pid=386862) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
2024-03-19 13:49:09,602 ERROR tune_controller.py:1374 -- Trial task failed for trial _train_tune_f0f98_00001
Traceback (most recent call last):
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistStoreError): ray::ImplicitFunc.train() (pid=386862, ip=107.99.237.41, actor_id=d3ee59cd38f38e7df8fcc1aa01000000, repr=_train_tune)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 342, in train
    raise skipped from exception_cause(skipped)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/air/_internal/util.py", line 88, in run
    self._ret = self._target(*self._args, **self._kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 115, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 332, in _trainable_func
    output = fn()
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/tune/trainable/util.py", line 138, in inner
    return trainable(config, **fn_kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/neuralforecast/common/_base_auto.py", line 207, in _train_tune
    _ = self._fit_model(
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/neuralforecast/common/_base_auto.py", line 336, in _fit_model
    model.fit(dataset, val_size=val_size, test_size=test_size)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/neuralforecast/common/_base_windows.py", line 734, in fit
    trainer.fit(self, datamodule=datamodule)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 943, in _run
    self.strategy.setup_environment()
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 154, in setup_environment
    self.setup_distributed()
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 203, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py", line 291, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 1/2 clients joined.

Trial _train_tune_f0f98_00001 errored after 0 iterations at 2024-03-19 13:49:09. Total running time: 1hr 0min 12s
Error file: /ray_results/_train_tune_2024-03-19_12-48-53/_train_tune_f0f98_00001_1_activation=ReLU,batch_size=7,dropout_prob_theta=0.5000,input_size=672,interpolation_mode=linear,learning_2024-03-19_12-48-56/error.txt
```

Versions / Dependencies

- ubuntu: Ubuntu 22.04.4 LTS
- python: 3.10.12

Reproduction script

We experimented with modifications to `long_horizon.py` to include our custom dataset. The command we ran:

`python3 run_nhits.py --horizon=96 --dataset='Custom' --num_samples=144`

Issue Severity

High: It blocks me from completing my task.

jmoralez commented 4 months ago

Hey. It seems like PyTorch Lightning is trying to perform distributed training on two GPUs and failing. Can you try any of these?

- `CUDA_VISIBLE_DEVICES=0 python3 run_nhits.py ...`
- Setting `devices=1` in your config
- Setting `strategy='single_device'` in your config
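
The three options above can be sketched in Python as follows. This is a minimal sketch, not the library's documented recipe: it assumes the config is a plain dict and that extra keys such as `devices` and `strategy` are forwarded to the PyTorch Lightning trainer, as the reply implies. The helper `pin_single_device` and the `base_config` values are hypothetical and only illustrate the idea.

```python
import os

# Option 1, done in-process: expose only GPU 0 to this Python process.
# This must run before torch initializes CUDA, so keep it at the top of the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


def pin_single_device(config: dict, use_strategy: bool = False) -> dict:
    """Return a copy of `config` with a single-device trainer setting added.

    Hypothetical helper: assumes extra config keys reach the
    PyTorch Lightning trainer unchanged.
    """
    pinned = dict(config)  # shallow copy so the caller's config is untouched
    if use_strategy:
        pinned["strategy"] = "single_device"  # option 3 from the reply
    else:
        pinned["devices"] = 1  # option 2 from the reply
    return pinned


# Illustrative values only; a real search space would come from your own setup.
base_config = {"input_size": 672, "batch_size": 7}
single_gpu_config = pin_single_device(base_config)
print(single_gpu_config)
```

Any one of the three settings should be enough on its own; the environment variable has the advantage of needing no change to the config at all.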

koushik-rout-samsung commented 4 months ago

Hi @jmoralez, thanks for the reply. The first workaround works. I noted a few points while running the auto model.

The MSE and MAE are quite high even though the model is given sufficient data. Is there a way to reduce them?
We are trying to predict values in the range 2-25, with a somewhat sinusoidal pattern and a weekly trend. Can you suggest any parameters to change for better accuracy, such as penalizing the model more for less accurate predictions?

jmoralez commented 4 months ago

Can you please open a separate issue for these, or join our Slack and ask these questions in the #neuralforecast channel?

Nixtla / neuralforecast