Closed koushik-rout-samsung closed 4 months ago
Hey. It seems like PyTorch Lightning is trying to perform distributed training on two GPUs and failing. Can you try any of these?
- `CUDA_VISIBLE_DEVICES=0 python3 run_nhits.py ...`
- `devices=1` in your config
- `strategy='single_device'` in your config

Hi @jmoralez, thanks for the reply. The first workaround works. I noted a few points while running the auto model.
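For reference, the three workarounds suggested above could be sketched in Python roughly as follows. This is only an illustration: `with_single_device` is a hypothetical helper (not part of neuralforecast), and it assumes the config is a plain dict whose extra keys such as `devices` and `strategy` are forwarded to the Lightning trainer. The keys in `base_config` are placeholders.

```python
import os

# Workaround 1: expose only GPU 0 to this process. This has to run before
# CUDA is initialised, so it belongs at the very top of the script.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")


def with_single_device(config, use_strategy=False):
    """Return a copy of `config` pinned to a single device.

    Hypothetical helper: assumes extra keys in the config are passed
    through to the trainer.
    """
    pinned = dict(config)  # shallow copy, the caller's dict is not mutated
    pinned["devices"] = 1  # workaround 2
    if use_strategy:
        pinned["strategy"] = "single_device"  # workaround 3
    return pinned


# Illustrative config fragment; these keys are placeholders.
base_config = {"max_steps": 100, "learning_rate": 1e-3}
print(with_single_device(base_config))
```

Any one of the three should be enough on its own; the helper only shows where each setting would go.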
What happened + What you expected to happen
We tried to experiment with auto_nhits on our custom dataset and ran into the following error while training the model.
Stacktrace
```python
(_train_tune pid=386862) Seed set to 3
(_train_tune pid=386862) [rank: 0] Seed set to 3
(_train_tune pid=386862) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
2024-03-19 13:49:09,602 ERROR tune_controller.py:1374 -- Trial task failed for trial _train_tune_f0f98_00001
Traceback (most recent call last):
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistStoreError): ray::ImplicitFunc.train() (pid=386862, ip=107.99.237.41, actor_id=d3ee59cd38f38e7df8fcc1aa01000000, repr=_train_tune)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 342, in train
    raise skipped from exception_cause(skipped)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/air/_internal/util.py", line 88, in run
    self._ret = self._target(*self._args, **self._kwargs)
  File "/neural_forecast/nf_env/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 115, in
```
Versions / Dependencies
- ubuntu: Ubuntu 22.04.4 LTS
- python: 3.10.12
Reproduction script
We experimented with a modified long_horizon.py that includes our custom dataset:

`python3 run_nhits.py --horizon=96 --dataset='Custom' --num_samples=144`
Issue Severity
High: It blocks me from completing my task.