Nixtla / neuralforecast

Scalable and user friendly neural :brain: forecasting algorithms.
https://nixtlaverse.nixtla.io/neuralforecast
Apache License 2.0

[<Library component: Model|Core|etc...>] AutoTimeMixer is not working #1213

Open skmanzg opened 3 hours ago

skmanzg commented 3 hours ago

What happened + What you expected to happen

I tried AutoTimeMixer after successfully training the ordinary TimeMixer.

The ordinary version below worked well.

Note: Y_train_df["unique_id"].nunique() = 5 and H = 288

from neuralforecast.models import TimeMixer
from neuralforecast.losses.pytorch import MAE, MSE

model = TimeMixer(
    h=H,
    input_size=1440,
    n_series=Y_train_df["unique_id"].nunique(),
    scaler_type='minmax',
    max_steps=500,
    early_stop_patience_steps=10,
    val_check_steps=50,
    learning_rate=1e-3,
    loss=MSE(),
    valid_loss=MAE(),
    batch_size=32,

    # TimeMixer architecture settings
    d_ff=5,
    e_layers=5,

    # Trainer settings
    accelerator='auto',
    devices='auto',
    enable_model_summary=False,
    enable_progress_bar=True
)

AutoTimeMixer, however, does not work with either the Ray or the Optuna backend. I do not understand why this happens only with the Auto version. I tried many different parameter values to make the tensor sizes match, but none of them solved the problem.

The code and logs are in the section below:

Versions / Dependencies

Python 3.10.14; neuralforecast was reinstalled today.

Reproduction script

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTimeMixer
from neuralforecast.losses.pytorch import MSE

H = 288

# Fixed (non-search) configuration, used with the Ray backend
config1 = {
    'n_series': Y_train_df["unique_id"].nunique(),
    'input_size': 1440,
    # 'down_sampling_layers': 5,
    # 'down_sampling_window': 5,
    'scaler_type': 'minmax',
    'batch_size': 64,
}

# Default search space, used with the Optuna backend
config2 = AutoTimeMixer.get_default_config(h=288, backend="optuna", n_series=Y_train_df["unique_id"].nunique())

# Optuna-style wrapper around config1 (defined but not passed to the model below)
def config_o(trial):
    return config1

model = AutoTimeMixer(
    h=H,
    n_series=Y_train_df["unique_id"].nunique(),
    config=config1,
    loss=MSE(),
    valid_loss=MSE(),
    verbose=True,
    backend="ray",   # the error is the same with backend="optuna" and config2
    num_samples=5,
    gpus=1,
)

nf = NeuralForecast(models=[model], freq='10min')

nf.fit(df=Y_train_df, val_size=288)
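For the Optuna backend, the config is passed as a function of a trial rather than as a plain dict, and it needs to include `n_series`. The following is a minimal sketch of such a function, not the library's default search space. The name `config_optuna`, the candidate values, and the `_StubTrial` class are assumptions made for illustration; the stub only stands in for `optuna.trial.Trial` so the sketch runs without Optuna installed.

```python
def config_optuna(trial, n_series=5):
    """Hypothetical Optuna-style search space for AutoTimeMixer."""
    return {
        # Fixed values: every trial uses the same setting
        "n_series": n_series,
        "scaler_type": "minmax",
        "batch_size": 64,
        # Searched values: Optuna picks one per trial (candidate values are illustrative)
        "input_size": trial.suggest_categorical("input_size", [288, 576, 1440]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
    }


class _StubTrial:
    """Stand-in for optuna.trial.Trial; always returns the first choice or lower bound."""

    def suggest_categorical(self, name, choices):
        return choices[0]

    def suggest_float(self, name, low, high, log=False):
        return low


sample = config_optuna(_StubTrial())
print(sample)
```

With a real study, the function itself would be passed as `config=config_optuna` together with `backend="optuna"`, assuming the default `n_series=5` matches the dataset.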

ERROR LOG

---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
Cell In[3], line 60
     42 model = AutoTimeMixer(
     43     h = H,
     44     n_series = Y_train_df["unique_id"].nunique(),
   (...)
     52
     53 )
     58 nf = NeuralForecast(models=[model], freq='10min')
---> 60 nf.fit(df=Y_train_df, val_size=288)

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/core.py:544, in NeuralForecast.fit(self, df, static_df, val_size, sort_df, use_init_models, verbose, id_col, time_col, target_col, distributed_config)
    541     self._reset_models()
    543 for i, model in enumerate(self.models):
--> 544     self.models[i] = model.fit(
    545         self.dataset, val_size=val_size, distributed_config=distributed_config
    546     )
    548 self._fitted = True

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_auto.py:429, in BaseAuto.fit(self, dataset, val_size, test_size, random_seed, distributed_config)
    417     results = self._optuna_tune_model(
    418         cls_model=self.cls_model,
    419         dataset=dataset,
   (...)
    426         distributed_config=distributed_config,
    427     )
    428     best_config = results.best_trial.user_attrs["ALL_PARAMS"]
--> 429 self.model = self._fit_model(
    430     cls_model=self.cls_model,
    431     config=best_config,
    432     dataset=dataset,
    433     val_size=val_size * self.refit_with_val,
    434     test_size=test_size,
    435     distributed_config=distributed_config,
    436 )
    437 self.results = results
    439 # Added attributes for compatibility with NeuralForecast core

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_auto.py:362, in BaseAuto._fit_model(self, cls_model, config, dataset, val_size, test_size, distributed_config)
    358 def _fit_model(
    359     self, cls_model, config, dataset, val_size, test_size, distributed_config=None
    360 ):
    361     model = cls_model(**config)
--> 362     model = model.fit(
    363         dataset,
    364         val_size=val_size,
    365         test_size=test_size,
    366         distributed_config=distributed_config,
    367     )
    368     return model

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_multivariate.py:547, in BaseMultivariate.fit(self, dataset, val_size, test_size, random_seed, distributed_config)
    543 if distributed_config is not None:
    544     raise ValueError(
    545         "multivariate models cannot be trained using distributed data parallel."
    546     )
--> 547 return self._fit(
    548     dataset=dataset,
    549     batch_size=self.n_series,
    550     valid_batch_size=self.n_series,
    551     val_size=val_size,
    552     test_size=test_size,
    553     random_seed=random_seed,
    554     shuffle_train=False,
    555     distributed_config=None,
    556 )

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_model.py:356, in BaseModel._fit(self, dataset, batch_size, valid_batch_size, val_size, test_size, random_seed, shuffle_train, distributed_config)
    354 model = self
    355 trainer = pl.Trainer(**model.trainer_kwargs)
--> 356 trainer.fit(model, datamodule=datamodule)
    357 model.metrics = trainer.callback_metrics
    358 model.__dict__.pop("_trainer", None)

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    536 self.state.status = TrainerStatus.RUNNING
    537 self.training = True
--> 538 call._call_and_handle_interrupt(
    539     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    540 )

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     44 try:
     45     if trainer.strategy.launcher is not None:
---> 46         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     47     return trainer_fn(*args, **kwargs)
     49 except _TunerExitException:

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:144, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
    136 process_context = mp.start_processes(
    137     self._wrapping_function,
    138     args=process_args,
   (...)
    141     join=False,  # we will join ourselves to get the process references
    142 )
    143 self.procs = process_context.processes
--> 144 while not process_context.join():
    145     pass
    147 worker_output = return_queue.get()

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:189, in ProcessContext.join(self, timeout)
    187 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    188 msg += original_trace
--> 189 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 173, in _wrapping_function
    results = function(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 190, in run
    self._optimizer_step(batch_idx, closure)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 268, in _optimizer_step
    call._call_lightning_module_hook(
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 167, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1306, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 153, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 238, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
    out = func(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
    ret = func(self, *args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/adam.py", line 205, in step
    loss = closure()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 317, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 389, in training_step
    return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 640, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "seoul/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_multivariate.py", line 371, in training_step
    output = self(windows_batch)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/models/timemixer.py", line 645, in forward
    y_pred = self.forecast(insample_y, x_mark_enc, x_mark_dec)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/models/timemixer.py", line 576, in forecast
    x = self.normalize_layers[i](x, "norm")
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_modules.py", line 557, in forward
    x = self._normalize(x)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_modules.py", line 588, in _normalize
    x = x * self.affine_weight
RuntimeError: The size of tensor a (2) must match the size of tensor b (5) at non-singleton dimension 2
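
The message above is a broadcasting failure: the last dimension of `x` has size 2, while `affine_weight` has size 5, which equals the reported number of series. The following pure-Python sketch mirrors the trailing-dimension rule that elementwise multiplication follows, so it runs without PyTorch. The helper name `first_broadcast_mismatch` and the example shapes (a batch dimension of 1 and a length of 1440) are assumptions for illustration; only the sizes 2 and 5 come from the log.

```python
def first_broadcast_mismatch(shape_a, shape_b):
    """Return the index of the first incompatible dimension, scanning from the right.

    Two sizes are compatible when they are equal or when either one is 1.
    Returns None when the shapes can be broadcast together.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both have the same number of dimensions
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    for dim in reversed(range(ndim)):
        if a[dim] != b[dim] and 1 not in (a[dim], b[dim]):
            return dim
    return None


# Hypothetical shapes: only the trailing sizes 2 and 5 are taken from the error message
failing = first_broadcast_mismatch((1, 1440, 2), (5,))
working = first_broadcast_mismatch((1, 1440, 5), (5,))
print(failing, working)
```

Under these assumed shapes the mismatch is found at dimension 2, matching the index in the error, and it disappears once the last dimension of `x` holds all 5 series.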

Issue Severity

High: It blocks me from completing my task.

skmanzg commented 3 hours ago

[plus] I have four GPUs, so I set gpus = 4, but the GPUs did not seem to be detected and the run froze. I had to set gpus = 1 to avoid this. According to the documentation, gpus is the number of GPUs I have, so I wonder why this does not work either.
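
The traceback passes through the DDP strategy and reports "Process 2", which suggests that more than one device was in use for the final fit even with gpus = 1. One common way to rule out multi-GPU effects is to expose a single CUDA device to the process before any CUDA-using library is imported. This is a minimal sketch of that check, not a confirmed fix for this issue; the helper name `restrict_to_single_gpu` is hypothetical.

```python
import os


def restrict_to_single_gpu(index=0):
    """Expose only one CUDA device to this process (hypothetical helper).

    Must run before torch or neuralforecast initialise CUDA,
    otherwise the setting has no effect on the running process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_to_single_gpu(0)
print(visible)
```

If the shape error no longer appears with a single visible device, that would point to the batch of series being split across processes.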