Jimmy-7664 opened this issue 1 month ago
Thank you for your great question.
TL;DR: It comes from unstable route training during the initial phase. You can set warmup_epoch to stabilize/reproduce results.
A detailed explanation follows.
The issue may be caused by suboptimal routing due to unstable initial route learning. You can see each expert's performance by running test.py, confirming that inputs are incorrectly routed. To fix this, you can set a small warmup_epoch (e.g., 5 epochs) to stabilize them.
Admittedly, the current training sequence of the gating networks is suboptimal and needs improvement: it may misroute the input to the second- or third-best expert. Improving this suboptimal route selection is one of our future research goals, and we are doing our best to address it.
Further questions are welcome.
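To make the warmup idea concrete, here is a minimal sketch of how a warmup phase can stabilize a mixture-of-experts forward pass. All names (`mixture_output`, `gating_net`, `experts`, `warmup_epoch`) are illustrative and not TESTAM's actual API: during the first `warmup_epoch` epochs every expert receives the input with uniform weight, so each expert learns something sensible before the gating network starts specializing the routing.

```python
import torch

def mixture_output(x, experts, gating_net, epoch, warmup_epoch=5):
    """Combine expert outputs; route uniformly during warmup epochs.

    experts: list of callables mapping x -> prediction of the same shape.
    gating_net: callable mapping x -> unnormalized routing logits.
    """
    outs = torch.stack([e(x) for e in experts], dim=-1)  # (..., num_experts)
    if epoch < warmup_epoch:
        # uniform routing: every expert gets trained on every input
        weights = torch.full(outs.shape[-1:], 1.0 / len(experts))
    else:
        # learned routing takes over once the experts are stable
        weights = torch.softmax(gating_net(x), dim=-1)
    return (outs * weights).sum(dim=-1)
```

This is only a sketch of the concept; in TESTAM the warmup is controlled by the `warmup_epoch` argument in the training script.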
Thank you for your prompt reply. I set the warmup epoch to 5 and the results did get better, but they still don't seem to beat the baselines in the original article. Is there something else I need to tweak? Here is the new log; I omit the first few epochs to keep it short.
Epoch: 048, Inference Time: 7.5892 secs
Epoch: 048, Train Loss: 3.0352, Train MAPE: 0.0679, Train RMSE: 5.1720, Valid Loss: 2.6798, Valid MAPE: 0.0738, Valid RMSE: 5.1405, Training Time: 185.3764/epoch
Early Termination!
Average Training Time: 187.4990 secs/epoch
Average Inference Time: 7.7179 secs
Training finished
The valid loss on best model is 2.6721
Evaluate best model on test data for horizon 1, Test MAE: 2.2504, Test MAPE: 0.0545, Test RMSE: 3.9607
Evaluate best model on test data for horizon 2, Test MAE: 2.4852, Test MAPE: 0.0621, Test RMSE: 4.6657
Evaluate best model on test data for horizon 3, Test MAE: 2.6497, Test MAPE: 0.0680, Test RMSE: 5.1444
Evaluate best model on test data for horizon 4, Test MAE: 2.7825, Test MAPE: 0.0731, Test RMSE: 5.5305
Evaluate best model on test data for horizon 5, Test MAE: 2.8924, Test MAPE: 0.0776, Test RMSE: 5.8434
Evaluate best model on test data for horizon 6, Test MAE: 2.9856, Test MAPE: 0.0815, Test RMSE: 6.1093
Evaluate best model on test data for horizon 7, Test MAE: 3.0703, Test MAPE: 0.0850, Test RMSE: 6.3443
Evaluate best model on test data for horizon 8, Test MAE: 3.1467, Test MAPE: 0.0881, Test RMSE: 6.5505
Evaluate best model on test data for horizon 9, Test MAE: 3.2150, Test MAPE: 0.0910, Test RMSE: 6.7311
Evaluate best model on test data for horizon 10, Test MAE: 3.2779, Test MAPE: 0.0936, Test RMSE: 6.8935
Evaluate best model on test data for horizon 11, Test MAE: 3.3353, Test MAPE: 0.0959, Test RMSE: 7.0382
Evaluate best model on test data for horizon 12, Test MAE: 3.3931, Test MAPE: 0.0983, Test RMSE: 7.1796
On average over 12 horizons, Test MAE: 2.9570, Test MAPE: 0.0807, Test RMSE: 5.9993
Total time spent: 9417.1009
One more question: I think that after training, TESTAM uses the same expert for all inputs instead of dynamically choosing different experts. Is my understanding correct?
Looking forward to your reply.
We encountered the same problem during replication, using 5 warmup epochs. On PEMS-BAY, the final MAE was 1.385 at the third horizon, 1.687 at the sixth, and 1.952 at the twelfth. This differs significantly from the results reported in the paper and does not exceed many baselines. Do we need to adjust the hyperparameters to achieve the paper's results?
We noticed that there may be improper routing, where only one expert is chosen regardless of regression error. We are now testing a load-balancing loss function to reduce this deficiency.
Even worse, on PEMS-BAY, TESTAM sometimes selects the "worst" expert :( In that case, the MAE can be much larger than the reported one.
We'll do our best to fix the issue, and after testing we'll update the code accordingly.
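For readers unfamiliar with the load-balancing idea mentioned above, here is a minimal sketch of one common form of load-balancing auxiliary loss (in the style used by sparse mixture-of-experts models); this is an illustration of the general technique, not TESTAM's actual implementation. It penalizes the router when both the hard expert assignments and the soft routing probabilities concentrate on a few experts.

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss that pushes the router toward uniform expert usage.

    gate_probs: (batch, num_experts) softmax outputs of the gating network.
    The loss reaches its minimum of 1.0 when routing is uniform and grows
    as routing concentrates on fewer experts.
    """
    num_experts = gate_probs.size(-1)
    # fraction of inputs whose top-1 choice is each expert
    top1 = gate_probs.argmax(dim=-1)
    frac = torch.bincount(top1, minlength=num_experts).float() / gate_probs.size(0)
    # mean routing probability mass assigned to each expert
    prob = gate_probs.mean(dim=0)
    return num_experts * torch.sum(frac * prob)
```

Adding a small multiple of this term to the training loss discourages the "one expert takes everything" collapse described here.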
@Jimmy-7664 @randomforest1111 Thank you for your continuing interest in our paper!
The problem comes from current versions of Python and PyTorch blocking index-based in-place operations. We have now revised our pseudo-label generation process accordingly. For your information, I've left notes in the README.md file.
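As a hypothetical illustration of the kind of change involved (not TESTAM's actual code): in-place indexed assignment on a tensor that participates in autograd can raise a RuntimeError in recent PyTorch versions, while an out-of-place `torch.where` expresses the same "pick the better prediction per element" logic safely.

```python
import torch

def select_pseudo_labels(pred_a, pred_b, err_a, err_b):
    """Keep, per element, the prediction whose error is smaller.

    Out-of-place alternative to something like `out[mask] = pred_b[mask]`,
    which can be rejected by autograd as an in-place operation.
    """
    return torch.where(err_a <= err_b, pred_a, pred_b)
```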
Even though we revised the pseudo-label generation and avoided selecting "improper" experts, we still have some issues; for example, routing may be biased toward one expert. We provide some functions that may help with better routing, such as a load-balancing loss function and uncertainty measurements.
We are still trying to improve our model, so keep in touch with us!
Thank you again for your attention and interest in our paper, and I hope this change resolves your problems.
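On the uncertainty side, a simple and widely used measurement is the entropy of the gate distribution; the sketch below is illustrative (not TESTAM's provided function). High entropy means the router is unsure which expert to trust for a given input, which can flag inputs where routing is unreliable.

```python
import torch

def routing_entropy(gate_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Per-input entropy of the gating distribution.

    gate_probs: (..., num_experts) softmax outputs of the gating network.
    Returns 0 for a fully confident router and log(num_experts) for a
    maximally uncertain (uniform) one.
    """
    return -(gate_probs * (gate_probs + eps).log()).sum(dim=-1)
```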
I ran the code according to the guide in the README without modifying it, but the results I get differ a bit from the paper's. Is there any possible reason for this?
My log here: