KimMeen / Time-LLM

[ICLR 2024] Official implementation of "🦙 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models"
https://arxiv.org/abs/2310.01728
Apache License 2.0

Unable to reproduce results #109

Closed: fil-mp closed this 3 months ago

fil-mp commented 3 months ago

Hello!

Congrats on this work! I too am not able to replicate your results as reported here: https://github.com/KimMeen/Time-LLM/issues/51.

Would it be possible for you to share the hyperparameters (basically the seed) that yielded the best results in your experiments? Are they the ones in the scripts directory? I understand the hardware setup may be the reason for this discrepancy. I was also wondering whether you have conducted any experiments using fewer A80 GPUs to see how the results vary with the hardware configuration. I would greatly appreciate any insights or suggestions that could help me understand and replicate your findings.

Thank you in advance for any reply!

kwuking commented 3 months ago

Thank you for your interest in our work. I currently do not have the A80 GPU model; I am using 8 A100 GPUs with 80GB each for experimentation. Given the mixed-precision mode commonly used in LLM training and the gradient clipping performed by DeepSpeed, ensuring consistent results for deep learning models, and LLMs in particular, across different devices and environments is genuinely challenging. We understand your concerns and are working hard to overcome this issue in our latest and upcoming work, and we will share any progress in this area with everyone in a timely manner.
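
For reference, the kind of per-run determinism controls I am referring to look roughly like this (a minimal plain-PyTorch sketch, not the exact code in this repository; the seed value is illustrative). Even with all of these set, fp16/bf16 accumulation order and DeepSpeed's collective operations can still differ between GPU models, which is why bit-wise reproduction across hardware is hard:

```python
import os
import random

import numpy as np
import torch


def set_deterministic(seed: int = 2021) -> None:
    """Pin every RNG we can and ask PyTorch for deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Disable the cuDNN autotuner and request deterministic cuDNN kernels.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Required by some CUDA GEMM kernels before enabling deterministic algorithms.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True, warn_only=True)
```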

fil-mp commented 3 months ago

Sorry, I meant fewer A100 GPUs (or even A40s). Thank you for your reply.

kwuking commented 3 months ago

> Sorry, I meant fewer A100 GPUs (or even A40s). Thank you for your reply.

I currently do not have access to A40 devices to run tests. I have experimented with 2 A100 GPUs, but reducing the overall batch size did result in slower training, and I have not yet conducted a complete test under these conditions. Additionally, I have tried other model compression methods to improve speed and reduce memory usage, but they led to changes in accuracy. I have not yet found a good solution and am seeking support from our engineering team to explore whether we can accelerate the model from a compilation-optimization perspective, such as with XLA, without affecting accuracy. Honestly, accelerating large models while maintaining accuracy is a very difficult and challenging task. I am still actively exploring it and would be delighted to share any preliminary results with you if I make progress.
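
One mitigation worth trying on fewer GPUs is gradient accumulation, so that the effective batch size (and therefore the optimization trajectory) stays close to the 8-GPU setup even though training is slower. A rough sketch of the arithmetic, with illustrative numbers rather than the repo's actual settings:

```python
# Effective batch size = per-GPU micro-batch * number of GPUs * accumulation steps.
per_gpu_batch = 4       # illustrative micro-batch size
reference_gpus = 8      # the 8 x A100 setup mentioned above
available_gpus = 2      # a smaller setup

reference_effective = per_gpu_batch * reference_gpus  # 32
accum_steps = reference_effective // (per_gpu_batch * available_gpus)
print(accum_steps)  # 4 accumulation steps recover the same effective batch size
```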

fil-mp commented 3 months ago

Thank you so much! I am looking forward to it.

In the meantime, to test on my configuration, could you confirm that the seed yielding the best results is the one used in your run_main? And the hyperparameters are the ones in the bash files in your scripts directory?

That would be really helpful!

kwuking commented 3 months ago

> Thank you so much! I am looking forward to it.
>
> In the meantime, to test on my configuration, could you confirm that the seed yielding the best results is the one used in your run_main? And the hyperparameters are the ones in the bash files in your scripts directory?
>
> That would be really helpful!

The results in our final paper are reported as the mean over multiple random seeds. In our current open-source code we have taken this into account and followed the widely used settings in TSLib. Since most current research uses similar settings, you can use this configuration for your experiments.
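
Concretely, "mean over multiple seeds" means something along these lines (a sketch only; train_and_evaluate stands in for one full run of run_main and is not a helper that exists in this repository, and the seed list is illustrative):

```python
import numpy as np


def train_and_evaluate(seed: int) -> dict:
    """Placeholder: run one full train + test pass with the given seed
    and return the test metrics, e.g. {"mse": ..., "mae": ...}."""
    raise NotImplementedError


seeds = [2021, 2022, 2023]  # illustrative seed set
results = [train_and_evaluate(s) for s in seeds]

mse = np.array([r["mse"] for r in results])
mae = np.array([r["mae"] for r in results])
print(f"MSE: {mse.mean():.4f} +/- {mse.std():.4f}")
print(f"MAE: {mae.mean():.4f} +/- {mae.std():.4f}")
```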

SajayR commented 3 months ago

I've been dealing with the same issue and am running on 4 A100s, with not a single run coming anywhere near the paper's results (they are instead extremely similar to @fil-mp's reported evals). I would love any help to replicate the results. Especially if the reported numbers are averages across multiple runs, it's quite confusing that none of our iterations gets close to them. I'll keep experimenting and try to figure out what's causing the issue.