duowuyms / NetLLM

MIT License

Model Fine-Tuning Results Do Not Match the Author's Results #10

Open ReamonYim opened 3 weeks ago

ReamonYim commented 3 weeks ago

Dear Author,

I fine-tuned the ABR model following the instructions and using the provided hyperparameters. However, the results I obtained are noticeably different from those reported in the paper. I have attached two figures to illustrate the problem:

Figure 1: the loss curve and the reward curve during my fine-tuning process.
Figure 2: a comparison of the baseline, my fine-tuned large model (purple curve), and your fine-tuned large model (green curve).

As the figures show, my fine-tuned model performs significantly worse than the model you provide, despite using the same hyperparameters. Could you please advise whether there are specific hyperparameters I should adjust further, or whether there are additional configurations I might have missed?

Thank you very much for your assistance!

Best regards, Reamon

[Figure 1: loss and reward curves] [Figure 2: performance comparison]

duowuyms commented 2 weeks ago

Hi, Reamon.

Thanks for using our code!

We encountered a similar problem when we retrained a new model on a different server with the same hyperparameters used for the checkpoint. So we suspect this problem could be caused by differences in hardware settings (e.g., CUDA version, GPU specifications).

Here are some suggestions that may improve your results.

  1. Try different settings for hyperparameters such as "--w", "--target-return-scale", and "--rank".
  2. Try different random seeds for training. This may sound odd, but it can sometimes work wonders (see the sketch after this list).
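
For example, a seed and hyperparameter sweep could be scripted roughly as follows. This is only a sketch: the script name `run_plm.py`, the `--adapt` and `--seed` flags, and the concrete values are placeholders on my part, so please substitute the exact fine-tuning command you used.

```python
# Hypothetical sweep over seeds and the hyperparameters mentioned above.
# NOTE: "run_plm.py", "--adapt", "--seed", and the values below are
# placeholders -- replace them with the actual fine-tuning command/flags.
import itertools
import subprocess

seeds = [0, 1, 2, 42]
ws = [10, 20, 40]

for seed, w in itertools.product(seeds, ws):
    cmd = [
        "python", "run_plm.py", "--adapt",
        "--seed", str(seed),
        "--w", str(w),
        "--target-return-scale", "1.0",
        "--rank", "128",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```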

In addition, here are some suggestions that I think are very likely to improve your results but require additional modifications to the code. You can decide whether to adopt them.

  1. Increase the size of the experience pool.
  2. The model may suffer from exposure bias, so you could try scheduled sampling to improve the training process (a sketch follows this list). See this blog for more details: https://www.activeloop.ai/resources/glossary/scheduled-sampling/
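
To make the second idea concrete, below is a minimal, self-contained sketch of scheduled sampling with an inverse-sigmoid decay, using a toy policy network. It is not NetLLM's actual model or training loop, only an illustration of the technique: with probability p the next step is conditioned on the ground-truth previous action (teacher forcing), and otherwise on the model's own prediction, with p decaying over epochs.

```python
import math
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the policy model: it predicts the next action from the
# current state and the previous action. NetLLM's real model interface
# differs; this only illustrates the sampling schedule.
class TinyPolicy(nn.Module):
    def __init__(self, state_dim=8, n_actions=6, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, prev_action_onehot):
        return self.net(torch.cat([state, prev_action_onehot], dim=-1))

def teacher_forcing_prob(epoch, k=10.0):
    # Inverse-sigmoid decay: near 1 early in training, approaching 0 later.
    return k / (k + math.exp(epoch / k))

def train_sequence(model, optimizer, states, gold_actions, epoch):
    """One sequence-level training step with scheduled sampling."""
    p = teacher_forcing_prob(epoch)
    prev = torch.zeros(model.n_actions)  # "no previous action" at t = 0
    losses = []
    for t in range(states.size(0)):
        logits = model(states[t], prev)
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      gold_actions[t].unsqueeze(0)))
        # Scheduled sampling: with probability p condition the next step on
        # the gold action (teacher forcing); otherwise on the model's own
        # prediction, which matches the inference-time distribution and
        # mitigates exposure bias.
        nxt = gold_actions[t] if random.random() < p else logits.argmax(dim=-1)
        prev = F.one_hot(nxt, model.n_actions).float()
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny usage example with random data.
model = TinyPolicy()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
states = torch.randn(16, 8)               # 16 steps, state_dim = 8
gold_actions = torch.randint(0, 6, (16,))
for epoch in range(3):
    print(epoch, train_sequence(model, opt, states, gold_actions, epoch))
```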

Thank you!

ReamonYim commented 5 days ago

I have tried adjusting several parameter values; the test results are as follows:

  1. --target-return-scale: no significant change.
  2. --rank: performance decreases whether it is increased or decreased.
  3. --w: performance decreases when it is increased; when it is decreased and the number of epochs is increased, performance gets close to the author's model.

In the end, the best performance was achieved with w set to 10. If you encounter similar problems, you can adjust and test in the same way.

[Figure: test results after tuning]