Closed zhangtianhong-1998 closed 4 months ago
I would very much appreciate it if the authors could provide a script file consistent with the paper. I tried to set the parameters myself as described in the paper, but I could not replicate the reported test results.
Thank you very much for your interest in our work and for the effort you have put into replicating our experimental results. We understand the challenges that can arise when attempting to replicate the results of deep learning models, especially when multiple GPUs and advanced optimization techniques such as DeepSpeed's ZeRO are involved.
Our training process uses the ZeRO-2 optimizer under the DeepSpeed framework, which partitions optimizer states and gradients to accelerate large-scale training. Within this process, we use gradient accumulation to simulate larger batch sizes. This helps us manage limited hardware resources, but it also changes how gradients are combined before each weight update, which can introduce run-to-run variation. We also adopt mixed-precision training, specifically bfloat16 (bf16), which significantly reduces memory usage and speeds up training; however, the lower-precision floating-point representation introduces additional numerical error, another source of variation in the results. These techniques are common and effectively necessary for training large models in modern deep learning frameworks, even though they may cause slight fluctuations in results. In addition, the runtime environment, framework versions, CUDA versions, and similar factors can introduce a further degree of nondeterminism.
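To see why accumulation and reduction order alone can shift results, here is a minimal, generic illustration (not taken from our codebase): floating-point addition is not associative, so summing the same gradient contributions in a different order, as can happen with gradient accumulation or ZeRO's distributed gradient reduction, may change the result in the last bits. At bf16 precision the gap is correspondingly larger.

```python
# Floating-point addition is not associative: the same three values summed
# in two different orders give results that differ in the last bit.
# Gradient accumulation and distributed reduction change summation order,
# so tiny differences like this can compound over many training steps.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one accumulation order
right = a + (b + c)  # another accumulation order

print(left == right)      # False
print(abs(left - right))  # a one-ulp-scale discrepancy
```

Fixing random seeds (Python, NumPy, and the framework's own seed) narrows but does not eliminate this class of variation, since the summation order itself depends on the parallel execution schedule.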
If you have more specific questions or need assistance, please feel free to contact us by email for further discussion.