epfLLM / Megatron-LLM

distributed trainer for LLMs

[Megatron Base Version] Would you mind sharing the base version of Megatron? #67

Closed · dumpmemory closed this 9 months ago

dumpmemory commented 10 months ago

I have found that the code for get_checkpoint_name(s) and DistributedOptimizer differs from upstream, and the upstream version has fixed many bugs since. Would you mind rebasing?

dumpmemory commented 10 months ago

_copy_model_params_to_main_params is missing from DistributedOptimizer,

and the logic for resolving the distributed optimizer checkpoint name is also different.

In the current code, if I add --use_distributed_optimizer, I get the error "data parallel group is not initialized".
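
For reference, the missing helper in upstream roughly copies each fp16/bf16 model parameter shard into the fp32 main copy owned by the distributed optimizer. A minimal sketch of that idea, assuming upstream-style attribute and helper names (model_float16_groups, shard_fp32_from_float16_groups, get_model_param_range_map), which may not match this fork exactly:

```python
def _copy_model_params_to_main_params(self):
    """Sketch: copy fp16/bf16 model params into the optimizer's fp32 main shards.

    The attribute and helper names below follow the upstream DistributedOptimizer
    layout and are assumptions about this fork, not its actual API.
    """
    for model_group, shard_main_group in zip(self.model_float16_groups,
                                             self.shard_fp32_from_float16_groups):
        for model_param, shard_main_param in zip(model_group, shard_main_group):
            # Each data-parallel rank owns only a contiguous slice of the flattened param.
            param_range = self.get_model_param_range_map(model_param)["param"]
            shard = model_param.detach().view(-1)[param_range.start:param_range.end]
            shard_main_param.data.copy_(shard)
```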

dumpmemory commented 10 months ago

#68 adds the missing function.

martinjaggi commented 10 months ago

The base version of NVIDIA/Megatron-LM that our code branched off is from March 28, 2023.

Since then, the structure of the base repo has also been refactored a bit by the NVIDIA team; in terms of functionality, though, not much has changed. Could you point to a concrete bug that is present in our code but not in their updated version, and that impacts usage?

So far we haven't encountered bugs in our training with Llama 2 models of all sizes.

Of course it would be best to rebase the code on top of the newest Megatron-LM, but this would take quite some effort. If anyone would like to help prepare the code, that would be more than welcome.

dumpmemory commented 10 months ago

If I have time, I am willing to do so. Would you mind providing the commit hash you modified from? The current repo has removed the git history from NVIDIA/Megatron-LM.

kylematoba commented 9 months ago

Hi, sorry, we seem to have lost the actual commit, but we're pretty sure it's 035cae2ef9cc770784a3c3f2f46ecf9cd0d1380c, based on the timing.

13416157913 commented 9 months ago


I meet the same issue: when I use --use_distributed_optimizer, it errors with "data parallel group is not initialized". How can I solve it?

dumpmemory commented 9 months ago

> I meet the same issue: when I use --use_distributed_optimizer, it errors with "data parallel group is not initialized". How can I solve it?

Please see https://github.com/epfLLM/Megatron-LLM/pull/68
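
For context, that assertion text comes from Megatron's parallel-state helpers, which raise it whenever the data-parallel group is requested before initialize_model_parallel() has run. A minimal sketch of the required ordering, assuming the upstream-style megatron.core.parallel_state API (the module path may differ in this fork):

```python
import torch.distributed as dist
from megatron.core import parallel_state  # module path assumed from upstream Megatron-LM

# The distributed optimizer shards optimizer state across the data-parallel group,
# so it calls parallel_state.get_data_parallel_group() during setup. That call
# asserts with "data parallel group is not initialized" unless the parallel state
# has been set up first.

# Single-process example setup (hypothetical address/port).
dist.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:29500",
                        world_size=1, rank=0)
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)
dp_group = parallel_state.get_data_parallel_group()  # succeeds only after the call above
```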