allenai / OLMoE

OLMoE: Open Mixture-of-Experts Language Models
https://arxiv.org/abs/2409.02060
Apache License 2.0

Implementing MoE Sparse Upcycling #9

Open adumans opened 1 week ago

adumans commented 1 week ago

Hello OLMoE Authors:

I have read the updates on the sparse upcycling method in the README and tried to implement it. I want to reproduce the sparse upcycling results in your paper, which load OLMo-1B (0724) at 2T tokens.

I downloaded the corresponding checkpoint from Hugging Face, but the HF version OLMo-1B-0724-hf (revision="step954000-tokens2000B") has two safetensors files (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors), and the safetensors.torch.load_file call used in sparsify_ckpt_unsharded.py cannot load a checkpoint split across two files. So I downloaded OLMo-1B instead, but that version has no "tokens2000B" revision; only "step477000-tokens2001B" is available.
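For reference, a minimal sketch of what I mean by loading the two shards into one state dict before the sparsify step (the local directory path is just a placeholder):

```python
# Sketch: merge a Hugging Face checkpoint that is split across several
# safetensors shards into a single state dict. Assumes the shards were
# downloaded into `ckpt_dir` (placeholder path).
import glob
import os

from safetensors.torch import load_file

ckpt_dir = "OLMo-1B-0724-hf"  # placeholder for the local checkpoint directory
state_dict = {}
for shard in sorted(glob.glob(os.path.join(ckpt_dir, "model-*-of-*.safetensors"))):
    state_dict.update(load_file(shard))  # each shard holds a disjoint subset of tensors
print(f"loaded {len(state_dict)} tensors")
```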

So could you please tell me:

1) Can OLMo-1B (revision=step477000-tokens2001B) reproduce the conclusions in Section 4.1.5 (Sparse Upcycling)? Is it the same checkpoint as OLMo-1B-0724-hf (revision=step954000-tokens2000B)?
2) Or is there other code that can load the two safetensors files (model-00001-of-00002.safetensors and model-00002-of-00002.safetensors in OLMo-1B-0724-hf) and run the conversion from a dense model to an MoE?

Thanks!

BTW, when I loaded OLMo-1B (revision=step477000-tokens2001B) with sparsify_ckpt_unsharded.py, the keys in the state_dict look like "model.transformer.blocks.4.ff_proj.weight", so the block index sits at position 3 rather than 2, yet lines 29 and 51 use block_num = int(key.split(".")[2]).
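To illustrate the mismatch (the key name is copied from my run; the prefix-agnostic indexing below is just my workaround, not your code):

```python
# Key name as it appears when loading OLMo-1B (step477000-tokens2001B):
key = "model.transformer.blocks.4.ff_proj.weight"

# sparsify_ckpt_unsharded.py (lines 29 and 51) assumes keys without the
# leading "model." prefix:
#   block_num = int(key.split(".")[2])

# A prefix-agnostic alternative: take the field right after "blocks".
parts = key.split(".")
block_num = int(parts[parts.index("blocks") + 1])
print(block_num)  # -> 4
```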

Muennighoff commented 1 week ago

Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices ; lmk if you still run into problems!

adumans commented 6 days ago

> Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices ; lmk if you still run into problems!

I ran a small demo on a small portion of the data, and the output path contains the following: `config.yaml  data-indices  latest  step1000  step2000  step3000  step4000  step4229  train_data`

The structure inside the stepxxx (for example, step4000) path is like this:

```
[  84]  step4000
├── [3.7K]  config.yaml
├── [ 275]  model
│   ├── [ 90K]  metadata.json
│   ├── [3.2G]  rank_0.safetensors
│   ├── [3.2G]  rank_1.safetensors
│   ├── [3.2G]  rank_2.safetensors
│   ├── [3.2G]  rank_3.safetensors
│   ├── [3.2G]  rank_4.safetensors
│   ├── [3.2G]  rank_5.safetensors
│   ├── [3.2G]  rank_6.safetensors
│   └── [3.2G]  rank_7.safetensors
├── [ 275]  optim
│   ├── [207K]  metadata.json
│   ├── [6.4G]  rank_0.safetensors
│   ├── [6.4G]  rank_1.safetensors
│   ├── [6.4G]  rank_2.safetensors
│   ├── [6.4G]  rank_3.safetensors
│   ├── [6.4G]  rank_4.safetensors
│   ├── [6.4G]  rank_5.safetensors
│   ├── [6.4G]  rank_6.safetensors
│   └── [6.4G]  rank_7.safetensors
└── [ 134]  train
    ├── [ 14K]  rank0.pt
    ├── [ 14K]  rank1.pt
    ├── [ 14K]  rank2.pt
    ├── [ 14K]  rank3.pt
    ├── [ 14K]  rank4.pt
    ├── [ 14K]  rank5.pt
    ├── [ 14K]  rank6.pt
    └── [ 14K]  rank7.pt
```

Is there a script available that can convert the output checkpoint (stepxxx) into an HF (Hugging Face) format model?

Muennighoff commented 6 days ago

I just added that as 7. here: https://github.com/allenai/OLMoE/tree/main?tab=readme-ov-file#pretraining ; Lmk if still unclear!

adumans commented 5 days ago

> I just added that as 7. here: https://github.com/allenai/OLMoE/tree/main?tab=readme-ov-file#pretraining ; Lmk if still unclear!

I tried running the model conversion to HF format, but I got: "KeyError: transformer.blocks.0.q_norm.weight".

So I traced back this error and found that the checkpoint you provided (https://huggingface.co/allenai/OLMo-1B-0724-954000steps-unsharded) doesn't contain the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.); it only includes the parameters related to the experts (FFN, such as ffn.experts.mlp.w1, etc.).
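For reference, this is roughly how I checked the keys (the file name is a placeholder for the downloaded unsharded checkpoint, and the substrings are just the parameter names I expected to find):

```python
# Rough sketch: list which parameter groups an unsharded checkpoint contains.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # placeholder path
attn_keys = [k for k in state_dict if any(s in k for s in ("q_norm", "k_norm", "att_proj", "attn_out"))]
expert_keys = [k for k in state_dict if "experts" in k]
print(f"{len(attn_keys)} attention-related keys, {len(expert_keys)} expert-related keys")
```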

Do I need to run another script to merge these parameters? Or could you provide a checkpoint that contains all the parameters? (Also, an MoE checkpoint upcycled at 2T tokens, as in Figure 8 of the paper.)

Muennighoff commented 5 days ago

For the upcycling ablation we do not use QK Norm, so just deactivate that. You can take a look at this config: https://wandb.ai/ai2-llm/olmoe/runs/1w3srbb3/overview
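Roughly something like this (a sketch only; the field name below is from memory, so treat the linked wandb config as the source of truth):

```python
# Sketch: turn off QK-Norm in an OLMo-style training config before the
# upcycling run. `attention_layer_norm` is the field I mean here, but
# double-check the exact key against the wandb config linked above.
from omegaconf import OmegaConf

cfg = OmegaConf.load("config.yaml")      # placeholder for the upcycling config
cfg.model.attention_layer_norm = False   # deactivate QK-Norm
OmegaConf.save(cfg, "config_no_qknorm.yaml")
```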

adumans commented 4 days ago

> doesn't contain the parameters related to self-attention

The configuration I used for running the aforementioned demo is similar to this one, and its output safetensors do not contain the parameters related to self-attention.

Do you mean that the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.) were kept frozen throughout the continued pretraining of the upcycling ablation? In other words, are these parameters identical to the dense model's?

Muennighoff commented 4 days ago

they were not used in the upcycling because olmo 1b does not have q_norm, k_norm, v_norm

adumans commented 4 days ago

> they were not used in the upcycling because olmo 1b does not have q_norm, k_norm, v_norm

But the OLMoE model has q_norm, k_norm, and v_norm parameters; where did they come from? (As OLMoE is upcycled from OLMo.)

Muennighoff commented 4 days ago

olmoe is not upcycled from olmo, sorry for the confusion. Is it not clear from the paper https://arxiv.org/abs/2409.02060 ?

adumans commented 4 days ago

> olmoe is not upcycled from olmo, sorry for the confusion. Is it not clear from the paper https://arxiv.org/abs/2409.02060 ?

Sorry, I think I misunderstood this part. Neither the upcycled MoE nor the "training from scratch" MoE in Figure 8 has the same structure as the final released OLMoE version.

Muennighoff commented 4 days ago

yes they have slightly different hyperparameters