allenai / OLMoE

OLMoE: Open Mixture-of-Experts Language Models
https://arxiv.org/abs/2409.02060
Apache License 2.0

Implementing MoE Sparse Upcycling #9

Open adumans opened 2 months ago

adumans commented 2 months ago

Hello OLMoE Authors:

I have read the updates on the sparse upcycling method in the README and tried to implement it. I want to reproduce the Sparse Upcycling results in your paper, which load OLMo-1B (0724) at 2T tokens.

I downloaded the corresponding checkpoint from Hugging Face, but the HF version OLMo-1B-0724-hf (revision="step954000-tokens2000B") has two safetensors files (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors), and safetensors.torch.load_file as used in sparsify_ckpt_unsharded.py does not seem able to load a checkpoint split across two files. So I downloaded OLMo-1B instead, but that version has no "tokens2000B" revision; only "step477000-tokens2001B" is available.
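For reference, the workaround I had in mind looks roughly like this (my own sketch, not code from the repo): load each shard with safetensors.torch.load_file, merge the resulting dicts, and write a single unsharded file that the script can read in one call.

```python
# Sketch of a workaround (not from the repo): merge the two shards of
# OLMo-1B-0724-hf into one file before running sparsify_ckpt_unsharded.py.
from safetensors.torch import load_file, save_file

shards = [
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

state_dict = {}
for shard in shards:
    # each shard is a flat {parameter name: tensor} mapping
    state_dict.update(load_file(shard))

# write a single unsharded file that load_file can read in one call
save_file(state_dict, "model-unsharded.safetensors")
```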

So could you please tell me: 1) Can OLMo-1B (revision=step477000-tokens2001B) reproduce the conclusions in Section 4.1.5 (Sparse Upcycling)? Is it the same as OLMo-1B-0724-hf (revision=step954000-tokens2000B)? 2) Or is there other code that can load both safetensors files (model-00001-of-00002.safetensors and model-00002-of-00002.safetensors in OLMo-1B-0724-hf) and run the conversion from a dense model to an MoE?

Thanks!

BTW, when I loaded OLMo-1B (revision=step477000-tokens2001B) using sparsify_ckpt_unsharded.py, the names in the state_dict look like "model.transformer.blocks.4.ff_proj.weight", so the block index sits at split position 3, not 2, while lines 29 and 51 of the script use block_num = int(key.split(".")[2]).
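To illustrate the mismatch I mean (a toy example of my own, not code from the script):

```python
# Toy illustration of the key-index mismatch: with the "model." prefix,
# the block number sits at split position 3, not 2.
key = "model.transformer.blocks.4.ff_proj.weight"
parts = key.split(".")
# parts == ['model', 'transformer', 'blocks', '4', 'ff_proj', 'weight']
# int(parts[2]) would raise ValueError because parts[2] == 'blocks';
# the block index is actually at parts[3].
block_num = int(parts[3])  # -> 4
```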

Muennighoff commented 2 months ago

Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices ; lmk if you still run into problems!

adumans commented 2 months ago

> Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices ; lmk if you still run into problems!

I have run a small demo with a small portion of data, and the output path contains:

```
config.yaml  data-indices  latest  step1000  step2000  step3000  step4000  step4229  train_data
```

The structure inside a stepxxx path (for example, step4000) is like this:

```
[  84]  step4000
├── [3.7K]  config.yaml
├── [ 275]  model
│   ├── [ 90K]  metadata.json
│   ├── [3.2G]  rank_0.safetensors
│   ├── [3.2G]  rank_1.safetensors
│   ├── [3.2G]  rank_2.safetensors
│   ├── [3.2G]  rank_3.safetensors
│   ├── [3.2G]  rank_4.safetensors
│   ├── [3.2G]  rank_5.safetensors
│   ├── [3.2G]  rank_6.safetensors
│   └── [3.2G]  rank_7.safetensors
├── [ 275]  optim
│   ├── [207K]  metadata.json
│   ├── [6.4G]  rank_0.safetensors
│   ├── [6.4G]  rank_1.safetensors
│   ├── [6.4G]  rank_2.safetensors
│   ├── [6.4G]  rank_3.safetensors
│   ├── [6.4G]  rank_4.safetensors
│   ├── [6.4G]  rank_5.safetensors
│   ├── [6.4G]  rank_6.safetensors
│   └── [6.4G]  rank_7.safetensors
└── [ 134]  train
    ├── [ 14K]  rank0.pt
    ├── [ 14K]  rank1.pt
    ├── [ 14K]  rank2.pt
    ├── [ 14K]  rank3.pt
    ├── [ 14K]  rank4.pt
    ├── [ 14K]  rank5.pt
    ├── [ 14K]  rank6.pt
    └── [ 14K]  rank7.pt
```

Is there a script available that can convert the output checkpoint (stepxxx) into an HF (Hugging Face) format model?

Muennighoff commented 2 months ago

I just added that as step 7 here: https://github.com/allenai/OLMoE/tree/main?tab=readme-ov-file#pretraining ; lmk if still unclear!

adumans commented 2 months ago

> I just added that as step 7 here: https://github.com/allenai/OLMoE/tree/main?tab=readme-ov-file#pretraining ; lmk if still unclear!

I tried running the model conversion to HF format, but I got: "KeyError: transformer.blocks.0.q_norm.weight".

So I traced the error back and found that the checkpoint you provided (https://huggingface.co/allenai/OLMo-1B-0724-954000steps-unsharded) doesn't contain the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.); it only includes the parameters related to the experts (FFN, such as ffn.experts.mlp.w1).

Do I need to run another script to merge these parameters? Or could you provide a checkpoint that contains all parameters? (Ideally also the MoE weights upcycled at 2T tokens, as in Figure 8 of the paper.)
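For reference, this is how I checked which parameter names the checkpoint contains (a quick sketch of my own):

```python
# Quick sketch to list the parameter names stored in a safetensors file
# and check whether any attention-related weights are present.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    keys = list(f.keys())

print(len(keys))
print([k for k in keys if "q_norm" in k or "att" in k])  # empty in my case
```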

Muennighoff commented 2 months ago

For the upcycling ablation we do not use QK Norm, so just deactivate that. You can take a look at this config: https://wandb.ai/ai2-llm/olmoe/runs/1w3srbb3/overview
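Roughly, that means flipping the corresponding flag in a copy of the training config before launching the run. A sketch of what that could look like (the exact key name below is a guess; the linked wandb config is the authoritative reference):

```python
# Rough sketch of disabling QK Norm in a copy of the training config
# before launching the upcycled run.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model"]["attention_layer_norm"] = False  # assumed QK Norm flag

with open("config_no_qknorm.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```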

adumans commented 2 months ago

> doesn't contain the parameters related to self-attention

The configuration I used for the demo above is similar to that one, and its output safetensors also do not contain the parameters related to self-attention.

Do you mean that the self-attention parameters (q_norm, k_norm, v_norm, o_proj, etc.) were kept frozen throughout the continued pretraining of the upcycling ablation? In other words, are these parameters identical to those of the dense model?

Muennighoff commented 2 months ago

they were not used in the upcycling because olmo 1b does not have q_norm, k_norm, v_norm

adumans commented 2 months ago

> they were not used in the upcycling because olmo 1b does not have q_norm, k_norm, v_norm

But the OLMoE model has q_norm, k_norm, and v_norm parameters; where did they come from, if OLMoE is upcycled from OLMo?

Muennighoff commented 2 months ago

olmoe is not upcycled from olmo, sorry for the confusion. Is it not clear from the paper https://arxiv.org/abs/2409.02060 ?

adumans commented 2 months ago

> olmoe is not upcycled from olmo, sorry for the confusion. Is it not clear from the paper https://arxiv.org/abs/2409.02060 ?

Sorry, I think I misunderstood this part. Neither the upcycled MoE nor the "training from scratch" MoE in Figure 8 has the same structure as the final released OLMoE version.

Muennighoff commented 2 months ago

yes they have slightly different hyperparameters

adumans commented 2 months ago

> yes they have slightly different hyperparameters

Thanks! And in the upcycling experiment (Figure 8), was any other data strategy applied to the 610 billion tokens (such as sampling or data mixing)? I noticed that a new class (IterableDataset) was created to solve the problem of deterministic shuffling.

Muennighoff commented 2 months ago

It is the same dataset as used for OLMo-1B, fast-forwarded to start from the batch where OLMo-1B finished (via --fast_forward_batches=136153); see wandb.