Thanks for the comment! I've added more details to https://github.com/allenai/OLMoE/blob/main/README.md#other-design-choices ; lmk if you still run into problems!
I have run a small demo with a small portion of the data, and the contents of the output path look like this:

```
config.yaml  data-indices  latest  step1000  step2000  step3000  step4000  step4229  train_data
```
The structure inside a stepxxx path (for example, step4000) looks like this:

```
[  84]  step4000
├── [3.7K]  config.yaml
├── [ 275]  model
│   ├── [ 90K]  metadata.json
│   ├── [3.2G]  rank_0.safetensors
│   ├── [3.2G]  rank_1.safetensors
│   ├── [3.2G]  rank_2.safetensors
│   ├── [3.2G]  rank_3.safetensors
│   ├── [3.2G]  rank_4.safetensors
│   ├── [3.2G]  rank_5.safetensors
│   ├── [3.2G]  rank_6.safetensors
│   └── [3.2G]  rank_7.safetensors
├── [ 275]  optim
│   ├── [207K]  metadata.json
│   ├── [6.4G]  rank_0.safetensors
│   ├── [6.4G]  rank_1.safetensors
│   ├── [6.4G]  rank_2.safetensors
│   ├── [6.4G]  rank_3.safetensors
│   ├── [6.4G]  rank_4.safetensors
│   ├── [6.4G]  rank_5.safetensors
│   ├── [6.4G]  rank_6.safetensors
│   └── [6.4G]  rank_7.safetensors
└── [ 134]  train
    ├── [ 14K]  rank0.pt
    ├── [ 14K]  rank1.pt
    ├── [ 14K]  rank2.pt
    ├── [ 14K]  rank3.pt
    ├── [ 14K]  rank4.pt
    ├── [ 14K]  rank5.pt
    ├── [ 14K]  rank6.pt
    └── [ 14K]  rank7.pt
```
Is there a script available that can convert the output checkpoint (stepxxx) into a Hugging Face (HF) format model?
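For context, here is a quick way to peek at what each shard actually stores (just a sketch, not part of the repo; the path is a placeholder for the output directory above):

```python
# Sanity-check sketch: list a few parameter names and shapes stored in one
# model shard of a stepXXXX checkpoint. The path is a placeholder.
from safetensors import safe_open

with safe_open("step4000/model/rank_0.safetensors", framework="pt", device="cpu") as f:
    for key in list(f.keys())[:10]:
        print(key, tuple(f.get_slice(key).get_shape()))
```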
I just added that as step 7 here: https://github.com/allenai/OLMoE/tree/main?tab=readme-ov-file#pretraining ; lmk if still unclear!
I tried running the model conversion to HF format, but I got: `KeyError: transformer.blocks.0.q_norm.weight`.
So I traced back this error and found that the checkpoint you provided (here https://huggingface.co/allenai/OLMo-1B-0724-954000steps-unsharded) doesn't contain the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.); it only includes the parameters related to the experts (FFN, such as ffn.experts.mlp.w1, etc.).
Do I need to run another script to merge these parameters? Or could you provide a checkpoint that contains all parameters? (Also, an MoE checkpoint upcycled at 2T tokens, as in Figure 8 of the paper.)
For the upcycling ablation we do not use QK Norm, so just deactivate that. You can take a look at this config: https://wandb.ai/ai2-llm/olmoe/runs/1w3srbb3/overview
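If in doubt, a quick check (just a sketch, not official tooling; the checkpoint path and the `attention_layer_norm` flag name are assumptions on my side) is to look for q_norm/k_norm keys in the dense checkpoint you upcycle from and set the config flag accordingly:

```python
# Sketch: detect whether the dense checkpoint carries QK-Norm weights, so the
# upcycling config can disable the corresponding option (assumed here to be
# model.attention_layer_norm). The checkpoint path is a placeholder.
from safetensors.torch import load_file

state_dict = load_file("path/to/unsharded/model.safetensors")
has_qk_norm = any(".q_norm." in k or ".k_norm." in k for k in state_dict)
print("attention_layer_norm should be:", has_qk_norm)  # expected False for OLMo 1B
```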
> doesn't contain the parameters related to self-attention
The configuration I used for running the aforementioned demo is similar to this one, and its output safetensors also do not contain the parameters related to self-attention.
Do you mean that the parameters related to self-attention (q_norm, k_norm, v_norm, o_proj, etc.) were kept frozen throughout the upcycling ablation's continued pretraining? In other words, are these parameters identical to the dense model's?
they were not used in the upcycling because olmo 1b does not have q_norm, k_norm, v_norm
But the OLMoE model has q_norm, k_norm, v_norm parameters; where did they come from? (As OLMoE is upcycled from OLMo.)
olmoe is not upcycled from olmo, sorry for the confusion. Is it not clear from the paper https://arxiv.org/abs/2409.02060 ?
Sorry, I think I misunderstood this part. Neither the upcycled MoE nor the "training from scratch" MoE in Figure 8 has the same structure as the final released OLMoE version.
yes they have slightly different hyperparameters
Thanks! And in the upcycling experiment (Figure 8), was any other data strategy applied to the 610 billion tokens (such as sampling, data mixing, etc.)? I noticed a new class (IterableDataset) was created to handle deterministic shuffling.
It is the same dataset as used for OLMo 1B, fast-forwarded to start from the batch where OLMo 1B finished (via `--fast_forward_batches=136153`); see wandb.
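Conceptually, the fast-forwarding just skips the first N global batches of the deterministically shuffled instance stream. A toy sketch of the idea (not the actual OLMo/OLMoE trainer code; the class and argument names are made up for illustration):

```python
# Toy illustration of --fast_forward_batches: drop the first
# fast_forward_batches * global_batch_size instances from the deterministic
# stream so training resumes on data the dense model has not seen yet.
import itertools
from torch.utils.data import IterableDataset

class FastForwardedDataset(IterableDataset):
    def __init__(self, base_dataset, fast_forward_batches, global_batch_size):
        self.base_dataset = base_dataset
        self.skip = fast_forward_batches * global_batch_size  # instances to drop

    def __iter__(self):
        return itertools.islice(iter(self.base_dataset), self.skip, None)
```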
Hello OLMoE Authors:
I have read the updates on the sparse upcycling method in the README and tried to implement it. I want to reproduce the sparse upcycling conclusions in your paper, where OLMo-1B (0724) is loaded at 2T tokens.
I downloaded the corresponding checkpoint from Hugging Face, but the HF version OLMo-1B-0724-hf (revision="step954000-tokens2000B") has two safetensors files (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors), and maybe the `safetensors.torch.load_file` call used in sparsify_ckpt_unsharded.py can't load two safetensors files. So I downloaded OLMo-1B instead, but that version has no "tokens2000B" revision; only "step477000-tokens2001B" is available. Could you please tell me:

1. Can OLMo-1B (revision=step477000-tokens2001B) reproduce the conclusions in 4.1.5 Sparse Upcycling? Is it the same as OLMo-1B-0724-hf (revision=step954000-tokens2000B)?
2. Or is there other code that can load the two safetensors files (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors in OLMo-1B-0724-hf) and run the conversion from a dense model to an MoE? (See the sketch below for one possible workaround.)
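For (2), here is a minimal sketch of a workaround (not the official script; the directory path is a placeholder for a local snapshot of the revision) that merges the two shards into one state dict before the dense-to-MoE conversion:

```python
# Merge the two HF safetensors shards into a single state dict; the shards
# hold disjoint keys, so a plain dict update is enough. Path is a placeholder.
from pathlib import Path
from safetensors.torch import load_file

ckpt_dir = Path("OLMo-1B-0724-hf")  # local snapshot of revision step954000-tokens2000B
state_dict = {}
for shard in sorted(ckpt_dir.glob("model-*-of-*.safetensors")):
    state_dict.update(load_file(str(shard)))

print(f"merged {len(state_dict)} tensors")
```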
Thanks!
BTW, when I loaded OLMo-1B (revision=step477000-tokens2001B) with sparsify_ckpt_unsharded.py, the names in the state_dict look like "model.transformer.blocks.4.ff_proj.weight", so the block index sits at position 3, not 2, but lines 29 and 51 of the script use `block_num = int(key.split(".")[2])`.