EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight #645

Open mcao516 opened 2 years ago

mcao516 commented 2 years ago

Describe the bug:
RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([25216, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).
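For context on the specific numbers: GPT-NeoX pads the vocabulary so the embedding divides evenly across model-parallel ranks, Megatron-style. A minimal sketch of that padding rule (this mirrors Megatron's `_vocab_size_with_padding`, not the repo's exact code, and assumes a raw tokenizer vocab of about 50277 for the 20B HFTokenizer and the default `make_vocab_size_divisible_by = 128`):

```python
def padded_vocab_size(vocab_size, divisible_by=128, model_parallel_size=1):
    # Pad the vocab until it splits evenly across model-parallel ranks.
    multiple = divisible_by * model_parallel_size
    while vocab_size % multiple != 0:
        vocab_size += 1
    return vocab_size

print(padded_vocab_size(50277, model_parallel_size=1))       # 50304 (shape in the current model)
print(padded_vocab_size(50277, model_parallel_size=2) // 2)  # 25216 (shape in the checkpoint shard)
```

Under these assumptions, 25216 is exactly half of the mp=2 padded vocab (50432), while 50304 is the mp=1 padding, which suggests the checkpoint was saved with model-parallel size 2 but is being loaded into a model built with model-parallel size 1.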

To Reproduce

  1. Download Slim weights
  2. Update the vocabulary and checkpoint path in ./configs/20B.yml (HFTokenizer is used)
  3. Run: ./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt


Environment (please complete the following information):

jdagdelen commented 2 years ago

I'm experiencing this too, and I'm not sure what I'm doing wrong. I downloaded the weights from here, which is the "fixed" link from #646. I also downloaded the slim weights, and those seem to load OK, although the model's output is gibberish.

FayZ676 commented 1 year ago

I am getting the same problem when trying to train a 1-3B model.

To Reproduce:

  1. Download Slim weights
  2. Update ./configs/1-3B.yml as shown in the screenshots below.
  3. Run python ./deepy.py train.py -d configs 1-3B.yml

Screenshots: [two screenshots of the updated 1-3B.yml config, dated 2022-12-09]

Environment:

binglun30 commented 1 year ago

I also had the same problem. When loading the slim weights downloaded from GitHub on a single machine, I got a similar error. [Screenshot of the error message]

Environment:

GPUs: 4x RTX 3090 (96 GB total)
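For what it's worth, the released 20B checkpoints are sharded for a fixed model-parallel degree, so the parallelism settings in the config have to match the shard layout. As a point of reference, the relevant excerpt of ./configs/20B.yml looks roughly like the following (values assumed from the shipped config; verify against your copy):

```yaml
{
  # Excerpt from ./configs/20B.yml (assumed values; check your copy).
  # The pipeline degree can usually be re-balanced across however many
  # GPUs you have, but the model-parallel degree must match the
  # model_00/model_01 shard layout the checkpoint was saved with.
  "pipe-parallel-size": 4,
  "model-parallel-size": 2,
}
```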

djaym7 commented 1 year ago

What's the solution? And why was this closed?

StellaAthena commented 1 year ago

@djaym7 Thanks for saying something. I don't recall closing this and have reopened it.

StellaAthena commented 1 year ago

@FayZ676 the URL you're linking to does not contain the weights for a 1.3B model; it contains the weights for a 20B model. That's why you're getting a size mismatch: it's quite simply the wrong size. I suspect that this is unrelated to the problems the others are having.

@leclem so that change allows you to finetune the 20B model? Can you post a WandB link showing it training so I can check that the loss etc. are as expected?

shaunstoltz commented 10 months ago

I have the same issue when trying to train. I downloaded the slim weights, and with ./configs/20B.yml, running "python3 ./deepy.py train.py ./configs/20B.yml" gives this error:

RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([12608, 6144]) from checkpoint, the shape in current model is torch.Size([12672, 6144]).

dashstander commented 9 months ago

I suspect that this is an error that has to do with model parallelism. @shaunstoltz how many GPUs were you loading the model onto / what was the model parallelism setting?
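For anyone debugging this, a quick way to see which layout a checkpoint was saved with is to inspect the embedding shards directly. A minimal sketch, assuming the layer_XX-model_YY-model_states.pt naming that DeepSpeed's pipeline module uses for the released checkpoints (the directory path is illustrative):

```python
import glob
import torch

# Illustrative path -- point this at your downloaded checkpoint directory.
ckpt_dir = "./20B_checkpoints/global_step150000"

# layer_00 holds EmbeddingPipe; there is one file per model-parallel rank.
for path in sorted(glob.glob(f"{ckpt_dir}/layer_00-model_*-model_states.pt")):
    state = torch.load(path, map_location="cpu")
    print(path, tuple(state["word_embeddings.weight"].shape))
```

Two files of shape (25216, 6144) would mean the checkpoint was saved with model-parallel size 2 (2 x 25216 = 50432 padded vocab rows), so the loading config needs the same "model-parallel-size". @shaunstoltz's numbers would also fit a degree/padding mismatch: 4 x 12608 = 50432 (the mp=2 padding) while 4 x 12672 = 50688 (the mp=4 padding).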