awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Fine-tuning on different data: size mismatch for embedding_target #1097

Closed AmitMY closed 1 year ago

AmitMY commented 1 year ago

My training procedure is (see the sketch after this list):

  1. Learn a BPE vocab over some large data
  2. Apply BPE on some large data
  3. Prepare data
  4. Train model
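
A minimal sketch of that pipeline, assuming subword-nmt for the BPE steps and Sockeye's standard CLI entry points; all file and directory names are placeholders:

```sh
# 1. Learn a BPE vocabulary on the large training data (subword-nmt assumed)
subword-nmt learn-bpe -s 10000 < train.all > bpe.codes

# 2. Apply the BPE codes to source and target sides
subword-nmt apply-bpe -c bpe.codes < train.src > train.bpe.src
subword-nmt apply-bpe -c bpe.codes < train.trg > train.bpe.trg

# 3. Prepare the data (builds source/target vocabularies from this data)
sockeye-prepare-data \
    --source train.bpe.src --target train.bpe.trg \
    --output prepared_large

# 4. Train the base model
sockeye-train \
    --prepared-data prepared_large \
    --validation-source dev.bpe.src --validation-target dev.bpe.trg \
    --output base_model
```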

My fine-tuning procedure is (see the sketch after this list):

  1. Apply previous BPE on some small data
  2. Prepare data
  3. Train model with --params of the previous model
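
A corresponding sketch of the fine-tuning run, again with placeholder paths; `--params` points at the base model's parameter file (assumed here to be base_model/params.best):

```sh
# 1. Apply the previously learned BPE codes to the small data
subword-nmt apply-bpe -c bpe.codes < small.src > small.bpe.src
subword-nmt apply-bpe -c bpe.codes < small.trg > small.bpe.trg

# 2. Prepare the small data (by default this rebuilds a smaller vocabulary)
sockeye-prepare-data \
    --source small.bpe.src --target small.bpe.trg \
    --output prepared_small

# 3. Continue training from the base model's parameters
sockeye-train \
    --prepared-data prepared_small \
    --validation-source dev.bpe.src --validation-target dev.bpe.trg \
    --params base_model/params.best \
    --output finetuned_model
```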

The large data contains ~800 tokens (mostly single Chinese characters) that do not appear in the small data, so the vocabulary sizes differ (5976 vs. 5192).

Error:

RuntimeError: Error(s) in loading state_dict for SockeyeModel:
    size mismatch for embedding_target.embedding.weight: copying a param with shape torch.Size([5976, 512]) from checkpoint, the shape in current model is torch.Size([5192, 512]).
    size mismatch for output_layer.weight: copying a param with shape torch.Size([5976, 512]) from checkpoint, the shape in current model is torch.Size([5192, 512]).
    size mismatch for output_layer.bias: copying a param with shape torch.Size([5976]) from checkpoint, the shape in current model is torch.Size([5192]).

Is there a way for me to overcome this difference, i.e. keep the embedding layer at 5976 embeddings and only train on the 5192 tokens that occur in my data?

AmitMY commented 1 year ago

Passing --source-vocab and --target-vocab to prepare_data solved the issue: the fine-tuning data is then prepared against the original model's vocabularies instead of a smaller vocabulary rebuilt from the small data, so the parameter shapes match the checkpoint.
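
For reference, a sketch of that call with placeholder paths; the vocabulary file names follow Sockeye's usual model-directory layout (vocab.src.0.json / vocab.trg.0.json) and are assumptions here:

```sh
# Reuse the base model's vocabularies so embedding/output layer shapes match the checkpoint
sockeye-prepare-data \
    --source small.bpe.src --target small.bpe.trg \
    --source-vocab base_model/vocab.src.0.json \
    --target-vocab base_model/vocab.trg.0.json \
    --output prepared_small

sockeye-train \
    --prepared-data prepared_small \
    --validation-source dev.bpe.src --validation-target dev.bpe.trg \
    --params base_model/params.best \
    --output finetuned_model
```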