baichuan-inc / Baichuan2

A series of large language models developed by Baichuan Intelligent Technology
https://huggingface.co/baichuan-inc
Apache License 2.0

How to add new tokens to the vocabulary and only finetune their embeddings? #383

Open CongHan0808 opened 8 months ago

CongHan0808 commented 8 months ago

Following #155, I added 100 new tokens to the vocabulary along with the corresponding embedding rows (a simplified sketch of that step is included after the snippet below). I want to finetune only these new embeddings while keeping the original tokens' pretrained weights fixed. Here is my code:

model_engine.backward(total_loss)
if args.nums_token and args.mulgpu_numtoken and args.token_detach:
    # Build a 0/1 mask over the embedding matrix: rows of the original tokens stay 0
    # (no gradient), rows of the 100 newly added tokens are set to 1.
    textembeds_masks = torch.zeros_like(model_engine.in_adaptor.text_embed.weight).to(device=model_engine.local_rank)
    textembeds_masks[VOCAB_SIZE_SRC:, :] = 1  # only the new token rows receive gradients
    with torch.no_grad():
        for p_name, param in model_engine.named_parameters():
            if "llm_model.base_model.model.model.embed_tokens.weight" in p_name:
                if param.grad is not None:
                    param.grad.copy_(param.grad.data * textembeds_masks)
            if "in_adaptor.text_embed.weight" in p_name:
                if param.grad is not None:
                    param.grad.copy_(param.grad.data * textembeds_masks)
model_engine.step()
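
For reference, the token-adding step itself follows the usual Hugging Face pattern from #155; the sketch below is simplified (the model name and the new token strings are just placeholders, and the exact code in my setup differs):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    "baichuan-inc/Baichuan2-7B-Base", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/Baichuan2-7B-Base", trust_remote_code=True
)

VOCAB_SIZE_SRC = len(tokenizer)                        # size of the original vocabulary
new_tokens = [f"<new_token_{i}>" for i in range(100)]  # 100 placeholder token strings
tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new ids get their own (trainable) rows;
# the first VOCAB_SIZE_SRC rows keep the pretrained weights.
model.resize_token_embeddings(len(tokenizer))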

in_adaptor.text_embed.weight is initialized from llm_model.base_model.model.model.embed_tokens.weight. However, after saving several checkpoints, the original tokens' rows of in_adaptor.text_embed.weight differ from one checkpoint to the next. How should I change my code so that the original tokens' weights stay unchanged and only the new tokens' embeddings are finetuned?
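
My current guess is that zeroing the gradients is not enough: with Adam/AdamW-style optimizers, rows whose gradient is zero can still be moved by weight decay (and by any optimizer state accumulated earlier), which would explain the drift between checkpoints. A sketch of what I am considering, assuming VOCAB_SIZE_SRC is the original vocabulary size and the full parameters are visible on every rank (i.e. not partitioned as in ZeRO stage 3), is to copy the pretrained rows back after every optimizer step:

import torch

# Snapshot the pretrained rows once, before training starts.
with torch.no_grad():
    pretrained_rows = {
        p_name: param.data[:VOCAB_SIZE_SRC].clone()
        for p_name, param in model_engine.named_parameters()
        if "embed_tokens.weight" in p_name or "in_adaptor.text_embed.weight" in p_name
    }

def restore_pretrained_rows(engine):
    # Overwrite the original rows with their saved pretrained values so that
    # weight decay / optimizer state cannot change them between checkpoints.
    with torch.no_grad():
        for p_name, param in engine.named_parameters():
            if p_name in pretrained_rows:
                param.data[:VOCAB_SIZE_SRC].copy_(
                    pretrained_rows[p_name].to(param.device, param.dtype)
                )

# Training step:
model_engine.backward(total_loss)
model_engine.step()
restore_pretrained_rows(model_engine)  # re-pin the pretrained rows after every update

Is this the right direction, or is there a cleaner way, for example keeping the 100 new embeddings in a separate small nn.Embedding and passing only that module to the optimizer?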