dwzhu-pku / LongEmbed

LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)

Linear Interpolation without finetuning #7

Closed. LithurshanK closed this issue 2 days ago.

LithurshanK commented 1 month ago

Hi, does the code repo include the code for linear interpolation of E5 without finetuning?

dwzhu-pku commented 1 month ago

Hi @LithurshanK! The current code repo does not include this part, but you can manually perform linear interpolation as follows:

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("../models/intfloat/e5-base-v2")
tokenizer = AutoTokenizer.from_pretrained("../models/intfloat/e5-base-v2")

original_pos_len = 512   # context length of e5-base-v2
target_pos_len = 8192    # desired context length after interpolation
hidden_size = 768
factor = target_pos_len // original_pos_len  # interpolation factor (16)

original_pos_embeddings = model.embeddings.position_embeddings
new_pos_embeddings = nn.Embedding(target_pos_len, hidden_size)

# Place each original position embedding every `factor` slots,
# and fill the tail beyond the last original position with the last embedding.
for idx in range(original_pos_len):
    new_pos_embeddings.weight.data[idx * factor, :] = original_pos_embeddings.weight.data[idx, :].clone()
new_pos_embeddings.weight.data[(original_pos_len - 1) * factor:, :] = original_pos_embeddings.weight.data[-1, :].clone()

# Linearly interpolate the `factor - 1` slots between each pair of adjacent original embeddings.
for idx in range(original_pos_len - 1):
    for j in range(factor - 1):
        new_pos_embeddings.weight.data[idx * factor + j + 1, :] = (
            original_pos_embeddings.weight.data[idx, :] * (factor - j - 1)
            + original_pos_embeddings.weight.data[idx + 1, :] * (j + 1)
        ).clone() / factor

# Swap in the interpolated table and update the config before saving.
model.config.max_position_embeddings = target_pos_len
model.embeddings.position_embeddings = new_pos_embeddings

model.save_pretrained("../models/dwzhu/e5-base-pi-8k")
tokenizer.save_pretrained("../models/dwzhu/e5-base-pi-8k")
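
Once saved, the extended checkpoint loads like any other model (the position_ids buffer is rebuilt from the updated config on reload). A minimal usage sketch, assuming the output directory above, the standard E5 "query:" prefix, and mask-weighted mean pooling as typically used with E5 models:

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("../models/dwzhu/e5-base-pi-8k")
tokenizer = AutoTokenizer.from_pretrained("../models/dwzhu/e5-base-pi-8k")

# Tokenize with the extended context window (up to 8192 positions).
inputs = tokenizer("query: example long document ...", max_length=8192,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states over valid (non-padding) tokens.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])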