FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License
6.35k stars, 456 forks

teacher score #949

Open drewskidang opened 4 weeks ago

drewskidang commented 4 weeks ago

I have a training dataset with query, pos, and neg fields. Is there a script for scoring the pos and neg passages for knowledge distillation?

malongfei1993 commented 4 weeks ago

Same question, for BGE embedding by the way.

malongfei1993 commented 4 weeks ago

FlagEmbedding\baai_general_embedding\finetune\data.py, line 73:

    def padding_score(self, teacher_score):
        group_size = None
        for scores in teacher_score:
            if scores is not None:
                group_size = len(scores)
                break
        if group_size is None:
            return None

        padding_scores = [100.0] + [0.0] * (group_size - 1)
        new_teacher_score = []
        for scores in teacher_score:
            if scores is None:
                new_teacher_score.append(padding_scores)
            else:
                new_teacher_score.append(scores)
        return new_teacher_score

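
A note on why the padding is `[100.0, 0.0, ...]`. Assuming the teacher scores are turned into soft targets with a softmax (an assumption about the training loss, not something the snippet above shows), a score of 100 for the first entry in the group (presumably the positive) makes the target effectively one-hot. Examples without teacher scores would then fall back to ordinary hard-label training. A minimal pure-Python sketch, with a hypothetical group size:

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

group_size = 4  # hypothetical: one positive plus three negatives
padding_scores = [100.0] + [0.0] * (group_size - 1)

# Soft targets derived from the padded scores (assumed softmax step).
targets = softmax(padding_scores)
print(targets)
```

The first entry of `targets` comes out essentially equal to 1 and the rest are vanishingly small, which is what a plain positive/negative label would give.
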
staoxiao commented 3 weeks ago

You can use bge-reranker-v2 to compute scores for the pos and neg passages, then use the bge-m3 script to fine-tune the model via distillation: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune#2-data-format
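
A minimal sketch of the scoring step, assuming each training record is a dict with `query`, `pos`, and `neg` fields as in the question. The output field names `pos_scores` and `neg_scores` are an assumption and should be checked against the data-format section linked above. `add_teacher_scores` and `dummy_score_fn` are hypothetical helpers; the dummy stands in for a real reranker so the snippet runs without downloading a model.

```python
import json

def add_teacher_scores(record, score_fn):
    """Return a copy of `record` with teacher scores for its pos and neg passages.

    `score_fn` takes a list of [query, passage] pairs and returns one score per pair.
    """
    query = record["query"]
    pairs = [[query, passage] for passage in record["pos"] + record["neg"]]
    scores = [float(s) for s in score_fn(pairs)]
    n_pos = len(record["pos"])
    scored = dict(record)
    scored["pos_scores"] = scores[:n_pos]  # assumed field name
    scored["neg_scores"] = scores[n_pos:]  # assumed field name
    return scored

def dummy_score_fn(pairs):
    # Stand-in teacher (not a real reranker): counts words shared by query and passage.
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]

record = {
    "query": "what is knowledge distillation",
    "pos": ["Knowledge distillation trains a student model on teacher outputs."],
    "neg": ["The weather is sunny today."],
}
scored = add_teacher_scores(record, dummy_score_fn)
print(json.dumps(scored))  # one JSON object per line, as in a JSONL training file
```

With the FlagEmbedding package installed, something like `FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True).compute_score` could be passed as `score_fn` in place of the dummy, since it accepts a list of [query, passage] pairs and returns a list of scores when given several pairs.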