FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

teacher score #949

Open drewskidang opened 4 months ago

drewskidang commented 4 months ago

I have a training dataset with query, pos, and neg. Is there a script that includes knowledge distillation, i.e. teacher scores for the pos and neg passages?

malongfei1993 commented 4 months ago

Same question here, for bge embedding btw.

malongfei1993 commented 4 months ago

FlagEmbedding\baai_general_embedding\finetune\data.py, line 73:

    def padding_score(self, teacher_score):
        # Determine the group size (1 pos + N negs) from the first entry that
        # actually has teacher scores; if none do, there is nothing to distill.
        group_size = None
        for scores in teacher_score:
            if scores is not None:
                group_size = len(scores)
                break
        if group_size is None:
            return None

        # Examples without teacher scores get a placeholder: a large score for
        # the positive, zeros for the negatives.
        padding_scores = [100.0] + [0.0] * (group_size - 1)
        new_teacher_score = []
        for scores in teacher_score:
            if scores is None:
                new_teacher_score.append(padding_scores)
            else:
                new_teacher_score.append(scores)
        return new_teacher_score
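
For reference, a tiny made-up illustration of what this padding does (values are hypothetical, not taken from the repo):

    # The second example has no teacher scores, so it gets the placeholder group.
    teacher_score = [[3.2, 0.1, 0.4], None]
    # padding_score(teacher_score) -> [[3.2, 0.1, 0.4], [100.0, 0.0, 0.0]]
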
staoxiao commented 4 months ago

You can use bge-reranker-v2 to compute scores for the pos and neg passages, and then use the bge-m3 script to fine-tune models via distillation: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/unified_finetune#2-data-format
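
As a minimal sketch (not an official script), the teacher scores could be generated with bge-reranker-v2 and attached to each training example like this. The file names and the pos_scores/neg_scores keys are assumptions; check the data-format page linked above for the exact schema your version expects:

    import json
    from FlagEmbedding import FlagReranker

    # bge-reranker-v2-m3 as the teacher; fp16 to speed up inference.
    reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

    def score_pairs(query, passages):
        scores = reranker.compute_score([[query, p] for p in passages])
        # compute_score returns a bare float for a single pair; normalize to a list.
        return scores if isinstance(scores, list) else [scores]

    # Read {"query": ..., "pos": [...], "neg": [...]} lines and add teacher scores.
    with open('train.jsonl') as fin, open('train_with_scores.jsonl', 'w') as fout:
        for line in fin:
            example = json.loads(line)
            example['pos_scores'] = score_pairs(example['query'], example['pos'])
            example['neg_scores'] = score_pairs(example['query'], example['neg'])
            fout.write(json.dumps(example, ensure_ascii=False) + '\n')

The resulting file can then be used as the training data for the distillation fine-tune described in that link.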

liuslab commented 3 months ago

FlagEmbedding\baai_general_embedding\finetune\data.py, line 73:

    def padding_score(self, teacher_score):
        group_size = None
        for scores in teacher_score:
            if scores is not None:
                group_size = len(scores)
                break
        if group_size is None:
            return None

        padding_scores = [100.0] + [0.0] * (group_size - 1)
        new_teacher_score = []
        for scores in teacher_score:
            if scores is None:
                new_teacher_score.append(padding_scores)
            else:
                new_teacher_score.append(scores)
        return new_teacher_score

This functionality hasn't actually been completed for bge embedding here, has it? I couldn't find any code that follows up on it. I see that m3 does support it.