dleemiller / WordLlama

Things you can do with the token embeddings of an LLM

Matryoshka Representations Evaluation #32


KyleSmith19091 commented 4 hours ago

A bit of a naive question here, but while going through the training code I noticed that when setting up the SentenceTransformerTrainer in reduce_dimension.py you create an evaluator for each Matryoshka dimension, yet for the actual score you only use the score calculated for the first dimension. This seems to imply that the performance of the vector truncated to the first dimension is indicative of the rest and should be used as the evaluation metric.

evaluators = [
    EmbeddingSimilarityEvaluator(
        sentences1=stsb_eval_dataset["sentence1"],
        sentences2=stsb_eval_dataset["sentence2"],
        scores=stsb_eval_dataset["score"],
        main_similarity=SimilarityFunction.COSINE,
        name=f"sts-dev-{dim}",
        truncate_dim=dim,
    )
    for dim in self.config.matryoshka_dims
]
return SequentialEvaluator(
    evaluators, main_score_function=lambda scores: scores[0]
)

My question is two-fold. First, is the assumption that the first dimension's score is indicative of the rest correct? (I'm guessing it is, since you've trained the model successfully.) Second, why set up evaluators for the other dimensions at all if their scores are apparently ignored? My understanding was that you would need to combine the scores for the different dimensions in some way to get the 'true' score, unless the scores of the other dimensions are used internally by the SentenceTransformers library in some way that has an impact.

Apologies if the question is a bit naive; I'm just curious about the reasoning here.

dleemiller commented 4 hours ago

Good questions. I would need to look back at the logs from my training script to determine where the main score comes in, but in truth I could still see the evaluations for all of the dimensions in the logs, so I didn't spend much time thinking about it.

My understanding is that you have to pick a score to be your "main score," or somehow decide how to combine them like you say. The default behavior is to use the last score:

https://github.com/UKPLab/sentence-transformers/blob/78553270abc74f44c1504db0e29f79591af6b697/sentence_transformers/evaluation/SequentialEvaluator.py#L25
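
For reference, a minimal sketch contrasting the library default with the explicit choice above, assuming the same evaluators list from the earlier snippet (this is illustrative, not a verbatim quote of either codebase):

    from sentence_transformers.evaluation import SequentialEvaluator

    # Library default: the last evaluator's score becomes the main score.
    default_eval = SequentialEvaluator(evaluators)  # main_score_function=lambda scores: scores[-1]

    # Choice in reduce_dimension.py: report the first dimension's score instead.
    first_dim_eval = SequentialEvaluator(evaluators, main_score_function=lambda scores: scores[0])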

I was checkpointing frequently and not relying on the evaluator to decide when to select the checkpoint, but if you wanted to train and be a little more hands-off, you'd probably want to think about which main score would be best for driving that selection.

The only place where I think it comes into play is checkpoint selection, so the purpose of the main score is simply to give the trainer a single number for deciding which checkpoint to keep.
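
If you did want the evaluator to reflect every dimension, one option (just a sketch of the idea, not something the repo does) would be to average the per-dimension scores and use that as the main score:

    def mean_main_score(scores):
        # Average the per-dimension STS scores so checkpoint selection
        # reflects all Matryoshka dimensions rather than a single one.
        return sum(scores) / len(scores)

    combined_eval = SequentialEvaluator(evaluators, main_score_function=mean_main_score)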