Open · KennethEnevoldsen opened 1 year ago
Not sure I understand why the current pair classification task does not work for Entailment? 🧐
embed the two texts independently and then classify them.
Isn't that essentially what it's doing? It embeds them independently and then uses a similarity metric to classify whether they are close.
Right. Sorry for being unclear. The main point was that it was binary, while entailment is non-binary.
But aren't the labels for your entailment dataset binary, i.e. entailed or not entailed? Or does it have three labels?
If the labels are binary then finding the best threshold like PairClassification does should be fine, I think.
It is common to have three outcomes. Contradiction, entailment and neutral (it neither contradicts nor entails).
I think if we add an NLI dataset the evaluation should be just like prior work (e.g. like in SBERT). IIUC it's the same as the STS evaluation, i.e. you have three labels that have an ordering (0=negative, 1=neutral, 2=entailment), then embed the sentences, compute cosine similarity & use correlation metrics to score (just reuse the STS Evaluator) (cc @nreimers)
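For reference, a rough sketch of that STS-style evaluation (not MTEB's actual STSEvaluator; the model name and example pairs below are just placeholders):

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 0 = contradiction, 1 = neutral, 2 = entailment, treated as an ordered scale
pairs = [
    ("A man is playing guitar.", "A person plays an instrument.", 2),
    ("A man is playing guitar.", "A dog jumps over a fence.", 1),
    ("A man is playing guitar.", "Nobody is playing guitar.", 0),
]

emb1 = model.encode([p[0] for p in pairs])
emb2 = model.encode([p[1] for p in pairs])
labels = np.array([p[2] for p in pairs])

# cosine similarity of each pair, correlated with the ordered labels
cos_sim = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)
corr, _ = spearmanr(cos_sim, labels)  # Spearman correlation, as in the STS tasks
print(corr)
```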
I wouldn't say that there is an ordering. E.g. neutral can be something else entirely (Cat eating a fish -/-> Dog jumps over fence).
It seems like they use a concatenation of the sentence vectors along with a distance metric. The distance does seem to meaningfully change performance (Table 6).
I think one way we could make these tasks a bit more nuanced and natural is by using logistic regression over:
a) The two embeddings concatenated
b) The difference vector of the two embeddings
c) Concatenate the two embeddings and their absolute difference - This is the approach used by SentenceTransformers for training on NLI datasets
We might not want to use all of the features, as it would make the tasks more difficult and we might have to encode more training examples. We could still use distances in a way that also accounts for neutral examples. Currently we have some heuristics for choosing the best thresholds, but these are only good for binary classification. I think by using a tree-based model (decision tree, random forest, etc.) we could obtain these thresholds implicitly and still utilize neutral examples.
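To make this concrete, here's a minimal sketch of option (c) with scikit-learn (the model name and toy NLI pairs are placeholders; a tree-based classifier could be swapped in for the logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy NLI-style pairs; labels: 0 = contradiction, 1 = neutral, 2 = entailment
train = [
    ("A man inspects a uniform.", "The man is sleeping.", 0),
    ("An older and a younger man are smiling.", "Two men are laughing at cats.", 1),
    ("A soccer game with multiple males playing.", "Some men are playing a sport.", 2),
]

def pair_features(sent1: str, sent2: str) -> np.ndarray:
    u, v = model.encode(sent1), model.encode(sent2)
    return np.concatenate([u, v, np.abs(u - v)])  # option (c): [u; v; |u - v|]

X = np.stack([pair_features(s1, s2) for s1, s2, _ in train])
y = np.array([label for *_, label in train])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(pair_features("A dog runs.", "An animal is moving.").reshape(1, -1)))
```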
@KennethEnevoldsen suggested to go with Option 1, and I think it's a good idea. @Muennighoff green light?
I agree with Kenneth that neutral is a third thing entirely. I think even if we choose to go with something distance-based we could potentially account for this with Option 2. I would love to get more opinions on this.
Isaac also agreed with 1 in the other issue.
Sounds good to me then! Let's just make sure that existing PC scores remain reproducible & unchanged?
hmm, I guess we should write a new abstask then and add new versions of tasks? Btw, would it not make sense to have some sort of versioning on Abstasks as well? Considering that this is at least the second time (ClusteringFast) we're doing something like this, we should think about how to do it in a smart way.
@x-tabdeveloping yeah, then we should rename the old one (we should do this for clustering as well). I would probably do:
ClusteringTask -> RedactedClusteringTask
ClusteringTaskFast -> ClusteringTask
as we don't want people to accidentally use the outdated version.
Maybe instead of RedactedClusteringTask, ClusteringTaskV1 or something? In case we retire a task twice, the next one can be called V2, etc. Also I think we should only do this if the new task is better in all aspects; if there are reasons people might want to use the previous one (e.g. more accurate scores), then maybe we should rather name it e.g. ClusteringTaskSlow 🤔
I am fine with either. We could also implement a test that fails if there is a new task that uses the "old" version.
@KennethEnevoldsen What's the status on this? Would we need to hardcode a test like that? I'm also thinking it would probably be a good idea to add superseeded_by to AbsTasks as well; that way we could have a meaningful way of expressing this.
Potentially also deprecation warnings could be something we want?
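Something like this hypothetical sketch is what I have in mind (the class and wiring are purely illustrative, not mteb's actual implementation):

```python
import warnings
from typing import Optional


class AbsTaskExample:
    """Illustrative stand-in for an AbsTask; not mteb's real base class."""

    superseeded_by: Optional[str] = None  # e.g. the name of the newer task version

    def evaluate(self, model, split: str = "test"):
        # warn (or fail) if a retired task is run directly
        if self.superseeded_by is not None:
            warnings.warn(
                f"{type(self).__name__} is superseded by {self.superseeded_by}; "
                "consider running the newer version instead.",
                DeprecationWarning,
            )
        ...  # actual evaluation logic would go here
```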
I believe you currently have a PR on extended pair classification. We already raise a warning when running on a superseded dataset, I believe.
I would probably not add superseeded_by to the abstasks (unless they are superseded in all cases, in which case they already raise an error). Then I would have tests that fail (a warning can be ignored; a failing test can't).
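Roughly something like the following; note that all_task_classes and AbsTaskClusteringLegacy are placeholder names for illustration, not the actual mteb API:

```python
def test_no_task_uses_deprecated_abstask():
    # Fail the suite if any registered task still builds on a retired AbsTask.
    offenders = [
        cls.__name__
        for cls in all_task_classes()  # placeholder: some way of enumerating task classes
        if issubclass(cls, AbsTaskClusteringLegacy)  # placeholder: the retired base class
    ]
    assert not offenders, f"Tasks still built on the deprecated AbsTask: {offenders}"
```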
It seems (from the paper) that the current setup for pair classification relies upon computing similarity metrics for the pairs' embeddings and then deciding on a binary threshold. Naturally this does not extend to tasks with more than two outcomes. However, for tasks like entailment, I imagine the ideal would be to embed the two texts independently and then classify them.
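For context, a minimal sketch of the threshold-based binary setup I mean (not the actual MTEB evaluator; the similarities and labels are toy values):

```python
import numpy as np

def best_threshold_accuracy(similarities: np.ndarray, labels: np.ndarray) -> tuple:
    """Return (accuracy, threshold) of the best similarity cut-off for binary labels."""
    best_acc, best_t = 0.0, 0.0
    for t in np.unique(similarities):
        acc = float(np.mean((similarities >= t) == labels.astype(bool)))
        if acc > best_acc:
            best_acc, best_t = acc, float(t)
    return best_acc, best_t

sims = np.array([0.9, 0.2, 0.75, 0.4])  # cosine similarities of embedded pairs
labels = np.array([1, 0, 1, 0])         # 1 = positive pair, 0 = negative pair
print(best_threshold_accuracy(sims, labels))
# a single cut-off like this only makes sense for two outcomes
```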
Should I instead formulate entailment as a new task, or would it be better to change the current pair classification task? (Naturally, you could just concatenate the two sentences and make it a classification task.)