Open · KennethEnevoldsen opened 1 year ago
Not sure I understand why the current pair classification task does not work for Entailment? 🧐
embed the two texts independently and then classify them.
Isn't that essentially what it's doing? It embeds them independently and then uses a similarity metric to classify whether they are close.
Right. Sorry for being unclear. The main point was that it was binary, while entailment is non-binary.
But aren't the labels for your entailment dataset binary, i.e. entailed or not entailed? Or does it have three labels?
If the labels are binary then finding the best threshold like PairClassification does should be fine, I think.
It is common to have three outcomes. Contradiction, entailment and neutral (it neither contradicts nor entails).
I think if we add an NLI dataset the evaluation should be just like prior work (e.g. like in SBERT). IIUC it's the same as the STS evaluation, i.e. you have three labels that have an ordering (0=negative, 1=neutral, 2=entailment), then embed the sentences, compute cosine similarity & use correlation metrics to score (just reuse the STS Evaluator) (cc @nreimers)
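For reference, a rough sketch of that STS-style evaluation (not MTEB's actual STSEvaluator; the model name and example pairs below are just placeholders):

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# 0 = contradiction, 1 = neutral, 2 = entailment, treated as an ordered scale
pairs = [
    ("A man is playing guitar.", "A person plays an instrument.", 2),
    ("A man is playing guitar.", "A dog jumps over a fence.", 1),
    ("A man is playing guitar.", "Nobody is playing guitar.", 0),
]

emb1 = model.encode([p[0] for p in pairs])
emb2 = model.encode([p[1] for p in pairs])
labels = np.array([p[2] for p in pairs])

# cosine similarity of each pair, correlated with the ordered labels
cos_sim = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)
corr, _ = spearmanr(cos_sim, labels)  # Spearman correlation, as in the STS tasks
print(corr)
```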
I wouldn't say that there is an ordering. E.g. neutral can be something else entirely (Cat eating a fish -/-> Dog jumps over fence).
It seems like they use a concatenation of the sentence vectors along with a distance metric. The distance does seem to meaningfully change performance (Table 6).
I think one way we could make these tasks a bit more nuanced and natural is by using logistic regression over:
a) The two embeddings concatenated
b) The difference vector of the two embeddings
c) Concatenate the two embeddings and their absolute difference - This is the approach used by SentenceTransformers for training on NLI datasets
We might not want to use all of the features, as it would make the tasks more difficult and we might have to encode more training examples. We could still use distances in a way that also accounts for neutral examples. Currently we have some heuristics for choosing the best thresholds, but these are only good for binary classification. I think by using a tree-based model (decision tree, random forest, etc.) we could obtain these thresholds implicitly and still utilize neutral examples.
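To make this concrete, here's a minimal sketch of option (c) with scikit-learn (the model name and toy NLI pairs are placeholders; a tree-based classifier could be swapped in for the logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy NLI-style pairs; labels: 0 = contradiction, 1 = neutral, 2 = entailment
train = [
    ("A man inspects a uniform.", "The man is sleeping.", 0),
    ("An older and a younger man are smiling.", "Two men are laughing at cats.", 1),
    ("A soccer game with multiple males playing.", "Some men are playing a sport.", 2),
]

def pair_features(sent1: str, sent2: str) -> np.ndarray:
    u, v = model.encode(sent1), model.encode(sent2)
    return np.concatenate([u, v, np.abs(u - v)])  # option (c): [u; v; |u - v|]

X = np.stack([pair_features(s1, s2) for s1, s2, _ in train])
y = np.array([label for *_, label in train])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(pair_features("A dog runs.", "An animal is moving.").reshape(1, -1)))
```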
@KennethEnevoldsen suggested to go with Option 1, and I think it's a good idea. @Muennighoff green light?
I agree with Kenneth that neutral is a third thing entirely. I think even if we choose to go with something distance-based we could potentially account for this with Option 2. I would love to get more opinions on this.
Isaac also agreed with 1 in the other issue.
Sounds good to me then! Let's just make sure that existing PC scores remain reproducible & unchanged?
hmm, I guess we should write a new abstask then and add new versions of tasks? Btw, would it not make sense to have some sort of versioning on Abstasks as well? Considering that this is at least the second time (ClusteringFast) we're doing something like this, we should think about how to do it in a smart way.
@x-tabdeveloping yeah, then we should rename the old one (we should do this for clustering as well). I would probably do:
ClusteringTask -> RedactedClusteringTask
ClusteringTaskFast -> ClusteringTask
as we don't want people to accidentally use the outdated version.
Maybe instead of RedactedClusteringTask, ClusteringTaskV1 or something? In case we retire a task twice, the next one can be called V2, etc. Also I think we should only do this if the new task is better in all aspects; if there are reasons people might want to use the previous one (e.g. more accurate scores), then maybe we should rather name it e.g. ClusteringTaskSlow 🤔
I am fine with either. We could also implement a test that fails if there is a new task that uses the "old" version.
@KennethEnevoldsen What's the status on this? Would we need to hardcode a test like that? I'm also thinking it would probably be a good idea to add superseeded_by to AbsTasks as well; that way we could have a meaningful way of expressing this.
Potentially also deprecation warnings could be something we want?
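Something like this hypothetical sketch is what I have in mind (the class and wiring are purely illustrative, not mteb's actual implementation):

```python
import warnings
from typing import Optional


class AbsTaskExample:
    """Illustrative stand-in for an AbsTask; not mteb's real base class."""

    superseeded_by: Optional[str] = None  # e.g. the name of the newer task version

    def evaluate(self, model, split: str = "test"):
        # warn (or fail) if a retired task is run directly
        if self.superseeded_by is not None:
            warnings.warn(
                f"{type(self).__name__} is superseded by {self.superseeded_by}; "
                "consider running the newer version instead.",
                DeprecationWarning,
            )
        ...  # actual evaluation logic would go here
```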
I believe you currently have a PR on extended pair classification. We already raise a warning when running on a superseded dataset, I believe.
I would probably not add superseeded_by to the abstasks (unless they are superseded in all cases, in which case they already raise an error). Then I would have tests that fail (a warning can be ignored; a failing test can't).
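Roughly something like the following; note that all_task_classes and AbsTaskClusteringLegacy are placeholder names for illustration, not the actual mteb API:

```python
def test_no_task_uses_deprecated_abstask():
    # Fail the suite if any registered task still builds on a retired AbsTask.
    offenders = [
        cls.__name__
        for cls in all_task_classes()  # placeholder: some way of enumerating task classes
        if issubclass(cls, AbsTaskClusteringLegacy)  # placeholder: the retired base class
    ]
    assert not offenders, f"Tasks still built on the deprecated AbsTask: {offenders}"
```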
It seems (from the paper) that the current setup for pair classification relies upon computing similarity metrics for the pairs' embeddings and then deciding on a binary threshold. Naturally this does not extend to tasks with more than two outcomes. However, for tasks like entailment, I imagine the ideal would be to embed the two texts independently and then classify them.
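For context, a minimal sketch of the threshold-based binary setup I mean (not the actual MTEB evaluator; the similarities and labels are toy values):

```python
import numpy as np

def best_threshold_accuracy(similarities: np.ndarray, labels: np.ndarray) -> tuple:
    """Return (accuracy, threshold) of the best similarity cut-off for binary labels."""
    best_acc, best_t = 0.0, 0.0
    for t in np.unique(similarities):
        acc = float(np.mean((similarities >= t) == labels.astype(bool)))
        if acc > best_acc:
            best_acc, best_t = acc, float(t)
    return best_acc, best_t

sims = np.array([0.9, 0.2, 0.75, 0.4])  # cosine similarities of embedded pairs
labels = np.array([1, 0, 1, 0])         # 1 = positive pair, 0 = negative pair
print(best_threshold_accuracy(sims, labels))
# a single cut-off like this only makes sense for two outcomes
```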
Should I instead formulate entailment as a new task, or would it be better to change the current pair classification task? (Naturally, you could just concatenate the two sentences and make it a classification task.)