question about nli on test set

Jyun1998 commented 3 years ago

@nreimers

Hope you are doing well and having a safe Christmas.

I'm a student excited about using sentence-transformers; it's an amazing library.

I'm currently working on a project to detect fake news. Each record has a title, a context, and an is_fake (bool) label.

Since I could not get above 98% using only sequence classification on the context embedding and label, I would like to add features based on title-context NLI.

https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli.py

I have a question about the script above.

After training the model, the code below runs the benchmark test on a benchmark dataset.

How can I change this to evaluate on my own test data, which has title and context fields, and predict is_fake?

nreimers commented 3 years ago

Hi @Jyun1998, this sounds like a classification task?

In that case, cross encoders are the right choice: https://www.sbert.net/examples/training/cross-encoder/README.html https://www.sbert.net/examples/applications/cross-encoder/README.html

By the way, 98% is extremely high for an NLP task.
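As a rough illustration of the cross-encoder route for (title, context) pairs, a training call could look like the sketch below. The base checkpoint `xlm-roberta-base`, the helper names `warmup_steps` and `train_cross_encoder`, and the 10% warmup ratio are assumptions made for this example, not something prescribed in this thread. The training examples are expected to be `InputExample` objects with `texts=[title, context]` and a 0/1 label.

```python
def warmup_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Roughly 10% of all optimizer steps, rounded up (hypothetical helper)."""
    steps_per_epoch = -(-n_examples // batch_size)  # ceiling division
    total_steps = steps_per_epoch * epochs
    return (total_steps + 9) // 10


def train_cross_encoder(train_examples, evaluator, output_path,
                        base_model="xlm-roberta-base",  # assumed multilingual base
                        epochs=1, batch_size=16):
    """Fine-tune a single-logit cross-encoder on (title, context) pairs."""
    # Imported lazily so the pure-Python helper above works without these packages.
    from torch.utils.data import DataLoader
    from sentence_transformers.cross_encoder import CrossEncoder

    # num_labels=1 gives one score per pair, which suits a binary label.
    model = CrossEncoder(base_model, num_labels=1)
    loader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
    model.fit(
        train_dataloader=loader,
        evaluator=evaluator,
        epochs=epochs,
        warmup_steps=warmup_steps(len(train_examples), batch_size, epochs),
        output_path=output_path,
    )
    return model
```

Passing an evaluator built from a held-out dev set lets `fit` keep the best checkpoint in `output_path`.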

Jyun1998 commented 3 years ago

@nreimers

Well noted, thanks :)

It looks similar to Quora QA + spam detection.

I have a few questions about the cross-encoder example:

Since it's binary classification on sentence pairs, should I use CEBinaryClassificationEvaluator as the evaluator?
Is 'paraphrase-xlm-r-multilingual-v1' the strongest multilingual model (for Korean)?
For the dev_sample used in the evaluator, can I use sklearn.model_selection.train_test_split to make a validation set from part of the train set?
```
evaluator = CEBinaryClassificationEvaluator.from_input_examples(dev_sample)
```
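A minimal sketch of that split, assuming the training data is a list of dicts with `title`, `context`, and `is_fake` keys. The helper names `rows_to_triples` and `split_and_build_evaluator`, the 10% dev fraction, and the stratified split are choices made for this example only.

```python
def rows_to_triples(rows):
    """Turn dict rows into (title, context, label) tuples with an int label."""
    return [(r["title"], r["context"], int(r["is_fake"])) for r in rows]


def split_and_build_evaluator(rows, dev_fraction=0.1, seed=42):
    """Hold out part of the train set and wrap it in a binary evaluator."""
    # Imported lazily so rows_to_triples can be used without these packages.
    from sklearn.model_selection import train_test_split
    from sentence_transformers import InputExample
    from sentence_transformers.cross_encoder.evaluation import (
        CEBinaryClassificationEvaluator,
    )

    triples = rows_to_triples(rows)
    labels = [label for _, _, label in triples]
    # stratify keeps the fake/real ratio similar in both parts.
    train, dev = train_test_split(
        triples, test_size=dev_fraction, random_state=seed, stratify=labels
    )

    def to_examples(part):
        return [InputExample(texts=[t, c], label=y) for t, c, y in part]

    evaluator = CEBinaryClassificationEvaluator.from_input_examples(
        to_examples(dev), name="dev"
    )
    return to_examples(train), evaluator
```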

I believe the code below benchmarks model performance using test_samples. However, my test samples have no labels, and I want to predict the labels for them.

```
test_evaluator = CEBinaryClassificationEvaluator.from_input_examples(test_samples, batch_size=train_batch_size)
test_evaluator(model, output_path=model_save_path)
```

Will this work?

```
submission['label'] = model.predict(test_samples, verbose=1)
submission.to_csv('submission.csv', index=False)
```

nreimers commented 3 years ago

Yes
That model was tuned as a bi-encoder to produce embeddings, not as a cross-encoder.
Yes
Yes
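
For the prediction step on unlabeled test data, a hedged sketch: to my knowledge `CrossEncoder.predict` takes raw text pairs (not `InputExample` objects) and shows progress through `show_progress_bar` rather than a `verbose` argument. The helper names and the 0.5 threshold are assumptions for this example. The threshold presumes a model trained with `num_labels=1`, whose scores fall between 0 and 1.

```python
def make_pairs(titles, contexts):
    """Pair each title with its context as (title, context) tuples."""
    if len(titles) != len(contexts):
        raise ValueError("titles and contexts must have the same length")
    return list(zip(titles, contexts))


def scores_to_labels(scores, threshold=0.5):
    """Map scores to 0/1 labels; the 0.5 threshold is an assumption."""
    return [int(score >= threshold) for score in scores]


def predict_labels(model_path, titles, contexts, threshold=0.5):
    """Load a saved cross-encoder and label unlabeled (title, context) pairs."""
    # Imported lazily so the helpers above work without the package installed.
    from sentence_transformers.cross_encoder import CrossEncoder

    model = CrossEncoder(model_path)
    scores = model.predict(make_pairs(titles, contexts), show_progress_bar=True)
    return scores_to_labels(scores, threshold)
```

The returned list can then be assigned to the `label` column of the submission DataFrame before calling `to_csv`.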

question about nli on test set #648