ZurichNLP / nmtscore

A library of translation-based text similarity measures
MIT License
25 stars, 6 forks

How to re-run test script of nmtscore #6

Closed: longmai-pinbus closed this issue 1 year ago

longmai-pinbus commented 1 year ago

Hi @jvamvas

I'm doing research on cross-lingual plagiarism identification and found your tool very useful. I tried to run your test script, following the steps in README.md. [image]

When I ran the script with "prism", I ran into this error: [image]

I then switched the model from prism to small100 and ran the script again, but after waiting more than 12 hours nothing had happened.

Could you tell me the exact way to re-run the script, and which dataset format I should use if I want to test cross-lingual plagiarism identification between English and Vietnamese?

Thanks,

jvamvas commented 1 year ago

Hi @longmai-pinbus, the script is for reproducing Table 2 in the paper. Yes, it takes some time and memory.

To evaluate cross-lingual plagiarism identification between English and Vietnamese, I would recommend creating your own evaluation script.

Maybe along the lines of:

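from nmtscore import NMTScorer

scorer = NMTScorer()
num_correct = 0
# my_dataset: your own list of (English sentence, Vietnamese sentence, gold label) triples
# my_threshold: a similarity cutoff that you choose yourself
for sentence_en, sentence_vi, gold_label in my_dataset:
    similarity_score = scorer.score(sentence_en, sentence_vi)
    # A pair counts as correct if the thresholded score agrees with the gold label
    num_correct += (similarity_score > my_threshold) == gold_label
accuracy = num_correct / len(my_dataset)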
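
For anyone adapting the loop above, here is a slightly fuller sketch of how such an evaluation script might look, including one way to pick the threshold. It is an illustration under assumptions, not part of nmtscore: the helper names (`score_pairs`, `accuracy_at`, `best_threshold`) and the `toy_score_fn` stand-in are hypothetical, and the dataset is assumed to be a list of (English sentence, Vietnamese sentence, boolean label) triples. The scoring function is passed in as a parameter, so `NMTScorer().score` from the snippet above should be usable in place of the toy scorer.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed dataset format: (English sentence, Vietnamese sentence, is_plagiarism)
Example = Tuple[str, str, bool]


def score_pairs(examples: Sequence[Example],
                score_fn: Callable[[str, str], float]) -> List[float]:
    """Score every sentence pair once so thresholds can be tried cheaply afterwards."""
    return [score_fn(en, vi) for en, vi, _ in examples]


def accuracy_at(scores: Sequence[float], labels: Sequence[bool],
                threshold: float) -> float:
    """Fraction of pairs where (score > threshold) agrees with the gold label."""
    if not labels:
        raise ValueError("need at least one labelled pair")
    correct = sum((s > threshold) == y for s, y in zip(scores, labels))
    return correct / len(labels)


def best_threshold(scores: Sequence[float],
                   labels: Sequence[bool]) -> Tuple[float, float]:
    """Try each observed score (plus one value below the minimum) as a cutoff
    and return the (threshold, accuracy) pair with the highest accuracy."""
    candidates = [min(scores) - 1.0] + sorted(set(scores))
    return max(((t, accuracy_at(scores, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


def toy_score_fn(a: str, b: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercased tokens.
    Replace with a real similarity measure, e.g. NMTScorer().score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

To avoid an overly optimistic accuracy, the threshold would normally be chosen with `best_threshold` on a development split and then applied unchanged to a separate test split via `accuracy_at`.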

jvamvas commented 1 year ago

Closing this, feel free to reopen