artidoro / frank

FRANK: Factuality Evaluation Benchmark
MIT License
52 stars 4 forks

FactCC Scores - can't replicate scores #4

Open Lukecn1 opened 3 years ago

Lukecn1 commented 3 years ago

Hi Again :)

I was checking my own implementation of the FactCC scoring you described in the paper against your data, and noticed that we derived different scores for 90 cases.

I suspect this is due to differences in how we split summaries into individual sentences prior to classification and scoring.

How did you split summary sentences for FactCC scoring?

(I use NLTK's sent_tokenize function.)
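
For reference, a minimal sketch of that NLTK call (the example summary is hypothetical, not taken from the FRANK data; depending on your NLTK version you may need to download the Punkt models first):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the Punkt sentence tokenizer models

# Hypothetical summary text, chosen to include an abbreviation that
# can trip up sentence splitters.
summary = "The company reported record profits. Its CEO announced a merger in Jan. 2020."

sentences = sent_tokenize(summary)
print(sentences)
```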

artidoro commented 3 years ago

Hello, I think you are right: this is probably due to differences in how we split sentences. I used spaCy's sentence segmentation: https://spacy.io/usage/linguistic-features#sbd
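
For comparison, a minimal sketch of spaCy's segmentation as described in those docs (assuming the `en_core_web_sm` model, which is an assumption on my part, installed via `python -m spacy download en_core_web_sm`):

```python
import spacy

# Load a small English pipeline; sentence boundaries come from its parser.
nlp = spacy.load("en_core_web_sm")

# Same hypothetical summary as in the NLTK sketch above.
doc = nlp("The company reported record profits. Its CEO announced a merger in Jan. 2020.")

sentences = [sent.text for sent in doc.sents]
print(sentences)
```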

Let me know if the results are significantly different and we can investigate further; otherwise we can close the issue.

Lukecn1 commented 3 years ago

The differences I have found are only in 90 cases, so it's not a massive discrepancy to begin with. But thanks for the reply, I'll test it using spaCy and get back to you :)

Lukecn1 commented 3 years ago

Using spaCy I get 30 more differences in scores than with NLTK. Do you do any other preprocessing of the data before scoring?