How is Rationale-F1 calculated?

The paper states, "we compute the F1 score over candidate sentences in a scientific paper against the reference rationales. We denote this metric as Rationale-F1." I take this to mean a sentence-level F1 score. However, in the data under "argscichat_allennlp/argscichat_train_dev," the sentence boundaries in "content" and "facts" do not match.

For example, in fold_0_test.json, there is a fact: "We run a battery of supervised machine learning models for automatically detecting parody tweets," but the corresponding sentence in "content" is "We run a battery of supervised machine learning models for automatically detecting parody tweets with an emphasis on robustness by testing on tweets from accounts unseen in training, across different genders and across countries."

Why is there this inconsistency? Given the mismatch, how are the TF-IDF and S-BERT baselines mentioned in the paper implemented? Are their inputs the sentences from "content"? If so, it seems impossible to calculate Rationale-F1 correctly against the ground-truth sentences provided in "facts," because a predicted "content" sentence would never exactly match a "facts" entry with different boundaries.
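To make the problem concrete, here is a minimal sketch of what I assume sentence-level F1 looks like: a set-based F1 over exact sentence strings. This is my guess, not the authors' implementation, and the function name `sentence_f1` is hypothetical. The two strings are the example from fold_0_test.json quoted above.

```python
def sentence_f1(predicted, reference):
    """Set-based F1 over exact sentence strings (an assumed definition)."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Sentence as it appears in "content".
content_sentence = (
    "We run a battery of supervised machine learning models for "
    "automatically detecting parody tweets with an emphasis on robustness "
    "by testing on tweets from accounts unseen in training, across "
    "different genders and across countries."
)

# Rationale as it appears in "facts" (a truncated span of the sentence).
fact = (
    "We run a battery of supervised machine learning models for "
    "automatically detecting parody tweets"
)

# A baseline that selects the right "content" sentence still gets no credit
# under exact matching, even though the fact is contained in that sentence.
exact_score = sentence_f1([content_sentence], [fact])
fact_is_contained = fact in content_sentence

print(exact_score, fact_is_contained)
```

Under this assumed definition, a retriever that ranks "content" sentences would be penalized even when it picks the sentence that contains the rationale, which is why I am asking how the matching is actually done.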

I would greatly appreciate it if you could answer my questions above! Thanks!

UKPLab / acl2023-argscichat