centre-for-humanities-computing / odyCy

A general-purpose NLP pipeline for Ancient Greek
https://centre-for-humanities-computing.github.io/odyCy/
MIT License

check for duplicates in the treebanks #8

Closed jankounchained closed 1 year ago

jankounchained commented 1 year ago

e.g. Herodotus is in both treebanks.

In the worst-case scenario, we would have Herodotus in both train.spacy and test.spacy. A report generated in #7 reveals there are no overlaps between the train and dev subsets (if I understand it right).

x-tabdeveloping commented 1 year ago

I collected source tags from all conll comments and then created a matrix of set intersections between the sources in the dev, train and test sets.

| | train | dev | test |
|---|---|---|---|
| train | {Histories, Book 1, chapter 107\n, Histories, ...} | {} | {} |
| dev | {} | {Histories, Book 5, chapter 110\n, Histories, ...} | {} |
| test | {} | {} | {Histories, Book 1, chapter 21\n, Histories, B...} |

Since only the diagonal is populated, we can conclude that there is no overlap between any of the datasets.
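
For anyone who wants to repeat this kind of check, here is a minimal sketch of the idea, not the exact script used for the matrix above. It assumes the source tags live in comment lines of the form `# source = ...`; the key name and the helper names (`collect_sources`, `intersection_matrix`, `overlapping_pairs`) are hypothetical, and the three toy strings stand in for the real split files.

```python
from itertools import product


def collect_sources(conllu_text, key="source"):
    """Return the set of values found in `# key = value` comment lines."""
    sources = set()
    for line in conllu_text.splitlines():
        if not line.startswith("#"):
            continue  # token lines and blank lines carry no source tag
        name, sep, value = line[1:].partition("=")
        if sep and name.strip() == key:
            sources.add(value.strip())
    return sources


def intersection_matrix(splits):
    """Map every (split_a, split_b) pair to the set of sources they share."""
    return {(a, b): splits[a] & splits[b] for a, b in product(splits, repeat=2)}


def overlapping_pairs(matrix):
    """Keep only the off-diagonal cells that are non-empty, i.e. real leaks."""
    return {
        pair: shared
        for pair, shared in matrix.items()
        if pair[0] != pair[1] and shared
    }


# Toy stand-ins for the three split files (assumed format, made-up content).
train_text = "# source = Histories, Book 1, chapter 107\n# text = ...\n"
dev_text = "# source = Histories, Book 5, chapter 110\n# text = ...\n"
test_text = "# source = Histories, Book 1, chapter 21\n# text = ...\n"

splits = {
    "train": collect_sources(train_text),
    "dev": collect_sources(dev_text),
    "test": collect_sources(test_text),
}
matrix = intersection_matrix(splits)
leaks = overlapping_pairs(matrix)  # empty when only the diagonal is populated
```

Note that this compares exact source strings, so it detects overlap at the granularity of the tag (here, a chapter), not at the level of a whole work.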

jankounchained commented 1 year ago

Even with the same book split so that it appears in all 3 folds... is spaCy that smart, or are we lucky?