check for duplicates in the treebanks

centre-for-humanities-computing / odyCy

A general-purpose NLP pipeline for Ancient Greek

https://centre-for-humanities-computing.github.io/odyCy/

MIT License


check for duplicates in the treebanks #8

Closed jankounchained closed 1 year ago

jankounchained commented 1 year ago

e.g. Herodotus is in both treebanks.

In the worst-case scenario, we would have Herodotus in both train.spacy and test.spacy. A report generated in #7 reveals there are no overlaps between the train and dev subsets (if I understand it right).

x-tabdeveloping commented 1 year ago

I collected source tags from all CoNLL comments and then created a matrix of set intersections between the sources in the dev, train and test sets.

	train	dev	test
train	{Histories, Book 1, chapter 107\n, Histories, ...}	{}	{}
dev	{}	{Histories, Book 5, chapter 110\n, Histories, ...}	{}
test	{}	{}	{Histories, Book 1, chapter 21\n, Histories, B...}

Since only the diagonal is populated, we can conclude that there is no overlap between any of the datasets.
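
For anyone who wants to reproduce this kind of check, here is a minimal sketch. It assumes each sentence carries a comment line of the form `# source = ...`; that key name and the helpers `collect_sources`, `overlap_matrix` and `has_leak` are hypothetical and not taken from the odyCy code, and the toy inputs only reuse the three chapter names shown above.

```python
from itertools import product


def collect_sources(conll_text, key="source"):
    """Collect the values of comment lines shaped like '# <key> = <value>'.

    The 'source' key is an assumption; adjust it to whatever the treebank uses.
    """
    sources = set()
    for line in conll_text.splitlines():
        if not line.startswith("#"):
            continue  # token lines and blank lines carry no source tag
        name, sep, value = line[1:].partition("=")
        if sep and name.strip() == key:
            sources.add(value.strip())
    return sources


def overlap_matrix(splits):
    """Map every ordered pair of split names to the intersection of their sources."""
    return {(a, b): splits[a] & splits[b] for a, b in product(splits, repeat=2)}


def has_leak(matrix):
    """True if any off-diagonal cell (two different splits) is non-empty."""
    return any(shared for (a, b), shared in matrix.items() if a != b)


# Toy stand-ins for the three files (hypothetical content).
texts = {
    "train": "# source = Histories, Book 1, chapter 107\n1\tword\t_\n",
    "dev": "# source = Histories, Book 5, chapter 110\n1\tword\t_\n",
    "test": "# source = Histories, Book 1, chapter 21\n1\tword\t_\n",
}
splits = {name: collect_sources(text) for name, text in texts.items()}
matrix = overlap_matrix(splits)
leak_found = has_leak(matrix)
```

Note that this only compares source tags as strings, so the same passage labelled differently in the two treebanks would not be caught by it.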

jankounchained commented 1 year ago

Even with the same book split so that it appears in all 3 folds... is spaCy that smart, or are we lucky?