jankounchained closed 1 year ago
I collected source tags from all conll comments and then created a matrix of set intersections between the sources in the dev, train and test sets.

| | train | dev | test |
|---|---|---|---|
| train | {Histories, Book 1, chapter 107\n, Histories, ...} | {} | {} |
| dev | {} | {Histories, Book 5, chapter 110\n, Histories, ...} | {} |
| test | {} | {} | {Histories, Book 1, chapter 21\n, Histories, B...} |
Since only the diagonal is populated, we can conclude that there is no overlap between any of the three datasets.
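
For anyone who wants to rerun this kind of check, here is a minimal sketch of the idea, not the exact script used for the report. It assumes the source tag sits in a comment line starting with `# source =`; that prefix and the helper names `source_tags`, `overlap_matrix` and `has_leak` are made up for illustration, so adjust them to whatever the treebank comments actually look like.

```python
from itertools import product


def source_tags(conllu_text, prefix="# source ="):
    """Collect the values of all comment lines that start with `prefix`.

    The prefix is an assumption; change it to match the real comment key.
    """
    tags = set()
    for line in conllu_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tags.add(line[len(prefix):].strip())
    return tags


def overlap_matrix(splits):
    """Map every (row, column) pair of split names to the set of shared tags."""
    tags = {name: source_tags(text) for name, text in splits.items()}
    return {(a, b): tags[a] & tags[b] for a, b in product(tags, repeat=2)}


def has_leak(matrix):
    """True if any off-diagonal cell is non-empty, i.e. two splits share a source."""
    return any(cell for (a, b), cell in matrix.items() if a != b)
```

Usage would be something like `has_leak(overlap_matrix({"train": train_text, "dev": dev_text, "test": test_text}))`, where each value is the raw text of one split. Note that this compares whole tags, so two different chapters of the same book count as distinct sources.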
It even splits the same book so that it ends up in all 3 folds. Is spaCy that smart, or are we lucky?
e.g. Herodotus is in both treebanks.
In the worst-case scenario, we would have Herodotus in both train.spacy and test.spacy. A report generated in #7 reveals there are no overlaps between train and dev subsets (if I understand it right).