dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0

Replacement GoldStandard 103.txt provided by Miloš Jakubíček - fixes #9 #16

Closed by tfmorris 8 years ago

tfmorris commented 8 years ago

I asked on the CleanEval mailing list, and Miloš Jakubíček (@mjakubicek) was able to find the original file and post it at https://downloads.sketchengine.co.uk/103.txt. This PR adds it back to the repo.

habernal commented 8 years ago

Thanks!