dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0
50 stars 8 forks

Add use-case example: search for patterns in C4Corpus #26

Closed by habernal 8 years ago

habernal commented 8 years ago

A simple use-case example that searches the corpus for regex occurrences would be nice.

habernal commented 8 years ago

I guess a simple regex is just fine for the moment.
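
For illustration, a simple regex search of this kind might look roughly like the sketch below. It is not part of C4CorpusTools and does not use its API: the function names `find_matches` and `search_gzip_file` are hypothetical, and it assumes the text to search is available as plain lines or as a gzip-compressed, UTF-8 text file.

```python
import gzip
import re
from typing import Iterable, Iterator, Tuple


def find_matches(lines: Iterable[str], pattern: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, matched_text) for every regex occurrence.

    Hypothetical helper for illustration; line numbers start at 1.
    """
    regex = re.compile(pattern)
    for line_number, line in enumerate(lines, start=1):
        # finditer reports every non-overlapping match on the line
        for match in regex.finditer(line):
            yield line_number, match.group(0)


def search_gzip_file(path: str, pattern: str) -> Iterator[Tuple[int, str]]:
    """Search a gzip-compressed text file line by line.

    Assumes UTF-8 text; undecodable bytes are replaced rather than
    raising an error.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        yield from find_matches(handle, pattern)
```

Reading line by line keeps memory use flat for large files, at the cost of missing patterns that span a line break.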