DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Trafilatura #25

Closed by DavidNemeskey 1 year ago

DavidNemeskey commented 1 year ago

Added support for Trafilatura-based boilerplate removal.
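
For readers unfamiliar with Trafilatura, here is a minimal sketch of what Trafilatura-based boilerplate removal can look like. It is not the code from this pull request: the names `trafilatura_extractor` and `remove_boilerplate` are hypothetical, and the only library call assumed is `trafilatura.extract`, which takes an HTML string and returns the main text, or `None` when nothing is extracted.

```python
from typing import Callable, Optional


def trafilatura_extractor(html: str) -> Optional[str]:
    """Extract the main text of a page with Trafilatura.

    Hypothetical helper, not part of cc_corpus. The import is done lazily
    so that the module loads even where trafilatura is not installed.
    """
    import trafilatura  # third-party: pip install trafilatura

    return trafilatura.extract(html)


def remove_boilerplate(
    html: str,
    extractor: Callable[[str], Optional[str]] = trafilatura_extractor,
) -> str:
    """Return the page content without boilerplate (hypothetical wrapper).

    An empty string is returned when the extractor finds no main content,
    so callers can simply skip such documents.
    """
    text = extractor(html)
    return text if text else ""
```

Keeping the extractor as a parameter is only a design choice for this sketch: it lets another boilerplate remover be swapped in without changing the calling code.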