DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

Resulting (downloaded & filtered/processed) dataset available? #48

Closed by arpitest 1 year ago

arpitest commented 1 year ago

I've found this, but it looks very old: https://nessie.ilab.sztaki.hu/~ndavid/Webcorpus2_text/

Do you have an updated version built from the 2022/2023 Common Crawl using the updated scripts/processing pipeline from this repo?

DavidNemeskey commented 1 year ago

@arpitest We do, but we are not allowed to publicly share it due to legal reasons. As far as I understand, however, there is an exemption if you intend to use it for research only. Drop me a line and we can discuss the details (see e.g. the top of this paper).