@arpitest We do, but we are not allowed to publicly share it due to legal reasons. As far as I understand, however, there is an exemption if you intend to use it for research only. Drop me a line and we can discuss the details (see e.g. the top of this paper).
I've found this, but it looks very old: https://nessie.ilab.sztaki.hu/~ndavid/Webcorpus2_text/
Do you have an updated version built from the 2022/2023 Common Crawl, using the updated scripts/processing pipeline from this repo?