csebuetnlp / banglabert

This repository contains the official release of the model "BanglaBERT" and the associated downstream finetuning code and datasets introduced in the paper "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla", accepted in Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022).

The websites the dataset was scraped from? #6

Closed imr555 closed 1 year ago

imr555 commented 2 years ago

Since the Alexa Web rankings (https://www.alexa.com/topsites/countries/BD) shut down in May 2022, it is no longer possible to retrieve the names of the Bangladeshi websites that were used.

It would be really useful if the names of the fifty Bangladeshi websites used to scrape the dataset could be released. This would help in understanding the nature of the data used to train the model, and would also support model interpretability experiments.

abhik1505040 commented 1 year ago

The pretraining data sources are enumerated in the appendix of our paper.