IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Create dataset loader for Word frequency distribution #343

Open SamuelCahyawijaya opened 1 year ago

SamuelCahyawijaya commented 1 year ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?freq_dist_id

Dataset: freq_dist_id
Description: Word frequency lists compiled from four sources: Kompas, Wikipedia, Twitter, and Kaskus. Each list contains the 10,000 most frequent words for its source, along with the statistical distribution (Zipf graph).
License: Unknown
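
For anyone picking this up, here is a minimal sketch of what a "top-k word frequency list" and the points of a Zipf graph look like in code. It is an illustration only: the issue does not state the file format of freq_dist_id, so the sketch works on an in-memory token list, and the names `top_k_frequencies` and `zipf_points` are hypothetical helpers, not part of nusa-crowd or of the dataset itself.

```python
import math
from collections import Counter


def top_k_frequencies(tokens, k=10_000):
    """Return the k most frequent tokens as (rank, word, count) rows.

    Hypothetical helper: the real dataset ships precomputed lists, and its
    on-disk format is not described in this issue.
    """
    counts = Counter(tokens)
    # most_common sorts by descending count; ties keep first-seen order.
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(k), start=1)
    ]


def zipf_points(rows):
    """Map rows to (log10 rank, log10 count) pairs, the axes of a Zipf plot."""
    return [(math.log10(rank), math.log10(count)) for rank, _, count in rows]


# Toy input standing in for one source (e.g. Kompas); not real data.
sample = "yang dan yang di dan yang".split()
rows = top_k_frequencies(sample, k=3)
points = zipf_points(rows)
```

A loader for this dataset would presumably yield rows of this shape (rank, word, count) per source, but that schema is an assumption to be checked against the actual files.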