facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

[Question] Is Crawl 300D 2M dataset all lowercase? #1290

Closed teohsinyee closed 2 years ago

teohsinyee commented 2 years ago

Source: https://www.kaggle.com/datasets/yekenot/fasttext-crawl-300d-2m

teohsinyee commented 2 years ago

Though I couldn't find the answer on the official fastText site, I found it on the GloVe site, which lists:

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip

So the answer is that the dataset is cased, meaning it contains both uppercase and lowercase words. I have also uploaded the CSV file to Kaggle: https://www.kaggle.com/datasets/teohsinyee/word-of-common-crawl-cased-300d-vectors
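
Since the listing quoted above comes from the GloVe site, it may also be worth checking the fastText file directly. Below is a minimal sketch, assuming the vectors are in the plain-text `.vec` layout (an optional `<vocab_size> <dim>` header line, then one token followed by its values per line). The function name `count_cased_tokens`, the `limit` parameter, and the local file path in the usage comment are hypothetical, chosen only for illustration.

```python
def count_cased_tokens(path, limit=100_000):
    """Scan a text-format vector file and count tokens containing uppercase letters.

    Returns (tokens_scanned, tokens_with_uppercase).
    Assumes each line is "<token> <v1> <v2> ...", optionally preceded by a
    "<vocab_size> <dim>" header line. Pass limit=None to scan the whole file.
    """
    scanned = 0
    cased = 0
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for i, line in enumerate(f):
            if limit is not None and scanned >= limit:
                break
            parts = line.rstrip("\n").split(" ")
            # Skip the header line if the file has one (two integers).
            if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue
            token = parts[0]
            if not token:
                continue
            scanned += 1
            # A token is "cased" here if lowercasing it changes it.
            if token != token.lower():
                cased += 1
    return scanned, cased


# Example usage (the path is an assumption; point it at your local copy):
# scanned, cased = count_cased_tokens("crawl-300d-2M.vec")
# print(f"{cased} of the first {scanned} tokens contain uppercase letters")
```

If the second number is greater than zero, the vocabulary is not all lowercase, so lowercasing your input text before lookup would discard the distinct vectors stored for capitalized forms.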