facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Binary model that was trained on Common crawl #428

MrBoor closed this issue 6 years ago

MrBoor commented 6 years ago

Hello! I enjoy using your library and pretrained vectors. I see that for the vectors trained on Wikipedia you provide both a binary model and pretrained vectors. However, for the vectors trained on Common Crawl, you only provide pretrained vectors. Would it be possible for you to publish the binary model for them as well?

Thanks, Alexander.

orech commented 6 years ago

That would be very helpful for me as well

JovanVeljanoski commented 6 years ago

I would also very much appreciate it if you could publish the binary model. Thanks!

rboyes commented 6 years ago

Yes it would be very useful

rboyes commented 6 years ago

The English link you posted above only contains the word vectors, not the model .bin files, which are what we are asking for.

With the model files, we can create vectors for out-of-vocabulary words, but we can't do that with the word vectors only.
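To illustrate why the .bin file matters here, the following is a minimal sketch of how a subword model can build a vector for an unseen word by averaging the vectors of its character n-grams. It is not fastText's actual code: the function names `char_ngrams` and `oov_vector` are hypothetical, `crc32` stands in for the library's real hash function, and `bucket_vectors` is an assumed toy table of n-gram vectors. The text .vec file contains only per-word vectors, so it has nothing equivalent to that table.

```python
import zlib

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    token = f"<{word}>"
    # Lengths below 1 are skipped, so minn=0 and maxn=0 yield no n-grams at all.
    return [token[i:i + n]
            for n in range(max(minn, 1), maxn + 1)
            for i in range(len(token) - n + 1)]

def oov_vector(word, bucket_vectors, minn=3, maxn=6):
    """Average the bucket vectors selected by hashing each n-gram (toy hashing)."""
    dim = len(bucket_vectors[0])
    grams = char_ngrams(word, minn, maxn)
    if not grams:
        # No subword information available: nothing to average.
        return [0.0] * dim
    total = [0.0] * dim
    for gram in grams:
        # crc32 is an assumption for illustration, not the hash fastText uses.
        row = bucket_vectors[zlib.crc32(gram.encode("utf-8")) % len(bucket_vectors)]
        total = [t + r for t, r in zip(total, row)]
    return [t / len(grams) for t in total]
```

For example, `char_ngrams("cat", 3, 3)` yields the n-grams `<ca`, `cat` and `at>`, each of which would select one row of the n-gram table.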

phdowling commented 6 years ago

Also interested in this. The bin files for english would be very valuable.

m09 commented 6 years ago

I would also be interested in the binary vectors.

Schneitzer commented 6 years ago

Is there a reason why the .bin file will not be made open to the public?

It would be really helpful to be able to generate OOV word vectors for English words, but without the .bin file this would not be possible.

maxfriedrich commented 6 years ago

I found a link to an English .bin in the comments of #494: https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.bin.zip

Schneitzer commented 6 years ago

Thank you maxfriedrich.

However, I think most of us would like to see the .bin file for the model trained on the Common Crawl corpus. The link you provided only contains the vectors trained on Wikipedia and news data, not on Common Crawl.

I'm currently working on text classification tasks on Tweets, so it would be nice to have the Common Crawl vectors. Hope it will be published later.

rktamplayo commented 6 years ago

Any update on this? I hope an admin can at least assign someone to answer our queries...

thusithaC commented 6 years ago

This is indeed strange. For non-English languages, the Common Crawl binaries are available, but for English (which is the most widely used) they are missing?

yuchsiao commented 6 years ago

Just checking back in to see if there is any plan to release the Common Crawl version of the binaries for English. Any update?

vdpappu commented 6 years ago

Just bumping this. Checking whether we could get the binaries for Common Crawl.

EdouardGrave commented 6 years ago

Hi all,

Thank you for raising this issue.

The model trained on the Common Crawl data did not use subwords, and thus the binary model would not contain any more information than the text file that we released. In particular, this binary model could not be used to compute representations for out-of-vocabulary words. This is the reason why we did not release the binary model.

We will likely release a model trained on crawl data with subwords in the near future (both binary and text models will be released).

Best, Edouard.
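The explanation above can be sketched with a toy lookup, under stated assumptions: `lookup` is a hypothetical helper, n-gram vectors are kept in a plain dictionary rather than a hashed table, and in-vocabulary words simply return their stored vector. It is only meant to show that when no subword lengths are configured (`minn=0`, `maxn=0`), an out-of-vocabulary word has no n-grams to draw on, so a binary model offers nothing beyond the text file.

```python
def lookup(word, word_vectors, ngram_vectors, minn, maxn):
    """Return (vector, source) for a word in a toy fastText-like model."""
    if word in word_vectors:
        return word_vectors[word], "vocabulary"
    token = f"<{word}>"
    # With maxn=0 this range is empty, so no n-grams are produced.
    grams = [token[i:i + n]
             for n in range(max(minn, 1), maxn + 1)
             for i in range(len(token) - n + 1)]
    known = [ngram_vectors[g] for g in grams if g in ngram_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not known:
        # No subword information: fall back to a zero vector.
        return [0.0] * dim, "none"
    # Average the vectors of the n-grams the model knows about.
    return [sum(col) / len(known) for col in zip(*known)], "subwords"

# Hypothetical toy data, for illustration only.
word_vectors = {"cat": [1.0, 0.0]}
ngram_vectors = {"<ca": [2.0, 2.0], "at>": [0.0, 4.0]}

with_subwords = lookup("car", word_vectors, ngram_vectors, 3, 3)
without_subwords = lookup("car", word_vectors, ngram_vectors, 0, 0)
```

Here `with_subwords` is built from the shared n-gram `<ca`, while `without_subwords` falls back to a zero vector because there is nothing to compose.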

thusithaC commented 5 years ago

@EdouardGrave Hi Edo, any update on the subword model trained on Common Crawl?