avidale / compress-fasttext

Tools for shrinking fastText models (in gensim format)
MIT License

Cannot load the compressed model with the Facebook executable #4

Closed ValleZ closed 3 years ago

ValleZ commented 3 years ago

./fastText/fasttext nn compressed.bin
terminate called after throwing an instance of 'std::invalid_argument'
  what():  compressed.bin has wrong file format!
Aborted

avidale commented 3 years ago

@ValleZ Could you please provide a piece of code that reproduces the error, and environment in which it reproduces (OS, Python version, versions of Python packages, etc.)?

ValleZ commented 3 years ago

Sure, I used the snippet from your example:

from gensim.models.fasttext import load_facebook_model
import compress_fasttext
big_model = load_facebook_model('original_model.bin').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('compressed.bin')

Then I used the fasttext executable to check it: ./fasttext nn compressed.bin

It's WSL2/Ubuntu, Python 3.8.5, with the following packages installed:

Package Version
attrs 19.3.0
Automat 0.8.0
bitarray 1.6.1
blinker 1.4
certifi 2019.11.28
chardet 3.0.4
Click 7.0
cloud-init 20.2
colorama 0.4.3
command-not-found 0.3
compress-fasttext 0.0.6
configobj 5.0.6
constantly 15.1.0
cryptography 2.8
dataclasses 0.6
dbus-python 1.2.16
distro 1.4.0
distro-info 0.23ubuntu1
entrypoints 0.3
fasttext 0.9.2
filelock 3.0.12
future 0.18.2
gensim 3.8.3
httplib2 0.14.0
hyperlink 19.0.0
idna 2.8
importlib-metadata 1.5.0
incremental 16.10.1
Jinja2 2.10.1
joblib 0.17.0
jsonpatch 1.22
jsonpointer 2.0
jsonschema 3.2.0
keyring 18.0.1
language-selector 0.1
launchpadlib 1.10.13
lazr.restfulclient 0.14.2
lazr.uri 1.0.3
lshash3 0.0.8
MarkupSafe 1.1.0
more-itertools 4.2.0
netifaces 0.10.4
numpy 1.19.4
oauthlib 3.1.0
packaging 20.7
pip 20.0.2
pipe 1.6.0
pqkmeans 1.0.4
protobuf 3.14.0
pyasn1 0.4.2
pyasn1-modules 0.2.1
pybind11 2.6.1
PyGObject 3.36.0
PyHamcrest 1.9.0
PyJWT 1.7.1
pymacaroons 0.13.0
PyNaCl 1.3.0
pyOpenSSL 19.0.0
pyparsing 2.4.7
pyrsistent 0.15.5
pyserial 3.4
python-apt 2.0.0+ubuntu0.20.4.1
python-debian 0.1.36ubuntu1
PyYAML 5.3.1
regex 2020.11.13
requests 2.22.0
requests-unixsocket 0.2.0
sacremoses 0.0.43
scikit-learn 0.23.2
scipy 1.5.4
SecretStorage 2.3.1
sentencepiece 0.1.91
service-identity 18.1.0
setuptools 45.2.0
simplejson 3.16.0
six 1.14.0
smart-open 4.0.1
ssh-import-id 5.10
systemd-python 234
texmex-python 1.0.0
threadpoolctl 2.1.0
tokenizers 0.9.3
torch 1.7.0
tqdm 4.54.0
transformers 3.5.1
Twisted 18.9.0
typing 3.7.4.3
typing-extensions 3.7.4.3
ubuntu-advantage-tools 20.3
ufw 0.36
unattended-upgrades 0.1
urllib3 1.25.8
wadllib 1.3.3
wheel 0.34.2
zipp 1.0.0
zope.interface 4.7.1

avidale commented 3 years ago

The saved binary is intended to be used with this particular Python library, which is in turn a wrapper around Gensim. We never intended it to work with the Facebook FastText binary.
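
The "wrong file format" message is consistent with a header check: the fastText executable appears to read two 32-bit integers (a magic number and a format version) from the start of the file and to reject anything that does not match. Below is a minimal sketch of the same check in Python. The constants 793712314 and 12 are quoted from the fastText sources as an assumption, little-endian byte order is assumed, and `looks_like_fasttext_bin` is a hypothetical helper that belongs to neither library.

```python
import struct

# Assumed values of FASTTEXT_FILEFORMAT_MAGIC_INT32 and FASTTEXT_VERSION
# from the fastText C++ sources; verify against your fastText checkout.
FASTTEXT_MAGIC = 793712314
FASTTEXT_MAX_VERSION = 12


def looks_like_fasttext_bin(path):
    """Return True if the file starts with a header the fastText CLI would likely accept."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        # Too short to contain the magic number and the version.
        return False
    # Two little-endian signed 32-bit integers: magic number, then version.
    magic, version = struct.unpack("<ii", header)
    return magic == FASTTEXT_MAGIC and version <= FASTTEXT_MAX_VERSION
```

Calling `looks_like_fasttext_bin('compressed.bin')` on a file written by `small_model.save(...)` would be expected to return False, since that file is a Gensim-style serialization and not a native fastText binary.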

ValleZ commented 3 years ago

Then the description and project name is misleading. "This Python 3 package allows to compress fastText word embedding models" sounds like it keeps the original format.

ValleZ commented 3 years ago

Is there a way to convert compressed models back to fasttext format or maybe load it somehow to use with fasttext?

avidale commented 3 years ago

> Then the description and project name is misleading. "This Python 3 package allows to compress fastText word embedding models" sounds like it keeps the original format.

The title reads "This Python 3 package allows to compress fastText word embedding models (from the gensim package)". How do you propose to reformulate it to show more explicitly that the package works with gensim?

avidale commented 3 years ago

> Is there a way to convert compressed models back to fasttext format or maybe load it somehow to use with fasttext?

Not really: 2 out of 3 techniques, matrix decomposition and product quantization, are not supported by the original Facebook implementation, and if you want them to work with the Facebook executable, you'll need to patch and rebuild the executable itself.

With pruning, however, it might be possible to save a model in a format compatible with the Facebook FastText implementation. You'll need to use the save_facebook_model function, and maybe change some of the functions from the _fasttext_bin.py file. @ValleZ, it would be great if you tried it and maybe created a pull request with this enhancement.

ValleZ commented 3 years ago

> > Then the description and project name is misleading. "This Python 3 package allows to compress fastText word embedding models" sounds like it keeps the original format.
>
> The title reads "This Python 3 package allows to compress fastText word embedding models (from the gensim package)". How do you propose to reformulate it to show more explicitly that the package works with gensim?

"fastText to gensim compressor" maybe?

ValleZ commented 3 years ago

I'll try the save_facebook_model, thanks!

ValleZ commented 3 years ago

No luck with save_facebook_model :-(

from gensim.models.fasttext import save_facebook_model
save_facebook_model(small_model, "tor2_cmp.bin")
Traceback (most recent call last):
  File "", line 1, in
  File "/home/valle/.local/lib/python3.8/site-packages/gensim/models/fasttext.py", line 1336, in save_facebook_model
    gensim.models._fasttext_bin.save(model, path, fb_fasttext_parameters, encoding)
  File "/home/valle/.local/lib/python3.8/site-packages/gensim/models/_fasttext_bin.py", line 668, in save
    _save_to_stream(model, fout_stream, fb_fasttext_parameters, encoding)
  File "/home/valle/.local/lib/python3.8/site-packages/gensim/models/_fasttext_bin.py", line 626, in _save_to_stream
    _args_save(fout, model, fb_fasttext_parameters)
  File "/home/valle/.local/lib/python3.8/site-packages/gensim/models/_fasttext_bin.py", line 506, in _args_save
    field_value = _get_field_from_model(model, field)
  File "/home/valle/.local/lib/python3.8/site-packages/gensim/models/_fasttext_bin.py", line 475, in _get_field_from_model
    return model.window
AttributeError: 'FastTextKeyedVectors' object has no attribute 'window'
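
The traceback suggests that save_facebook_model reads training hyperparameters such as `window` from the object it is given, while `small_model` is a FastTextKeyedVectors instance that holds only the vectors. A small pre-flight check can list which attributes are missing before attempting the save. This is only a sketch: `missing_training_attrs` is a hypothetical helper, and apart from `window` (taken from the traceback) the attribute names are assumptions about what a full Gensim FastText model exposes.

```python
# Attribute names the writer is assumed to read from a full FastText model.
# Only `window` is confirmed by the traceback; the rest are assumptions.
REQUIRED_TRAINING_ATTRS = ("window", "negative", "epochs", "sg")


def missing_training_attrs(model, required=REQUIRED_TRAINING_ATTRS):
    """Return the names in `required` that `model` does not expose as attributes."""
    return [name for name in required if not hasattr(model, name)]
```

If `missing_training_attrs(small_model)` returns a non-empty list, the save would presumably fail in the same way, and those values would have to be supplied from the original full model or by patching _fasttext_bin.py.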