Kyubyong / wordvectors

Pre-trained word vectors of 30+ languages
MIT License

fasttext file format seems wrong #14

Open adodge opened 6 years ago

adodge commented 6 years ago

Thank you very much for this project. It seems very useful.

I don't seem to be able to use the fasttext files, at least not the Russian or Turkish ones. When attempting to load them with fasttext, I get this error:

$ fasttext print-word-vectors ru.bin
terminate called after throwing an instance of 'std::invalid_argument'
  what():  ru.bin has wrong file format!
Aborted

On closer inspection, the files are missing the fasttext magic number in their header. Fasttext binary files are expected to start with 0x2F4F16BA, and this one doesn't.
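
For anyone who wants to run the same check, here is a minimal sketch that reads the first four bytes of a file and compares them to the magic number quoted above. The helper name `has_fasttext_magic` is made up for illustration, and the code assumes the value was written as a little-endian 32-bit integer, which may not hold on every platform.

```python
import struct

# Magic number quoted in this issue (793712314 in decimal).
FASTTEXT_MAGIC = 0x2F4F16BA


def has_fasttext_magic(path):
    """Return True if the file starts with the fastText magic number.

    Assumption: the header was written as a little-endian int32.
    """
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        # File too short to contain a header at all.
        return False
    (value,) = struct.unpack("<i", header)
    return value == FASTTEXT_MAGIC
```

Calling `has_fasttext_magic("ru.bin")` on one of the files from this repository should return False if the header is missing, as described above.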

Were they created by some other software, or perhaps an older version of fasttext that had a different file format?

Thank you.

adodge commented 6 years ago

I did a little poking around in the fasttext history, and, yes, they had a different file format a year ago.

This is a script that will convert one of the old fasttext files to something the current version can read:

fasttext_file_update.py.txt

$ echo merhaba | fasttext print-word-vectors tr.bin2
merhaba 0.12206 0.066014 0.093112 -0.043492 0.5207 0.057019 0.20127 0.20933 0.057977 -0.29209 0.087561 0.05825 0.50264 -0.17409 0.19332 -0.08724 0.35125 0.045985 0.21882 0.1872 0.16603 0.21172 0.17046 0.062976 -0.022134 -0.50327 -0.064927 0.1336 0.10681 -0.1902 0.030359 -0.075208 -0.19389 0.40742 0.078176 0.11845 -0.057126 0.52497 0.11417 0.36205 -0.055332 -0.2492 0.46497 0.72146 0.42214 0.082853 0.035755 -0.1644 -0.23566 0.1037 -0.079192 0.15678 -0.14464 -0.023746 0.11418 0.21951 -0.20679 -0.11682 -0.020332 -0.07834 0.27913 -0.59613 -0.15867 0.15623 0.066335 0.078509 -0.0045359 -0.15227 -0.025417 -0.14899 -0.25298 0.2158 -0.26728 0.071114 -0.86768 -0.39044 -0.36575 0.053666 0.38771 0.3328 0.085293 -0.12563 0.13022 -0.21437 0.31115 0.013396 0.02462 -0.25962 -0.51704 -0.55816 0.43276 0.25894 -0.55603 0.3785 -0.13968 0.0031102 0.23232 0.11755 0.17286 -0.14933 0.19528 0.36565 -0.19717 0.066704 -0.20812 -0.32329 -0.09979 -0.34596 0.12763 -0.26259 -0.13747 -0.056275 0.47636 -0.068787 0.05284 -0.16213 -0.57922 -0.15148 0.31464 0.23883 -0.43305 0.21852 -0.082744 0.26875 -0.28505 -0.379 -0.24597 -0.11538 0.22466 -0.17107 0.047522 0.31911 0.15056 0.21347 0.16531 -0.078537 0.14234 0.090975 -0.4294 0.067041 0.085503 0.41908 0.18248 0.18221 0.10699 -0.21135 0.1343 -0.05573 -0.16256 -0.39946 0.086395 -0.030858 -0.66857 0.58846 0.17388 0.56812 -0.088791 -0.024312 -0.054497 -0.075219 -0.0048822 -0.17311 0.070715 0.080788 0.14496 0.45174 0.071725 -0.14704 0.56277 0.058342 0.67329 0.22379 -0.13657 -0.11677 0.31955 0.21028 -0.24803 -0.34743 0.0019436 0.26037 0.49244 0.2648 -0.07083 -0.26863 -0.24654 -0.025958 -0.27783 -0.045067 -0.068344 0.16087 0.11595 -0.044365 0.029121 0.12629 0.28304 0.23161 -0.17879 -0.092399 -0.38922 -0.24235

yaziciemre commented 5 years ago

The conversion script does not work for me either:

Traceback (most recent call last):
  File "fast_convert.py", line 57, in <module>
    m,n = struct.unpack("@qq", M[offset:offset+span])
struct.error: unpack requires a string argument of length 16
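
The error says that the slice `M[offset:offset+span]` held fewer than the 16 bytes that the `"@qq"` format needs, so the read probably started at a wrong offset or ran past the end of the data. Below is a rough sketch of a guarded unpack that reports how many bytes were actually available. The helper `unpack_pair` is hypothetical and is not taken from the attached script.

```python
import struct

PAIR_FORMAT = "@qq"  # same format string as in the traceback
PAIR_SIZE = struct.calcsize(PAIR_FORMAT)  # 16 on common platforms


def unpack_pair(buf, offset):
    """Unpack two native 64-bit integers from buf at offset.

    Raises ValueError with a diagnostic message instead of a bare
    struct.error when fewer than PAIR_SIZE bytes are available.
    """
    chunk = buf[offset:offset + PAIR_SIZE]
    if len(chunk) != PAIR_SIZE:
        raise ValueError(
            "need %d bytes at offset %d, but only %d available "
            "(buffer length %d)" % (PAIR_SIZE, offset, len(chunk), len(buf))
        )
    return struct.unpack(PAIR_FORMAT, chunk)
```

The message from a check like this shows whether the offset is already beyond the end of the buffer or only a few bytes short, which helps tell a truncated download from a layout mismatch.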