BIDData / BIDMach

CPU and GPU-accelerated Machine Learning Library
BSD 3-Clause "New" or "Revised" License

BidMach word2vec #96

Open fanganjie opened 8 years ago

fanganjie commented 8 years ago

Could you give an example of how to use BIDMach word2vec on CPU and on GPU, respectively?

Thanks in advance, Andy.

MiguelAngelRG commented 7 years ago

Hi everyone, I am trying to use word2vec with my own corpus but I have some issues when I try to load it. Could I have a small example to check what is wrong with mine?

Thanks in advance,

Miguel Angel.

tmsimont commented 7 years ago

I'm also not sure how to get Word2Vec working with BIDMach, so any help here would be appreciated.

tmsimont commented 7 years ago

I found a test file that seems to point to some example of Word2Vec use: https://github.com/BIDData/BIDMach/blob/07969c1cc95180ac0b3d8a59e299de8359cb7dec/scripts/getw2vdata.sh

However, this also seems to depend on tparse2.exe

What is tparse2.exe? Is there something that would work in its place on linux?

MiguelAngelRG commented 7 years ago

Hi Tmsimont,

tparse2.exe is an executable compiled from https://github.com/BIDData/BIDMach/blob/07969c1cc95180ac0b3d8a59e299de8359cb7dec/src/main/C/newparse/tparse2.cpp

Its main purpose is to convert the words into a two-row matrix, which is exactly the representation that you need to use this library.

tmsimont commented 7 years ago

Hmm, after using maven to build I see that I have tparse.exe compiled. When I try to run this script, however, I still get errors. It seems that fmt.txt is missing. Any idea where that comes from?

tmsimont commented 7 years ago

Never mind... it seems just about everything in that shell script is out of date and missing files. This seems to be leading me further away from figuring out how to get this to work.

I'd love to know just how to get a simple training process running on the text8 corpus, or any corpus for that matter. Is there no documentation anywhere on how to get word2vec up and running?

MiguelAngelRG commented 7 years ago

Hi tmsimont, the "fmt.txt" is a format file; there is more information about it on this wiki page: https://github.com/BIDData/BIDMach/wiki/Data-Wrangling The aim of this file is to tell tparse2.exe what the format of your text is.

Yes, I understand how you feel; I have spent a lot of time trying to understand how this library works. It is quite painful. At first I tried to compile it by myself, but it was impossible: too many dependencies, libraries and so on.

I hope this helps you

DanielTakeshi commented 7 years ago

Sorry about this. If you're looking for installation instructions, installation is now done with maven. See the Installing and Running wiki.

Unfortunately I've never used word2vec, and @jcanny and the rest of us are busy with papers at the moment.

tmsimont commented 7 years ago

Thanks @MiguelAngelRG I'll take a look.

@DanielTakeshi the installation isn't really a problem. It's actually using BIDMach to learn word vectors that's the issue. Once BIDMach is installed I can run ./bidmach and get a scala interpreter... but then what do you do to get Word2Vec to train on a corpus?

Specifically, the paper Machine Learning at the Limit reports 8.5M words/s on an NVIDIA Titan-X. I'm trying to reproduce this on a Tesla P100 to see what the speeds are, but I have no idea how BIDMach was used to train a data set so quickly, what data was trained at that speed, what format the data was in, etc.

I'm going through a lot of BIDMach documents but I don't see much on how word2vec can be used within the framework.

MiguelAngelRG commented 7 years ago

@DanielTakeshi thanks for the clarification, it is working now, but it took me a lot of time to solve. The problem was that I only wanted to use it, and the documentation does not say that you are required to install it in order to use it; that is why installing was my last option. But then, after installing, it was "easy" to make it work. On the other hand, I think it would be nice to have more documentation about how to work with this Word2Vec, especially with the parallel version. @tmsimont no worries, it has been a pleasure. I hope I helped you.

Thank you @DanielTakeshi

MiguelAngelRG commented 7 years ago

@jcanny @DanielTakeshi I have one question: how many cores do I need to get the same performance that you got in the paper titled "Machine Learning at the Limit"? I am doing some experiments with 2 or 3 nodes with 14x2 threads each plus 3 GPUs, and the BIDMach implementation needs a lot of time to generate the embeddings.

MiguelAngelRG commented 7 years ago

@tmsimont have you been able to achieve the speed reported in "Machine Learning at the Limit"? I am using better GPUs and more cores and I have not come anywhere close to this performance. I have checked other frameworks and they are faster than BIDMach. :(

tmsimont commented 7 years ago

@MiguelAngelRG No -- I have not been able to get anything to run... I have written a few of my own custom kernels and no matter what I do I can only get about 6M words per second on a Tesla P100. I'm looking at the CUDA kernel code in BIDMach and am not sure how it would squeeze out an additional 2M words/sec, unless some parameters were much different (fewer negative samples, or something else...). It's hard to say, since many parameters were not reported in the paper, and I can't figure out how to get a basic example to run.

MiguelAngelRG commented 7 years ago

@tmsimont I do not know either. To make it work, these are the scripts that I have used:

tmsimont commented 7 years ago

@MiguelAngelRG Thanks -- I was toying with these files, too, but couldn't figure out what the contents of fmt.txt should be. I know this file designates the format of the data that Word2Vec needs, but I'm not sure what it should be.

What does your fmt.txt look like?

MiguelAngelRG commented 7 years ago

@tmsimont it depends on the text that you want to use. Here you can see the different parameters that you can use to build your format file:

https://github.com/BIDData/BIDMach/wiki/Data-Wrangling

MiguelAngelRG commented 7 years ago

@tmsimont Also, here are the different format files that they provide: https://github.com/BIDData/BIDMach/blob/master/data/rcv1_fmt.txt https://github.com/BIDData/BIDMach/blob/master/data/uci_fmt.txt https://github.com/BIDData/BIDMach/blob/master/data/uci_wfmt.txt

I hope this helps you.

tmsimont commented 7 years ago

Thanks, @MiguelAngelRG

I saw these files you've linked to previously and it doesn't seem like any of them would properly describe the input structure of the typical Word2Vec training data set.

The 1-billion-word files that are linked in that shell script are in the form of a single sentence per line. (As are many other similar training sets)

I've read these a few times now: https://github.com/BIDData/BIDMach/wiki/Data-Wrangling#word-fields https://github.com/BIDData/BIDMach/wiki/Data-Wrangling#string-fields

It says, for a schema of word sometext:

The word field type produces a single numeric token for each field, using a dictionary. e.g. "the white house" encodes as the literal string "the white house".

So does that mean each sentence has its own token? That doesn't really make any sense for Word2Vec. Word2Vec should tokenize words in a sentence, and then learn by examining surrounding word contexts.

Furthermore... what is sometext? The uci_wfmt.txt says word term and not word sometext... What does this mean?

I tried using the format in uci_wfmt.txt, and the getw2vdata.sh script seems to hang after

...
Processing news.en-00097-of-00100
305532 lines processed
Processing news.en-00098-of-00100
306180 lines processed
Processing news.en-00099-of-00100
305893 lines processed

Were you able to get past the tparse2 call in that script? I'm not sure what is going on when the script stalls out on me.

tmsimont commented 7 years ago

Oops, I posted too soon. The script does finally get past the tparse execution...

Then it fires off getw2vdata.ssc, only to trigger a bunch of file-not-found errors... It looks like it's searching for sbmat.gz files that were not created during the tparse execution... oy...

tmsimont commented 7 years ago

OK... so... word term apparently named some of the outputs of tparse as ...term.. and term...

So I changed those to match the format I see in getw2vdata.ssc and lo and behold the File Not Found errors disappear... only to yield more errors:

Loading getw2vdata.ssc...
Switched off result printing.
java.lang.NegativeArraySizeException
  at BIDMat.SBMat$.SnoRows(SBMat.scala:144)
  at BIDMat.HMat$.loadSBMat(HMat.scala:999)
  at BIDMat.HMat$.loadSBMat(HMat.scala:980)
  at BIDMat.MatFunctions$.loadSBMat(MatFunctions.scala:2021)
  ... 54 elided
<console>:29: error: not found: value words0
       val words = words0(ii);
                   ^
<console>:29: error: not found: value words
              saveCSMat("../data/word2vec/data/dict.csmat.lz4", words);
                                                                ^
<console>:30: error: not found: value words
       val map1 = dict2 --> Dict(words);
                                 ^
java.lang.RuntimeException: col index out of range 1 1
  at BIDMat.DenseMat$mcI$sp.gapply$mcI$sp(DenseMat.scala:456)
  at BIDMat.DenseMat$mcI$sp.gapply$mcI$sp(DenseMat.scala:498)
  at BIDMat.IMat.apply(IMat.scala:96)
  at .repartition(<console>:47)
  ... 60 elided
<console>:36: error: not found: value map1
                          map1,
                          ^
Switched on result printing.

Any idea what that is about?

MiguelAngelRG commented 7 years ago

@tmsimont Firstly, the format file depends on how you want to process the training files. If you want to use lines or words, the format file will be different. Secondly, I think the problem you are having is that the path is not correct (I guess). Also, are you using bidmat or bidmach to execute Word2Vec? You need bidmach. I think there may be some libraries that have not been included in the path; could that be it?

tmsimont commented 7 years ago

I've tried different training files and double-checked paths, etc. The core of the problem I'm facing now seems to be the failure of this: val words0 = CSMat(loadSBMat(dir + "sentence.sbmat.gz"));

When that loadSBMat() function hits the output of tparse2 it chokes:

java.lang.NegativeArraySizeException
  at BIDMat.SBMat$.SnoRows(SBMat.scala:144)
  at BIDMat.HMat$.loadSBMat(HMat.scala:999)
  at BIDMat.HMat$.loadSBMat(HMat.scala:980)
  at BIDMat.MatFunctions$.loadSBMat(MatFunctions.scala:2021)
  ... 54 elided

I'm assuming the fmt.txt file I've used is incorrect, which again leaves me back where I started. What is this file supposed to look like?

@MiguelAngelRG What data set and fmt.txt are you using? Does your data file contain line breaks? Are the sentences delimited by period or line break, or are they all mashed together into 1 line (like the text8 file)?

MiguelAngelRG commented 7 years ago

@tmsimont I believe that the problem you are having is the fmt.txt file. In my case I am using a text file and I am considering each word instead of complete lines or groups of words. Of course my data contains line breaks "\n". If you look at the implementation of tparse2, there is a d parameter (delimiter parameter) that you can use to specify this. In addition, be sure that the file you are generating with tparse2 fits the input of the instruction that you mentioned.

tmsimont commented 7 years ago

@jcanny Hey I see you were just making changes to the w2v scripts: https://github.com/BIDData/BIDMach/commit/3addc346bfc7b8c40e84c6b1824773ea179d2ce6

EDIT -- (accidentally submitted post before finishing) Can you share your fmt.txt that you are using in these scripts on the 1-billion-word benchmark?

jcanny commented 7 years ago

Hi, I just checked in the missing fmt.txt. It's in BIDMach/data/word2vec/raw. If you pull from master, you will get it. Sorry, it's not part of our normal test set and didn't get included.

Everything else should be working. From BIDMach/scripts, run ./getw2vdata.sh to download and prepare the data files.

Then also from BIDMach/scripts, you can run ../bidmach testword2vec.ssc to build a W2V model pair.

I just tried both scripts and they worked fine on a vanilla ubuntu 14.04 machine.

With a Titan-X (Maxwell) and i7-2660 CPU, I just saw 8m words/sec on that dataset and script.

With a Titan-X Pascal GPU and i7-5930 CPU, I'm getting 11m words/sec on the same dataset.

tmsimont commented 7 years ago

@jcanny THANK YOU! That does the trick. I will play around and report back here any more questions/issues related to this and the speed I get, too. I'm guessing @MiguelAngelRG is having speed issues because of the way fmt.txt was built?

jcanny commented 7 years ago

fmt.txt contains separators for words on the lines of the input file. It looks something like this:

string sentence
<tab><space>,./?()%$#

String specifies a format with multiple words per line of input. "sentence" is the prefix of the data files that are created. The output is a 2 x m integer matrix with a word id and a sentence number in each column. The separators on the second line of the fmt file specify how to tokenize the input and help avoid junk words in the dictionary. If there is no space on the second line, it won't split at spaces and there will be some very strange long tokens in the dictionary.
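
(For anyone sanity-checking that output from the ./bidmach shell, a minimal sketch along these lines should show the dictionary and the 2 x m token matrix; the directory and file names follow the text8 example later in this thread and are assumptions for other datasets:)

val dir  = "../data/word2vec/t8tokenized/";              // assumed tparse2 output directory
val dict = CSMat(loadSBMat(dir + "sentence.sbmat.gz"));  // dictionary written by tparse2
val toks = loadIMat(dir + "t8lines_sentence.imat.gz");   // 2 x m: row 0 = word id, row 1 = sentence number
(dict.length, toks.nrows, toks.ncols)                    // dictionary size, 2, and total token count
toks(?, 0->5)                                            // peek at the first five (word id, sentence) columns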

jcanny commented 7 years ago

Also @MiguelAngelRG, as with the standard word2vec, you'll get better performance if your dictionary is sorted in descending order of word frequency. That gives better utilization of the GPU's two caches.
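
(For reference, the reordering that does this is already in getw2vdata.ssc; roughly, with file names taken from the text8 variant quoted later in this thread:)

val dir    = "../data/word2vec/t8tokenized/";              // assumed tparse2 output directory
val words0 = CSMat(loadSBMat(dir + "sentence.sbmat.gz"));  // raw dictionary from tparse2
val cnt0   = loadIMat(dir + "sentence.cnt.imat.gz");       // matching word counts
val (vv, ii) = sortdown2(cnt0);                            // vv = counts sorted descending, ii = permutation
val words  = words0(ii);                                   // dictionary in descending-frequency order
val map0   = invperm(ii);                                  // old word id -> new word id, used to renumber the token matrices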

MiguelAngelRG commented 7 years ago

@jcanny to perform my experiments I have used the scripts that you have put in the github repository: https://github.com/BIDData/BIDMach/blob/master/scripts/getw2vdata.ssc#L54 https://github.com/BIDData/BIDMach/blob/master/scripts/testword2vec.ssc https://github.com/BIDData/BIDMach/blob/master/scripts/testword2vecp.ssc As you know, these scripts contain the call "val (vv, ii) = sortdown2(cnt0);", which I think is the instruction that you refer to. However, the performance is incredibly bad. To give you an example: to get the embeddings for a vocabulary of 10868 words, BIDMach requires 45 minutes, whereas other implementations like Gensim or TensorFlow achieve the same goal in less than 15 minutes. So, could you please tell me if there is something that I forgot to take into account?

Thank you very much in advance.

MiguelAngelRG commented 7 years ago

@tmsimont please, if you are able to get the same performance that was published in the paper, let me know, because I am working intensively to find out what the problem is. Thank you very much in advance.

jcanny commented 7 years ago

@MiguelAngelRG, please make a detailed issue report. Here are some guidelines: https://sifterapp.com/blog/2012/08/tips-for-effectively-reporting-bugs-and-issues/ We need some basic information to figure out what's not working, and to be able to reproduce it. It sounds like a setup/installation problem, so the more information you can give us, the better chance we have of figuring out what's going on. At a minimum, include a trace of the output of the learner while it's running.

If you have trouble with any other benchmarks, please provide a detailed report - at least the output trace, and how it was different from your expectations.

tmsimont commented 7 years ago

@jcanny regarding the word2vec example in the test script you posted... I see there is saveGoogleW2V in the Word2Vec.scala file. I think I have got the example you uploaded re-worked to take in the text8 corpus, but I'd like to evaluate the vectors with Hyperwords to be sure. Can you explain how this function is supposed to be used? I'm trying to do the following with no success:

import BIDMach.networks.Word2Vec

val mdir = "../data/word2vec/data/"

val (nn, opts) = Word2Vec.learner(mdir+"t8lines.imat.lz4");

opts.nstart = 0;
opts.nend = 7;
opts.npasses = 4;
opts.batchSize = 1000000;
opts.lrate = 1e-4f
opts.vexp = 0.5f
opts.nreuse = 5 
opts.dim = 300 
opts.vocabSize = 1000000

opts.useGPU = true;
//opts.autoReset = false;
//Mat.useMKL = false;

nn.train

val mod = nn.model.asInstanceOf[Word2Vec]

Word2Vec.saveGoogleW2V(loadCSMat(mdir+"t8lines.imat.lz4"), FMat(mod.modelmats(0)), "../data/word2vec/vectors-f8.txt", true);

jcanny commented 7 years ago

That looks right. But are you sure you want a Google format file?

tmsimont commented 7 years ago

Hmm perhaps not. All I really want is the input binary for the vectors and the vocabulary. How would I generate the input for Omar Levy's Hyperwords?

Something similar to the original Word2Vec's -binary 1 -output vectors.bin -save-vocab vocab.txt arguments?
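
(A rough sketch of producing a vocab.txt-style file, one "word count" line per entry in the same order as the saved vectors, from the dictionary and count files written by getw2vdata.ssc; the dict.dmat.lz4 name is an assumption based on the text8 script later in this thread:)

import java.io.PrintWriter
val mdir = "../data/word2vec/data/";
val dict = loadCSMat(mdir + "dict.csmat.lz4");   // dictionary, sorted by descending frequency
val cnts = loadDMat(mdir + "dict.dmat.lz4");     // matching counts (assumed saved alongside the dictionary)
val out  = new PrintWriter(mdir + "vocab.txt");
var i = 0;
while (i < dict.length) {                        // one "word count" line per vocabulary entry
  out.println(dict(i) + " " + cnts(i).toLong);
  i += 1;
}
out.close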

tmsimont commented 7 years ago

I've tried to adjust the saveGoogleW2V call and get this:

  at BIDMat.DenseMat.apply(DenseMat.scala:58)
  at BIDMach.networks.Word2Vec$.saveGoogleW2V(Word2Vec.scala:1360)

Adjusted call:

val mod = nn.model.asInstanceOf[Word2Vec]
Word2Vec.saveGoogleW2V(loadCSMat(mdir+"t8dict.csmat.lz4"), FMat(mod.modelmats(0)), "../data/word2vec/vectors-f8.txt", true);

The t8dict.csmat.lz4 file comes from a script similar to getw2vdata.ssc, where I call:

val dir = "../data/word2vec/t8tokenized/";
val words0 = CSMat(loadSBMat(dir + "sentence.sbmat.gz"));
val cnt0 = loadIMat(dir + "sentence.cnt.imat.gz");
val (vv, ii) = sortdown2(cnt0);
val words = words0(ii);
saveCSMat("../data/word2vec/data/t8dict.csmat.lz4", words);
saveDMat("../data/word2vec/data/t8dict.dmat.lz4", DMat(vv));
val map0 = invperm(ii);

repartition(dir+ "t8lines_sentence.imat.gz",
            "../data/word2vec/data/t8lines.imat.lz4",
            map0,
            0, 1, size);

tmsimont commented 7 years ago

Sorry -- last post was missing the full error:

mod: BIDMach.networks.Word2Vec = BIDMach.networks.Word2Vec@3035c979
java.lang.IndexOutOfBoundsException: 253854 >= (253854)
  at BIDMat.DenseMat.apply(DenseMat.scala:58)
  at BIDMach.networks.Word2Vec$.saveGoogleW2V(Word2Vec.scala:1360)
  ... 54 elided

jcanny commented 7 years ago

We haven't touched the word2vec code for a while. At that time, Google word2vec used to output a mixed ascii-binary file, which is what saveGoogleWord2Vec tries to do. Hopefully that's deprecated now.

If you tell me what's supposed to be in vectors.bin and vocab.txt, it should be easy to reproduce them.

tmsimont commented 7 years ago

@jcanny The vocab.txt file is written here: https://github.com/svn2github/word2vec/blob/master/word2vec.c#L298 It looks like for each word there is a line containing the string representation of the word followed by its count.

The binary file is written as follows:

for (a = 0; a < vocab_size; a++) {
      fprintf(fo, "%s ", vocab[a].word);
      if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo);
      else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]);
      fprintf(fo, "\n");
}

This does seem to be pretty much what your saveGoogleW2V function does, right?
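
(As an aside, a plain-Scala sketch of the non-binary branch above, writing one "word v1 v2 ... vN" line per vocabulary entry plus the usual word2vec header line, might look like this, with mdir and mod as in the script earlier in this thread; the assumption that the columns of modelmats(0) are the word vectors is mine and may need transposing:)

import java.io.PrintWriter
val dict = loadCSMat(mdir + "dict.csmat.lz4");   // dictionary in descending-frequency order
val vecs = FMat(mod.modelmats(0));               // assumed dim x vocab; transpose if the layout differs
val n    = math.min(dict.length, vecs.ncols);    // guard against opts.vocabSize < dictionary size
val out  = new PrintWriter(mdir + "vectors.txt");
out.println(n + " " + vecs.nrows);               // header: vocabulary size and vector dimension
var i = 0;
while (i < n) {
  val sb = new StringBuilder(dict(i));
  var j = 0;
  while (j < vecs.nrows) { sb.append(" " + vecs(j, i)); j += 1; }
  out.println(sb.toString);
  i += 1;
}
out.close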

jcanny commented 7 years ago

Yes, I just tried saveGoogleWord2Vec and it seemed to work fine. I updated the test script to call it.

Did you check the size of your vocabulary? e.g. do "words.length" from the command line.

Make sure opts.vocabSize is not larger than your actual dictionary size; a larger value will cause problems.
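
(In script form, guarding against that might look like this, with mdir as in the scripts above; the dict.csmat.lz4 path is from getw2vdata.ssc:)

val dict = loadCSMat(mdir + "dict.csmat.lz4");            // the dictionary the learner will use
opts.vocabSize = math.min(opts.vocabSize, dict.length);   // never exceed the actual dictionary size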

tmsimont commented 7 years ago

Getting closer!

There are still a few things that are a little confusing here:

1) How can I get the corresponding vocab.txt? (for each word in vocab: string count\n in an order matching the vectors.bin output). I can see how I could do this by creating a function similar to that saveGoogleW2V function, but I feel like this data is already here somewhere... maybe in one of these lz4 files? (When I try to decompress those lz4's with lz4 -d I get an error about a bad header...)

2) What is that words.length value? Intuitively it seems like this is a count of the number of words, but something is odd here... The output of words.length is 253854 -- that seems reasonable for the text8 corpus, but when I run wc -l output.bin on the output of saveGoogleW2V, the output vector file has 1331418 lines -- that's a lot more than the number of words in the vocabulary, so it seems something is wrong?

EDIT: nevermind on that first question... I can work around this.

tmsimont commented 7 years ago

My apologies... The line count is disrupted by the binary output. If I pass in false to the binary parameter it seems to work. Almost there...

tmsimont commented 7 years ago

I tried the test and it works, but the resulting vectors have terrible accuracy in the Hyperwords tests.

The word similarity score on the text8 set is only 1.4%, and word analogy on text8 is 0%.

For the billion-word data set it's only 24% on the word similarity test and 7.5% on the word analogy test.

By contrast, the original C implementation has the following scores: text8: 64% word similarity, 17% analogy; billion word: 70% word similarity, 73.5% analogy.

(more about these tests in Intel's paper on word2vec and Omar's Bitbucket page)

Have you used anything like this to evaluate the output in the past? Accuracy evaluation wasn't in the "Machine learning at the limit" paper, so I've been curious about these metrics on the output of something created in batches.

I see this in the output of ./bidmach testword2vec.ssc:

score: BIDMat.FMat = -1.1464

I'm not sure what that means, but is that consistent with what you're getting? Could I be doing something wrong with my parameters to get such bad results?

Here's the script that I ran (should be right out of the repo, but I did change from binary to non-binary format on save):

import BIDMach.networks.Word2Vec

val mdir = "../data/word2vec/data/"

val (nn, opts) = Word2Vec.learner(mdir+"train%05d.imat.lz4");

opts.nstart = 0;
opts.nend = 7;
opts.npasses = 4;
opts.batchSize = 1000000;
opts.lrate = 1e-3f
opts.vexp = 0.5f
opts.nreuse = 5
opts.dim = 300
opts.vocabSize = 100000

opts.useGPU = true;
//opts.autoReset = false;
//Mat.useMKL = false;

nn.train

val mod = nn.model.asInstanceOf[Word2Vec]

//saveFMat(mdir+"model0.fmat.lz4", FMat(mod.modelmats(0)))

//saveFMat(mdir+"model1.fmat.lz4", FMat(mod.modelmats(1)))

val test = loadIMat(mdir+"test00000.imat.lz4");

val (mm,mopts) = Word2Vec.predictor(mod, test);

mopts.useGPU = opts.useGPU
mm.predict

val score = mean(mm.results(0,0->(mm.results.ncols-2)));

val dict = loadCSMat(mdir+"dict.csmat.lz4");

Word2Vec.saveGoogleW2V(dict, FMat(mod.modelmats(0)), mdir+"googmodel.bin", false);

jcanny commented 7 years ago

Most likely it's a difference in tokenization/vocabulary. Let me know exactly what you ran to get scores for both systems, so I can resolve the differences.

tmsimont commented 7 years ago

To generate the billion word results I just ran your test scripts. The only thing I changed was the 3rd parameter on the saveGoogleW2V call to false.

After that, I ran this shell script:

code=hyperwords
src_model=$1
model=model

cp $src_model $model.words

python2 $code/hyperwords/text2numpy.py $model.words

echo "WS353 Results"
echo "-------------"
python2 $code/hyperwords/ws_eval.py embedding $model $code/testsets/ws/ws353.txt
echo

echo "Google Analogy Results"
echo "----------------------"
python2 $code/hyperwords/analogy_eval.py embedding $model $code/testsets/analogy/google.txt
echo

(called as ./eval.sh googmodel.bin)

Note this script calls on the hyperwords python code (point the code variable to the source)

This test script is from Intel's research repo.

I modified it slightly to get it working on my system

tmsimont commented 7 years ago

@jcanny Just curious -- what did you use to evaluate accuracy in the past? I'm thinking the poor accuracy of the model comes from the large batch size, but smaller batch sizes yield an error:

nn: BIDMach.Learner = Learner(BIDMach.datasources.FileSource@38cdca5b,BIDMach.networks.Word2Vec@4ea0397f,null,null,null,BIDMach.networks.Word2Vec$FDSopts@791ee92c)
opts: BIDMach.networks.Word2Vec.FDSopts = BIDMach.networks.Word2Vec$FDSopts@791ee92c
opts.nstart: Int = 0
opts.nend: Int = 7
opts.npasses: Int = 4
opts.batchSize: Int = 100
opts.lrate: BIDMat.FMat = 0.00010000
opts.vexp: BIDMat.FMat = 0.50000
opts.nreuse: Int = 5
opts.dim: Int = 200
opts.vocabSize: Int = 253854
opts.useGPU: Boolean = false
nmmats = 1
pass= 0
scala.MatchError:    4065      1    584      2    432   4586   7919      2      0    131      0     66     42   4238     10      0...
  30000  30000  30000  30000  30000  30000  30000  30000  30000  30000  30000  30000  30000  30000  30000  30000...
 (of class BIDMat.IMat)
  at BIDMach.models.Model$$anonfun$copyMats$1.apply$mcVI$sp(Model.scala:285)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
  at BIDMach.models.Model.copyMats(Model.scala:255)
  at BIDMach.models.Model.dobatchg(Model.scala:200)
  at BIDMach.Learner.nextPass(Learner.scala:145)
  at BIDMach.Learner.firstPass(Learner.scala:114)
  at BIDMach.Learner.retrain(Learner.scala:88)
  at BIDMach.Learner.train(Learner.scala:75)
  ... 54 elided
mod: BIDMach.networks.Word2Vec = BIDMach.networks.Word2Vec@4ea0397f

Any idea what is causing that error (or the poor accuracy)?

tmsimont commented 7 years ago

@jcanny More problems to report...

I'm now trying to run this on the CPU for comparison to the poor results from the GPU output, and I see that the fmt.txt file you uploaded seems to allow some junk data to appear in the output file. I see this in the model output:

estimated
competition
<FD>
Corp
ways
begin

I'm not sure what that <FD> is but it appears a few times and as a result a few of the saved vectors are treated as having only 299 elements instead of 300. This breaks the Hyperwords test.

I'm assuming that the fmt.txt is treating some special characters as words. I'm still mystified about what exactly is going on with this fmt.txt file you uploaded:

,.()&"$£€

It looks kind of like regex?

To be honest, I just want to verify that your model performs at the speed reported in "Machine Learning at the Limit" (or better on a newer GPU) and that it produces a usable set of vectors. I am almost convinced that it performs quickly, but I'm seeing really poor results in the output. The accuracy of the GPU-generated model is near 0 in the Hyperwords tests.

How have you previously verified accuracy of the output? (This wasn't reported in the paper)

Is this the fmt.txt file you used for the paper? It seems to not work.

I'm starting to get the sense that this implementation is flawed and does not produce the kind of word vectors that Word2Vec should produce, unless there is something in the fmt.txt file that is throwing the results off.