Open fanganjie opened 8 years ago
Hi everyone, I am trying to use word2vec with my own corpus but I have some issues when I try to load it. Could I have a small example to check what is wrong with mine?
Thanks in advance,
Miguel Angel.
I'm also not sure how to get Word2Vec working with BIDMach, so any help here would be appreciated.
I found a test file that seems to point to some example of Word2Vec use: https://github.com/BIDData/BIDMach/blob/07969c1cc95180ac0b3d8a59e299de8359cb7dec/scripts/getw2vdata.sh
However, this also seems to depend on tparse2.exe
What is tparse2.exe? Is there something that would work in its place on linux?
Hi Tmsimont,
tparse2.exe is an executable compiled from https://github.com/BIDData/BIDMach/blob/07969c1cc95180ac0b3d8a59e299de8359cb7dec/src/main/C/newparse/tparse2.cpp
Its main purpose is to convert the representation of the words into a two-row matrix, which is exactly the representation that you need in order to use this library.
Hmm, after using maven to build, I see that I have tparse2.exe compiled. When I try to run this script, however, I still get errors. It seems that fmt.txt is missing. Any idea where that comes from?
Never mind... it seems just about everything in that shell script is out of date or missing files. This seems to be leading me further away from figuring out how to get this to work.
I'd love to know just how to get a simple training process running on the text8 corpus, or any corpus for that matter. Is there no documentation anywhere on how to get word2vec up and running?
Hi Tmsimont, The "fmt.txt" is a format file; you have more information about it on this wiki page: https://github.com/BIDData/BIDMach/wiki/Data-Wrangling The aim of this file is to tell tparse2.exe what the format of your text is.
Yes, I understand how you feel; I have spent a lot of time trying to understand how this library works. It is quite painful. At first I tried to compile it by myself, but it was impossible: too many dependencies, libraries and so on.
I hope this helps you
Sorry about this. If you're looking for installation instructions, the way to do so is using maven now. See the Installing and Running wiki.
Unfortunately I've never used word2vec, and @jcanny and the rest of us are busy with papers at the moment.
Thanks @MiguelAngelRG I'll take a look.
@DanielTakeshi the installation isn't really a problem. It's actually using BIDMach to learn word vectors that's the issue. Once BIDMach is installed I can run ./bidmach and get a scala interpreter... but then what do you do to get Word2Vec to train on a corpus?
Specifically, in the paper Machine Learning at the Limit -- there is a report of 8.5M words/s on an NVIDIA Titan-X. I'm trying to reproduce this on a Tesla P100 to see what the speeds are, but I have no idea how BIDMach was used to train a data set so quickly, what data was trained at that speed, what format the data was in, etc.
I'm going through a lot of BIDMach documents but I don't see much on how word2vec can be used within the framework.
@DanielTakeshi thanks for the clarification. It is working now, but it took me a lot of time to solve it. The problem was that I only wanted to use it, and the documentation does not say that it is required to install it in order to use it. That is why this was my last option. But then, after installing, it was "easy" to make it work. On the other hand, I think it would be nice to have more documentation about how to work with this Word2Vec, especially with the parallel version. @tmsimont no worries, it has been a pleasure. I hope I helped you.
Thank you @DanielTakeshi
@jcanny @DanielTakeshi I have one question: how many cores do I need to get the same performance that you got in the paper titled "Machine Learning at the Limit"? I am doing some experiments with 2 or 3 nodes with 14x2 threads each + 3 GPUs, and the BIDMach implementation needs a lot of time to generate the embeddings.
@tmsimont have you been able to achieve the speed that is reported in "Machine Learning at the Limit"? I am using better GPUs and more cores and I have not even come close to that performance. I have checked other frameworks and they are faster than BIDMach. :(
@MiguelAngelRG No -- I have not been able to get anything to run... I have written a few of my own custom kernels, and no matter what I do I can only get about 6M words per second on a Tesla P100. I'm looking at the CUDA kernel code in BIDMach and I'm not sure how it would squeeze out an additional 2M words/sec, unless some parameters are much different (fewer negative samples, or something else...). It's hard to say, since many parameters were not reported in the paper, and I can't figure out how to get a basic example to run.
@tmsimont I do not know either. To make it work, these are the scripts that I have used:
@MiguelAngelRG Thanks -- I was toying with these files, too, but couldn't figure out what the contents of fmt.txt should be. I know this file designates the format of the data that Word2Vec needs, but I'm not sure what it should be. What does your fmt.txt look like?
@tmsimont it depends on the text that you want to use. Here you can see the different parameters that you can use to build your format file.
@tmsimont Also, here are different format files that they provide: https://github.com/BIDData/BIDMach/blob/master/data/rcv1_fmt.txt https://github.com/BIDData/BIDMach/blob/master/data/uci_fmt.txt https://github.com/BIDData/BIDMach/blob/master/data/uci_wfmt.txt
I hope this helps you.
Thanks, @MiguelAngelRG
I saw these files you've linked to previously and it doesn't seem like any of them would properly describe the input structure of the typical Word2Vec training data set.
The 1-billion-word files that are linked in that shell script are in the form of a single sentence per line. (As are many other similar training sets)
I've read these a few times now: https://github.com/BIDData/BIDMach/wiki/Data-Wrangling#word-fields https://github.com/BIDData/BIDMach/wiki/Data-Wrangling#string-fields
It says for a schema "word sometext":
"The word field type produces a single numeric token for each field, using a dictionary. e.g. 'the white house' encodes as the literal string 'the white house'."
So does that mean each sentence has its own token? That doesn't really make any sense for Word2Vec. Word2Vec should tokenize the words in a sentence and then learn by examining the surrounding context of each word.
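For reference, the training examples Word2Vec actually learns from are (center, context) word pairs taken from a sliding window over each sentence, which is why per-sentence tokens wouldn't make sense. A minimal sketch of that windowing (plain Python for illustration, not BIDMach code):

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) word pairs from one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip pairing a word with itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the white house".split()
print(context_pairs(sentence))  # 6 (center, context) pairs for a 3-word sentence
```

Each word in the sentence contributes one pair per neighbor inside the window, which is the behavior a "one token per sentence" encoding could not support.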
Furthermore... what is "sometext"? The uci_wfmat.txt file says "word term" and not "word sometext"... What does this mean?
I tried using the format in uci_wfmat.txt, and the getw2vdata.sh script seems to hang after
...
Processing news.en-00097-of-00100
305532 lines processed
Processing news.en-00098-of-00100
306180 lines processed
Processing news.en-00099-of-00100
305893 lines processed
Were you able to get past the tparse2 call in that script? I'm not sure what is going on when the script stalls out on me.
Oops, I posted too soon. The script does finally get past the tparse execution...
Then it fires off getw2vdata.ssc, only to trigger a bunch of file-not-found errors... It looks like it's searching for sbmat.gz files that were not created during the tparse execution... oy...
OK... so... "word term" apparently named some of the outputs of tparse as ...term.. and term... So I changed those to match the format I see in getw2vdata.ssc, and lo and behold the File Not Found errors disappear... only to yield more errors:
Loading getw2vdata.ssc...
Switched off result printing.
java.lang.NegativeArraySizeException
at BIDMat.SBMat$.SnoRows(SBMat.scala:144)
at BIDMat.HMat$.loadSBMat(HMat.scala:999)
at BIDMat.HMat$.loadSBMat(HMat.scala:980)
at BIDMat.MatFunctions$.loadSBMat(MatFunctions.scala:2021)
... 54 elided
<console>:29: error: not found: value words0
val words = words0(ii);
^
<console>:29: error: not found: value words
saveCSMat("../data/word2vec/data/dict.csmat.lz4", words);
^
<console>:30: error: not found: value words
val map1 = dict2 --> Dict(words);
^
java.lang.RuntimeException: col index out of range 1 1
at BIDMat.DenseMat$mcI$sp.gapply$mcI$sp(DenseMat.scala:456)
at BIDMat.DenseMat$mcI$sp.gapply$mcI$sp(DenseMat.scala:498)
at BIDMat.IMat.apply(IMat.scala:96)
at .repartition(<console>:47)
... 60 elided
<console>:36: error: not found: value map1
map1,
^
Switched on result printing.
Any idea what that is about?
@tmsimont Firstly, the format file depends on how you want to process the training files. If you want to use lines or words, the format file will be different. Secondly, I think the problem you are having is that the path is not correct (I guess). Also, are you using BIDMat or BIDMach to execute Word2Vec? You need BIDMach. Could it be that there are some libraries that have not been included in the path?
I've tried different training files and double-checked paths, etc. The core of the problem I'm facing now seems to be the failure of this: val words0 = CSMat(loadSBMat(dir + "sentence.sbmat.gz"));
When that loadSBMat() function hits the output of tparse2, it chokes:
java.lang.NegativeArraySizeException
at BIDMat.SBMat$.SnoRows(SBMat.scala:144)
at BIDMat.HMat$.loadSBMat(HMat.scala:999)
at BIDMat.HMat$.loadSBMat(HMat.scala:980)
at BIDMat.MatFunctions$.loadSBMat(MatFunctions.scala:2021)
... 54 elided
I'm assuming the fmt.txt file I've used is incorrect, which again leaves me back where I started. What is this file supposed to look like?
@MiguelAngelRG What data set and fmt.txt are you using? Does your data file contain line breaks? Are the sentences delimited by a period or a line break, or are they all mashed together into one line (like the text8 file)?
@tmsimont I believe the problem you are having is the fmt.txt file. In my case I am using a text file and I am considering each word, instead of complete lines or groups of words. Of course my data contains line breaks "\n". If you look at the implementation of tparse2, there is a d parameter (delimiter parameter) that you can use to specify this. In addition, be sure that the file you are generating by using tparse2 fits the input of the instruction that you mentioned.
@jcanny Hey I see you were just making changes to the w2v scripts: https://github.com/BIDData/BIDMach/commit/3addc346bfc7b8c40e84c6b1824773ea179d2ce6
EDIT -- (accidentally submitted post before finishing)
Can you share the fmt.txt that you are using in these scripts on the 1-billion-word benchmark?
Hi, I just checked in the missing fmt.txt. It's in BIDMach/data/word2vec/raw. If you pull from master, you will get it. Sorry, it's not part of our normal test set, and didn't get included.
Everything else should be working. From BIDMach/scripts, run ./getw2vdata.sh to download and prepare the data files.
Then also from BIDMach/scripts, you can run ../bidmach testword2vec.ssc to build a W2V model pair.
I just tried both scripts and they worked fine on a vanilla ubuntu 14.04 machine.
With a Titan-X (Maxwell) and i7-2660 CPU, I just saw 8m words/sec on that dataset and script.
With a Titan-X Pascal GPU and i7-5930 CPU, I'm getting 11m words/sec on the same dataset.
@jcanny THANK YOU! That does the trick. I will play around and report back here any more questions/issues related to this, and the speed I get, too. I'm guessing @MiguelAngelRG is having speed issues because of the way fmt.txt was built?
fmt.txt contains separators for words on the lines of the input file. It looks something like this:
string sentence
<tab><space>,./?()%$#
"string" specifies a format with multiple words per line of input. "sentence" is the prefix of the data files that are created. The output is a 2 x m integer matrix with a word id and a sentence number in each column. The separators on the second line of the fmt file specify how to tokenize the input and help avoid junk words in the dictionary. And if there is no space on the second line, it won't split at spaces and there will be some very strange long tokens in the dictionary.
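To make that output format concrete, here is a rough sketch of the transformation described above: split each input line on the separator characters, assign each distinct word a dictionary id, and emit one (word id, sentence number) column per token. This is plain illustrative Python, not the actual tparse2 logic:

```python
import re

SEPARATORS = "\t ,./?()%$#"  # analogous to the second line of fmt.txt

def tokenize_to_matrix(lines):
    """Return (dictionary, matrix) where matrix has two rows:
    row 0 = word ids, row 1 = sentence numbers, one column per token."""
    pattern = "[" + re.escape(SEPARATORS) + "]+"
    dictionary = {}
    ids, sent_nums = [], []
    for sent_no, line in enumerate(lines):
        for tok in re.split(pattern, line):
            if not tok:  # skip empty strings from leading/trailing separators
                continue
            wid = dictionary.setdefault(tok, len(dictionary))
            ids.append(wid)
            sent_nums.append(sent_no)
    return dictionary, [ids, sent_nums]

d, m = tokenize_to_matrix(["the white house,", "the red house"])
print(d)  # {'the': 0, 'white': 1, 'house': 2, 'red': 3}
print(m)  # [[0, 1, 2, 0, 3, 2], [0, 0, 0, 1, 1, 1]]
```

Without the space in SEPARATORS, "the white house" would survive as one long token, which matches the warning about strange long tokens in the dictionary.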
Also @MiguelAngelRG, similar to the standard word2vec, you'll get better performance if your dictionary is sorted in descending order of word frequency. That gives better utilization of the GPU's two caches.
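The frequency sort amounts to reassigning word ids so that id 0 is the most frequent word, plus a remapping table for rewriting the token stream. A hedged sketch in plain Python, mirroring what the sortdown2/invperm lines in getw2vdata.ssc appear to do (this is an illustration, not the BIDMach API):

```python
def sort_dictionary_by_frequency(words, counts):
    """Reassign word ids so the most frequent word gets id 0.
    Returns (sorted_words, sorted_counts, old_id -> new_id map)."""
    order = sorted(range(len(words)), key=lambda i: counts[i], reverse=True)
    remap = [0] * len(words)  # old id -> new id (analogous to invperm(ii))
    for new_id, old_id in enumerate(order):
        remap[old_id] = new_id
    sorted_words = [words[i] for i in order]
    sorted_counts = [counts[i] for i in order]
    return sorted_words, sorted_counts, remap

words = ["house", "the", "white"]
counts = [2, 5, 1]
print(sort_dictionary_by_frequency(words, counts))
# (['the', 'house', 'white'], [5, 2, 1], [1, 0, 2])
```

Every word id in the training matrix then gets rewritten through `remap`, so frequent words cluster at low ids where they cache well.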
@jcanny to perform my experiments I have used the scripts that you have put in the github repository: https://github.com/BIDData/BIDMach/blob/master/scripts/getw2vdata.ssc#L54 https://github.com/BIDData/BIDMach/blob/master/scripts/testword2vec.ssc https://github.com/BIDData/BIDMach/blob/master/scripts/testword2vecp.ssc As you know, these scripts have the function "val (vv, ii) = sortdown2(cnt0);", which I think is the instruction that you refer to. However, the performance is incredibly bad. To give you an example: to get the embeddings for a vocabulary of 10868 words, BIDMach requires 45 minutes, while other implementations like Gensim or TensorFlow achieve the same goal in less than 15 minutes. So, could you please tell me if there is something that I forgot to take into account?
Thank you very much in advance.
@tmsimont please, if you are able to get the same performance that was published in the paper, let me know, because I am working intensively to find out what the problem is. Thank you very much in advance.
@MiguelAngelRG, please make a detailed issue report. Here are some guidelines: https://sifterapp.com/blog/2012/08/tips-for-effectively-reporting-bugs-and-issues/ We need some basic information to figure out what's not working, and to be able to reproduce it. It sounds like a setup/installation problem, so the more information you can give us, the better chance we have of figuring out what's going on. At a minimum, include a trace of the output of the learner while it's running.
If you have trouble with any other benchmarks, please provide a detailed report - at least the output trace, and how it was different from your expectations.
@jcanny regarding the word2vec example in the test script you posted...
I see there is a saveGoogleW2V function in the Word2Vec.scala file. I think I have got the example you uploaded re-worked to take in the text8 corpus, but I'd like to evaluate the vectors with Hyperwords to be sure. Can you explain how this function is supposed to be used? I'm trying the following with no success:
import BIDMach.networks.Word2Vec
val mdir = "../data/word2vec/data/"
val (nn, opts) = Word2Vec.learner(mdir+"t8lines.imat.lz4");
opts.nstart = 0;
opts.nend = 7;
opts.npasses = 4;
opts.batchSize = 1000000;
opts.lrate = 1e-4f
opts.vexp = 0.5f
opts.nreuse = 5
opts.dim = 300
opts.vocabSize = 1000000
opts.useGPU = true;
//opts.autoReset = false;
//Mat.useMKL = false;
nn.train
val mod = nn.model.asInstanceOf[Word2Vec]
Word2Vec.saveGoogleW2V(loadCSMat(mdir+"t8lines.imat.lz4"), FMat(mod.modelmats(0)), "../data/word2vec/vectors-f8.txt", true);
That looks right. But are you sure you want a Google format file?
Hmm, perhaps not. All I really want is the binary file for the vectors, plus the vocabulary. How would I generate the input for Omar Levy's Hyperwords? Something similar to the original Word2Vec's -binary 1 -output vectors.bin -save-vocab vocab.txt arguments?
I've tried to adjust the saveGoogleW2V call and get this:
at BIDMat.DenseMat.apply(DenseMat.scala:58)
at BIDMach.networks.Word2Vec$.saveGoogleW2V(Word2Vec.scala:1360)
Adjusted call:
val mod = nn.model.asInstanceOf[Word2Vec]
Word2Vec.saveGoogleW2V(loadCSMat(mdir+"t8dict.csmat.lz4"), FMat(mod.modelmats(0)), "../data/word2vec/vectors-f8.txt", true);
The t8dict.csmat.lz4 file comes from a script similar to getw2vdata.ssc, where I call:
val dir = "../data/word2vec/t8tokenized/";
val words0 = CSMat(loadSBMat(dir + "sentence.sbmat.gz"));
val cnt0 = loadIMat(dir + "sentence.cnt.imat.gz");
val (vv, ii) = sortdown2(cnt0);
val words = words0(ii);
saveCSMat("../data/word2vec/data/t8dict.csmat.lz4", words);
saveDMat("../data/word2vec/data/t8dict.dmat.lz4", DMat(vv));
val map0 = invperm(ii);
repartition(dir+ "t8lines_sentence.imat.gz",
"../data/word2vec/data/t8lines.imat.lz4",
map0,
0, 1, size);
Sorry -- last post was missing the full error:
mod: BIDMach.networks.Word2Vec = BIDMach.networks.Word2Vec@3035c979
java.lang.IndexOutOfBoundsException: 253854 >= (253854)
at BIDMat.DenseMat.apply(DenseMat.scala:58)
at BIDMach.networks.Word2Vec$.saveGoogleW2V(Word2Vec.scala:1360)
... 54 elided
We haven't touched the word2vec code for a while. At that time, Google word2vec used to output a mixed ascii-binary file, which is what saveGoogleWord2Vec tries to do. Hopefully that's deprecated now.
If you tell me what's supposed to be in vectors.bin and vocab.txt, it should be easy to reproduce them.
@jcanny
The vocab.txt file is written here:
https://github.com/svn2github/word2vec/blob/master/word2vec.c#L298
It looks like for each word, there is a line containing the word string followed by its count.
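If it helps, producing such a vocab.txt is just one "word count" line per vocabulary entry, most frequent first to match the vector file order. A minimal sketch (plain Python, not tied to BIDMach's matrices):

```python
def write_vocab(path, words, counts):
    """Write one "word count" line per vocabulary entry."""
    with open(path, "w") as fo:
        for word, count in zip(words, counts):
            fo.write("%s %d\n" % (word, count))

write_vocab("vocab.txt", ["the", "house", "white"], [5, 2, 1])
print(open("vocab.txt").read())
# the 5
# house 2
# white 1
```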
The binary file is written as follows:
for (a = 0; a < vocab_size; a++) {
  fprintf(fo, "%s ", vocab[a].word);
  if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo);
  else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]);
  fprintf(fo, "\n");
}
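To sanity-check that mixed ascii/binary layout, here is a sketch of a reader for it in plain Python. I'm assuming `real` is a 4-byte float and that the full file starts with a "vocab_size dim" header line (written elsewhere in word2vec.c, just before this loop):

```python
import struct

def read_word2vec_binary(path):
    """Read a word2vec '-binary 1' style file into a {word: [float, ...]} dict."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())  # header: "vocab_size dim"
        for _ in range(vocab_size):
            word = bytearray()
            while True:                 # the word ends at a single space
                ch = f.read(1)
                if ch == b" ":
                    break
                word += ch
            vec = struct.unpack("%df" % dim, f.read(4 * dim))
            vectors[word.decode("utf-8")] = list(vec)
            f.read(1)                   # consume the trailing "\n"
    return vectors

# round-trip check against a tiny hand-written file in the same layout
with open("tiny.bin", "wb") as f:
    f.write(b"2 3\n")
    f.write(b"the " + struct.pack("3f", 1.0, 2.0, 3.0) + b"\n")
    f.write(b"house " + struct.pack("3f", 0.5, 0.25, 0.125) + b"\n")
print(read_word2vec_binary("tiny.bin"))
# {'the': [1.0, 2.0, 3.0], 'house': [0.5, 0.25, 0.125]}
```

This also explains the later confusion with wc -l: the binary float payload can contain 0x0a bytes, so line counts on the binary file are meaningless.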
This does seem to be pretty much what your saveGoogleW2V function does, right?
Yes, I just tried saveGoogleWord2Vec and it seemed to work fine. I updated the script to call it in the test script.
Did you check the size of your vocabulary? e.g. do "words.length" from the command line.
Make sure opts.vocabSize is not larger than your actual dictionary size; that will cause problems.
Getting closer!
There are still a few things that are a little confusing here:
1) How can I get the corresponding vocab.txt? (For each word in vocab: "string count\n", in an order matching the vectors.bin output.) I can see how I could do this by creating a function similar to that saveGoogleW2V function, but I feel like this data is already here somewhere... maybe in one of these lz4 files? (When I try to decompress those lz4 files with lz4 -d I get an error about a bad header...)
2) What is that words.length value? Intuitively it seems like this is a count of the number of words, but something is odd here. The output of words.length is 253854 -- that seems reasonable for the text8 corpus, but when I run wc -l output.bin on the output of saveGoogleW2V, the output vector file has 1331418 lines -- that's a lot more than the number of words in the vocabulary, so it seems something is wrong?
EDIT: nevermind on that first question... I can work around this.
My apologies... the line count is disrupted by the binary output. If I pass false to the binary parameter it seems to work. Almost there...
I tried the test and it works, but the resulting vectors have terrible accuracy in the Hyperwords tests.
The word similarity score on the text8 set is only 1.4%; word analogy on text8 is 0%.
For the billion-word data set it's only 24% on the word similarity test and 7.5% on the word analogy test.
By contrast, the original C implementation has the following scores: text8: 64% word similarity, 17% analogy; billion word: 70% word similarity, 73.5% analogy.
(more about these tests in Intel's paper on word2vec and Omar's bitbucket page)
Have you used anything like this to evaluate the output in the past? Accuracy evaluation wasn't in the "Machine learning at the limit" paper, so I've been curious about these metrics on the output of something created in batches.
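For context on what those percentages measure: the Hyperwords word-similarity test scores each word pair by cosine similarity of their vectors, then reports the Spearman correlation with human similarity ratings. A rough self-contained sketch of that scoring (pure Python; the real ws_eval.py uses numpy and handles ties and out-of-vocabulary words):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman correlation via Pearson on ranks (no ties handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx) *
                    sum((b - my) ** 2 for b in ry))
    return num / den

# model similarity scores vs. human ratings for three hypothetical word pairs
model_scores = [cosine([1, 0], [1, 0.1]),    # near-identical vectors
                cosine([1, 0], [0, 1]),      # orthogonal vectors
                cosine([1, 0], [0.5, 0.5])]  # in between
human_ratings = [9.0, 1.0, 5.0]
print(spearman(model_scores, human_ratings))  # 1.0: same ordering
```

A score near 0 on WS353, as reported above, means the model's similarity ordering is essentially uncorrelated with the human ratings.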
I see this in the output of ./bidmach testword2vec.ssc:
score: BIDMat.FMat = -1.1464
I'm not sure what that means, but is that consistent with what you're getting? Could I be doing something wrong with my parameters to get such bad results?
Here's the script that I ran (it should be right out of the repo, but I did change from binary to non-binary format on save):
import BIDMach.networks.Word2Vec
val mdir = "../data/word2vec/data/"
val (nn, opts) = Word2Vec.learner(mdir+"train%05d.imat.lz4");
opts.nstart = 0;
opts.nend = 7;
opts.npasses = 4;
opts.batchSize = 1000000;
opts.lrate = 1e-3f
opts.vexp = 0.5f
opts.nreuse = 5
opts.dim = 300
opts.vocabSize = 100000
opts.useGPU = true;
//opts.autoReset = false;
//Mat.useMKL = false;
nn.train
val mod = nn.model.asInstanceOf[Word2Vec]
//saveFMat(mdir+"model0.fmat.lz4", FMat(mod.modelmats(0)))
//saveFMat(mdir+"model1.fmat.lz4", FMat(mod.modelmats(1)))
val test = loadIMat(mdir+"test00000.imat.lz4");
val (mm,mopts) = Word2Vec.predictor(mod, test);
mopts.useGPU = opts.useGPU
mm.predict
val score = mean(mm.results(0,0->(mm.results.ncols-2)));
val dict = loadCSMat(mdir+"dict.csmat.lz4");
Word2Vec.saveGoogleW2V(dict, FMat(mod.modelmats(0)), mdir+"googmodel.bin", false);
Most likely it's a difference in tokenization/vocabulary. Let me know exactly what you ran to get scores for both systems, so I can resolve the differences.
To generate the billion-word results I just ran your test scripts. The only thing I changed was the 3rd parameter on the saveGoogleW2V call to false.
After that, I ran this shell script:
code=hyperwords
src_model=$1
model=model
cp $src_model $model.words
python2 $code/hyperwords/text2numpy.py $model.words
echo "WS353 Results"
echo "-------------"
python2 $code/hyperwords/ws_eval.py embedding $model $code/testsets/ws/ws353.txt
echo
echo "Google Analogy Results"
echo "----------------------"
python2 $code/hyperwords/analogy_eval.py embedding $model $code/testsets/analogy/google.txt
echo
(called as ./eval.sh googmodel.bin)
Note this script calls the hyperwords python code (point the code variable at the source).
This test script is from Intel's research repo.
I modified it slightly to get it working on my system
@jcanny Just curious -- what did you use to evaluate accuracy in the past? I'm thinking the bad accuracy of the model comes from the large batch size, but smaller batch sizes yield an error:
nn: BIDMach.Learner = Learner(BIDMach.datasources.FileSource@38cdca5b,BIDMach.networks.Word2Vec@4ea0397f,null,null,null,BIDMach.networks.Word2Vec$FDSopts@791ee92c)
opts: BIDMach.networks.Word2Vec.FDSopts = BIDMach.networks.Word2Vec$FDSopts@791ee92c
opts.nstart: Int = 0
opts.nend: Int = 7
opts.npasses: Int = 4
opts.batchSize: Int = 100
opts.lrate: BIDMat.FMat = 0.00010000
opts.vexp: BIDMat.FMat = 0.50000
opts.nreuse: Int = 5
opts.dim: Int = 200
opts.vocabSize: Int = 253854
opts.useGPU: Boolean = false
nmmats = 1
pass= 0
scala.MatchError: 4065 1 584 2 432 4586 7919 2 0 131 0 66 42 4238 10 0...
30000 30000 30000 30000 30000 30000 30000 30000 30000 30000 30000 30000 30000 30000 30000 30000...
(of class BIDMat.IMat)
at BIDMach.models.Model$$anonfun$copyMats$1.apply$mcVI$sp(Model.scala:285)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at BIDMach.models.Model.copyMats(Model.scala:255)
at BIDMach.models.Model.dobatchg(Model.scala:200)
at BIDMach.Learner.nextPass(Learner.scala:145)
at BIDMach.Learner.firstPass(Learner.scala:114)
at BIDMach.Learner.retrain(Learner.scala:88)
at BIDMach.Learner.train(Learner.scala:75)
... 54 elided
mod: BIDMach.networks.Word2Vec = BIDMach.networks.Word2Vec@4ea0397f
Any idea what is causing that error (or the poor accuracy)?
@jcanny More problems to report...
I'm now trying to run this on CPU for comparison to the poor results from the GPU output, and I see that the fmt.txt file you uploaded seems to allow some junk data to appear in the output file. I see this in the model output:
estimated
competition
<FD>
Corp
ways
begin
I'm not sure what that <FD> is, but it appears a few times, and as a result a few of the saved vectors are treated as having only 299 elements instead of 300. This breaks the Hyperwords test.
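As a stopgap (my own assumption, not a real fix for the tokenizer), one could filter the text-format vectors file and drop any line whose element count doesn't match the declared dimension before handing it to Hyperwords:

```python
def filter_vectors(in_path, out_path, dim=300):
    """Keep only lines with exactly one word followed by dim values."""
    kept = dropped = 0
    with open(in_path) as fi, open(out_path, "w") as fo:
        for line in fi:
            if len(line.split()) == dim + 1:
                fo.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped

# tiny demo with dim=2: the "<FD>" line is one value short and gets dropped
with open("vecs.txt", "w") as f:
    f.write("the 1.0 2.0\n<FD> 1.0\nhouse 0.5 0.25\n")
print(filter_vectors("vecs.txt", "clean.txt", dim=2))  # (2, 1)
```

This keeps the evaluation running, though any junk tokens still polluted the training itself.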
I'm assuming that the fmt.txt is treating some special characters as words. I'm still mystified about what exactly is going on with this fmt.txt file you uploaded:
,.()&"$£€
It looks kind of like regex?
To be honest, I just want to verify that your model performs at the speed reported in "Machine Learning at the Limit" (or better on newer GPUs) and that it produces a usable set of vectors. I'm almost convinced that it performs quickly, but I'm seeing really poor results in the output. The accuracy of the GPU-generated model is near 0 in the Hyperwords tests.
How have you previously verified accuracy of the output? (This wasn't reported in the paper)
Is this the fmt.txt file you used for the paper? It seems to not work.
I'm starting to get the sense that this implementation is flawed and does not produce the kind of word vectors that Word2Vec should produce, unless there is something in the fmt.txt file that is throwing the results off.
Could you give an example of how to use BIDMach word2vec on CPU and GPU respectively?
Thanks in advance, Andy.