facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.
MIT License

test mode load only part of the testFile content #189

Closed puppyapple closed 6 years ago

puppyapple commented 6 years ago

I have successfully trained a model in TagSpace mode. But when I use the model to test on a test file containing about 5 million records, it loads only around 4,500 examples. Is there a limit on the number of test records?

ledw commented 6 years ago

@puppyapple There's no limit on test size. What command did you run to evaluate your model? Did you set the same parameters as in training? Is your test file in the same format as your training file?

puppyapple commented 6 years ago

@ledw Thanks for the reply. The commands I used are below.

Training command:
/usr/Starspace/starspace train -trainFile ../Data/Input/fasttext/train_data_weighted -validationFile ../Data/Input/fasttext/valid_data -validationPatience 20 -dim 200 -minCount 5 -minCountLabel 100 -ngrams 5 -verbose 1 -thread 32 -model tagspace_01 -label '#'

Testing command:
/usr/Starspace/starspace test -testFile ../Data/Input/fasttext/test_data -predictionFile test02 -verbose 1 -thread 32 -model tagspace_02 -label '#' -K 10

One thing I forgot to mention: my testFile does not actually have ground-truth labels (it is unlabeled data), so I put "#ID #Name" (without weights) in each record to bind the text to its ID and name. Somehow this doesn't work; I think it's because the IDs and names don't appear in the model's dictionary. Would this cause the test process to load only part of the testFile contents?

By the way, is it possible to use pre-trained word vectors in .tsv format for other training runs (in my case, text classification with trainMode 0)? I tried adding "-initModel xxx.tsv" to the training command above for trainMode 0, but got "ERROR: File '../../Data/Input/fasttext/train_data_weighted' does not contain any valid example."

puppyapple commented 6 years ago

@ledw For the first problem, I tried one approach that seems to work: in the unlabeled testFile, I added '#mylabel' to each record, where 'mylabel' is a label that appears more than minCountLabel times in my training data. When I then run 'starspace test' on the new testFile, it does output a prediction file. But the number of examples loaded for testing is still slightly less than the number of lines in the testFile (5803940 lines in the testFile, but 5803918 loaded by the 'starspace test' command; I have dropped duplicate texts for sure). Because of this, I cannot match the output with my record IDs for further analysis.

ledw commented 6 years ago

@puppyapple Yes, if a label does not appear in the train set, it is ignored at test time. StarSpace filters out invalid examples, so if a label or word is not in your dictionary (built from the train file), it is ignored. Further, if an example is left with 0 words/labels after filtering, the example itself is filtered out. In your case, it could be that some test examples contain 0 words that appeared in trainFile and got filtered.
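The filtering rule described above can be approximated offline to see which test lines are likely to be dropped. The following is a minimal sketch, not StarSpace's actual parser. It assumes the model's .tsv file has one token per line followed by tab-separated vector values, that labels carry the '#' prefix used in this thread, and that an example survives only if it keeps at least one known word and one known label. It ignores n-grams and weights. The names `load_vocab` and `surviving_line_numbers` are hypothetical helpers.

```python
from typing import Iterable, List, Set


def load_vocab(tsv_path: str) -> Set[str]:
    """Read the first tab-separated field of each non-empty line as a token.

    Assumption: the saved model .tsv stores one token per line, followed by
    its vector values.
    """
    with open(tsv_path, encoding="utf-8") as f:
        return {line.split("\t", 1)[0] for line in f if line.strip()}


def surviving_line_numbers(
    lines: Iterable[str], vocab: Set[str], label_prefix: str = "#"
) -> List[int]:
    """Return 0-based indices of lines that keep a known word and a known label.

    This only approximates the filtering described in the thread.
    """
    kept = []
    for i, line in enumerate(lines):
        # Drop every token that is not in the training dictionary.
        tokens = [t for t in line.split() if t in vocab]
        has_label = any(t.startswith(label_prefix) for t in tokens)
        has_word = any(not t.startswith(label_prefix) for t in tokens)
        if has_label and has_word:
            kept.append(i)
    return kept


# Small in-memory example with a toy vocabulary.
vocab = {"hello", "world", "#sports"}
lines = [
    "hello world #sports",     # kept: known words and a known label
    "unknown tokens #sports",  # dropped: no known word
    "hello #unseen",           # dropped: no known label
    "world #sports",           # kept
]
kept = surviving_line_numbers(lines, vocab)
```

Comparing the number of surviving indices with the "examples loaded" count reported by the test command would show whether this approximation matches what StarSpace does on a given file.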

You should be able to use -initModel xxx.tsv to initialize the model. Are you sure that -initModel xxx.tsv is the only additional parameter you pass to the training command? Is the dictionary of words and labels the same for your xxx.tsv model and the new model?

puppyapple commented 6 years ago

@ledw For the missing test examples, I think it's as you said: some examples were filtered because they have 0 words/labels in the dictionary. Since each of my examples has a unique ID, I want to attach the test results to the IDs. If some examples are filtered, I cannot link them after testing. Do you have any suggestions? (Or do I have to find the filtered examples myself by computing statistics with minCount and minCountLabel?) I think a simpler prediction output like the one mentioned in #160, plus a "-prefixID" option to specify the ID term in the data at test time, would be pretty convenient.

For the initModel situation, my case in detail is this: I have a multi-label text data set on which I want to train a classification model, to predict labels for another, larger set of unlabeled text data. I'm afraid the two data sets are not from exactly the same distribution, so I trained w2v embeddings with fastText, converted them to .tsv format, and want to use them to initialize my classification model (-trainMode 0), as a kind of semi-supervised learning. Most of the label words certainly do not appear in the w2v embeddings. I guess that's why I got the "ERROR"?

Thanks again for the reply.

puppyapple commented 6 years ago

Any good ideas for matching with IDs?

ledw commented 6 years ago

@puppyapple Sorry for the delay in replying. For the ID matching case, I think what you suggested is reasonable, i.e. adding a prefixID for test examples. If you would like, you can try to add that and send a PR (let me know if you need any help).
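Until such an option exists, a similar effect could be approximated with preprocessing outside StarSpace. The sketch below strips an ID token from each test line and keeps the IDs in a parallel list. The `__id__` prefix and the `split_ids` helper are hypothetical and not part of StarSpace. Re-attaching the IDs afterwards would also require knowing which lines were filtered, and it assumes predictions are written in input order, which this thread does not confirm.

```python
from typing import Iterable, List, Optional, Tuple


def split_ids(
    lines: Iterable[str], id_prefix: str = "__id__"
) -> Tuple[List[Optional[str]], List[str]]:
    """Separate ID tokens from each line.

    Returns (ids, cleaned_lines), both in input order. A line without an
    ID token gets None as its ID.
    """
    ids: List[Optional[str]] = []
    cleaned: List[str] = []
    for line in lines:
        tokens = line.split()
        id_tokens = [t for t in tokens if t.startswith(id_prefix)]
        # Keep the first ID token, with the prefix removed.
        ids.append(id_tokens[0][len(id_prefix):] if id_tokens else None)
        # Everything else is passed on to the test file unchanged.
        cleaned.append(" ".join(t for t in tokens if not t.startswith(id_prefix)))
    return ids, cleaned


raw = [
    "__id__42 hello world #sports",
    "no id here #news",
]
ids, cleaned = split_ids(raw)
```

The cleaned lines would be written out as the testFile, and the ID list saved alongside it for joining with the prediction file later.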

For the problem with initModel, you can try adding 0 vectors (or whatever vectors you'd like to initialize with) to the trained .tsv file for labels that do not appear in the train set. That should solve your problem.
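One way to do this padding is sketched below. It assumes each .tsv row is a token followed by tab-separated floats, and that the vector dimension matches the -dim value used for training (200 in the command earlier in this thread). `pad_embeddings` is a hypothetical helper. Whether -initModel accepts the resulting file without further changes is not confirmed here.

```python
from typing import Iterable, List


def pad_embeddings(tsv_lines: Iterable[str], missing: Iterable[str]) -> List[str]:
    """Append a zero-vector row for each token absent from the embedding rows.

    The dimension is inferred from the first row. Tokens already present,
    or listed more than once in `missing`, are added at most once.
    """
    rows = [ln.rstrip("\n") for ln in tsv_lines if ln.strip()]
    if not rows:
        raise ValueError("embedding file is empty")
    dim = len(rows[0].split("\t")) - 1
    known = {r.split("\t", 1)[0] for r in rows}
    zero = "\t".join(["0.0"] * dim)
    for token in missing:
        if token not in known:
            rows.append(f"{token}\t{zero}")
            known.add(token)
    return rows


# Toy 2-dimensional embeddings; '#pets' is a label missing from them.
padded = pad_embeddings(
    ["cat\t0.1\t0.2\n", "dog\t0.3\t0.4\n"],
    ["#pets", "cat", "#pets"],
)
```

The padded rows can then be written back to a new .tsv file, one row per line, and passed to -initModel.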

puppyapple commented 6 years ago

@ledw Thanks. I'm not so familiar with C++ but I will give it a try.