bnosac / ruimtehol

R package to Embed All the Things! using StarSpace
Mozilla Public License 2.0

Predict method returning duplicate results #1

Closed bob-rietveld closed 5 years ago

bob-rietveld commented 5 years ago

Thanks for this package. It has really helped me integrate StarSpace into my workflow. My question: when I run predict on a model (trained with trainMode = 1), I get a nice data frame with possible labels for my dataset. The data frame, however, contains duplicate results (e.g. the same labels with the same probabilities). Is this intended behaviour of StarSpace, or an implementation feature/bug? Best, Bob

jwijffels commented 5 years ago

I also saw that appearing when testing the models and the predict functionality (using another trainMode), and I asked myself the same question. I am still reading the paper thoroughly to understand why this happens.

jwijffels commented 5 years ago

Can you share your flow / code?

bob-rietveld commented 5 years ago

Hi,

Below is my (very simple) code for evaluating a sentiment model. I can send you the data files if you like.

I upgraded to the latest version, but I see that evaluation is still under construction (in the model_eval object). Do I need special parameters to evaluate the model, or should I just be patient? ;-)

StarSpace now provides hit@1 and similar evaluation metrics, but it would be nice if I were able to compute my own evaluation metrics (perhaps using something like yardstick, https://tidymodels.github.io/yardstick/index.html). Do you think that is possible?

Thanks again for the nice work on this package.

library(ruimtehol)

# train StarSpace model, see https://github.com/facebookresearch/StarSpace for options
model <- starspace(file = "data/sentiment_train.txt",
                   trainMode = 0,
                   epoch = 15)

# save model
starspace_save_model(model,
                     file = "model/aspect_embeddings.tsv")

# inspect model
embeddings <- data.table::fread("model/aspect_embeddings.tsv")

# evaluate model
model_eval <- starspace(file = "data/sentiment_test.txt",
                        model = "textspace.bin",
                        trainMode = 0)

jwijffels commented 5 years ago

For evaluation, there is still some work to do, but you can use ruimtehol:::textspace(model$model, testFile = "path/to/testfile", OTHER ARGS) to get the StarSpace evaluation metrics while passing all the detailed StarSpace arguments. If you are just using it for classification, there are many other R packages that can calculate evaluation metrics for classification models. The only thing I am planning to add to this R package is a shorthand for that type of call.
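For classification, the basic metrics are easy to compute in base R once you have the predicted and true labels. A minimal sketch, where truth and pred are hypothetical label vectors standing in for your test labels and the output of predict():

```r
# Hypothetical label vectors; with ruimtehol you would take 'truth' from
# your test file and 'pred' from predict(model, ...).
truth <- c("pos", "neg", "pos", "pos", "neg", "neg")
pred  <- c("pos", "neg", "neg", "pos", "neg", "pos")

# Confusion matrix: rows are the true labels, columns the predicted labels
tab <- table(truth = truth, predicted = pred)

# Overall accuracy: proportion of observations on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)

# Precision and recall for the "pos" class
precision <- tab["pos", "pos"] / sum(tab[, "pos"])
recall    <- tab["pos", "pos"] / sum(tab["pos", ])
```

Packages like yardstick or caret wrap the same calculations with more polish, but nothing ruimtehol-specific is needed.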

jwijffels commented 5 years ago

By the way, if you want the embeddings, you can just do as.matrix(model)
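The matrix returned by as.matrix(model) is a plain numeric matrix with one row per term/label, so you can work with it directly in base R, e.g. to compare embeddings by cosine similarity. A toy sketch, where a small random matrix stands in for the real StarSpace embeddings:

```r
# Toy matrix standing in for as.matrix(model): 4 "terms", 5 dimensions
set.seed(42)
embeddings <- matrix(rnorm(4 * 5), nrow = 4,
                     dimnames = list(c("good", "bad", "great", "awful"), NULL))

# Cosine similarity between two embedding vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(embeddings["good", ], embeddings["great", ])
```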

jwijffels commented 5 years ago

Hi @good-marketing. Probably the reason you got duplicates is that you built a model with trainMode = 0 (i.e. classification) and did not set K. K indicates how many predictions you want; the default is 5. If you have only 2 classes in your sentiment analysis, that does not make sense, so set starspace(..., K = 2) if it is a binary sentiment classification. Currently, you can only set K when you train the model, not when you predict.
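As a workaround, duplicate rows can also be dropped from the prediction data frame after the fact. A sketch, where preds is a hypothetical data frame standing in for the output of predict():

```r
# Hypothetical prediction data frame with duplicated label/probability rows,
# standing in for the output of predict() on a ruimtehol model
preds <- data.frame(label = c("pos", "neg", "pos", "neg", "pos"),
                    prob  = c(0.8, 0.2, 0.8, 0.2, 0.8))

# Keep each (label, prob) combination only once
preds_unique <- preds[!duplicated(preds), ]
```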

jwijffels commented 5 years ago

Closing; please use the k argument in the predict functionality.