facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.

question on optimal settings of hyperparameters #243

Closed · jwijffels closed this issue 5 years ago

jwijffels commented 5 years ago

Hi @ledw, I have a general question about how you would advise tuning the hyperparameters of StarSpace models. It would be nice if you could describe your general approach to tweaking the hyperparameters in order to get an optimal model, based mainly on looking at the loss. I generally need to tweak them to get a good model:

  1. Some hyperparameters which do influence the general range of the loss

    • type of loss (hinge/softmax)
    • margin (in case of hinge loss, most of the time I need to increase this to values a lot higher than the default before the model learns anything)
    • similarity metric (cosine/dot product)
    • negSearchLimit
    • dimension of the embeddings
  2. Some hyperparameters not affecting the general range of the loss

    • learning rate lr
    • the number of epochs
    • adagrad

My general approach is to start from some sensible settings that work, look at the evolution of the loss over the epochs to check that the model learns something (the loss on validation data steadily decreases; an example of such a graph is the attached Rplot), and then manually inspect some embedding similarities between labels and terms in the model to see whether the embeddings really make sense. But when I then want to compare across the different hyperparameter settings enumerated in point 1, it is hard because they change the range of the loss.
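A minimal sketch of that manual inspection step, assuming the model was also saved in the default .tsv layout (entity in the first column, the embedding values in the remaining tab-separated columns); the file path, label and terms below are only placeholders:

```python
# Cosine similarity between a label embedding and a few term embeddings,
# read from the StarSpace .tsv model dump (entity \t v1 \t v2 \t ...).
import numpy as np

def load_embeddings(path):
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            emb[parts[0]] = np.array(parts[1:], dtype=float)
    return emb

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = load_embeddings("model.tsv")                 # placeholder path
label = emb["__label__sports"]                     # placeholder label
for term in ("football", "election", "goal"):      # placeholder terms
    if term in emb:
        print(term, round(cosine(label, emb[term]), 3))
```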

I would like to know your general approach to finding the best settings for loss/margin/similarity metric/negSearchLimit, given that changing these parameters also changes the range of the loss. Many thanks for any input.
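For concreteness, a minimal sketch of the kind of sweep over the point-1 hyperparameters I mean, calling the starspace binary from Python; the flag names follow the StarSpace README, but the file paths and value grids are only placeholders:

```python
# Train one model per combination of loss / margin / similarity / negSearchLimit.
import itertools
import subprocess

grid = {
    "loss": ["hinge", "softmax"],
    "margin": [0.05, 0.2, 0.8],
    "similarity": ["cosine", "dot"],
    "negSearchLimit": [10, 50],
}

for loss, margin, sim, neg in itertools.product(*grid.values()):
    model = f"model_{loss}_{margin}_{sim}_{neg}"
    subprocess.run([
        "./starspace", "train",
        "-trainFile", "train.txt",        # placeholder path
        "-validationFile", "valid.txt",   # placeholder path
        "-model", model,
        "-loss", loss,
        "-margin", str(margin),
        "-similarity", sim,
        "-negSearchLimit", str(neg),
        "-dim", "20",
        "-epoch", "20",
    ], check=True)
```

The difficulty is then how to rank these runs against each other, since the validation losses are not on the same scale.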

tharangni commented 5 years ago

I noticed that the batch size also affects the loss. Smaller batches (5-16) converged faster, whereas larger batches (300) just oscillated within a range and didn't converge at all (dim = 20).

jwijffels commented 5 years ago

@tharangni, the question is not about the speed of convergence but about the size of the loss.

tharangni commented 5 years ago

@jwijffels Ah, I should have phrased that better, apologies. What I meant was that the batch size affects the magnitude of the loss, i.e. larger batch sizes give a higher loss.
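A toy illustration of what I mean, under the assumption (not verified against the StarSpace internals) that the reported loss is summed over the examples in a batch rather than averaged:

```python
# Same per-example loss, different reported magnitude if the batch loss is summed.
per_example_loss = 0.4
for batch_size in (5, 16, 300):
    print(batch_size, "summed:", per_example_loss * batch_size,
          "averaged:", per_example_loss)
```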

ledw commented 5 years ago

@jwijffels Hi, sorry for the delay in responding. I would recommend comparing the hyperparameters on an evaluation metric computed on a validation set rather than on the validation loss. For instance, in the fb15k example you can test different hyperparameters by optimizing the hit@10 metric on the validation dataset.
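A minimal sketch of that idea: run `starspace test` on each candidate model against the validation file (flag names follow the README) and keep the model with the highest hit@10. The exact output format can differ between versions, so the regex and the file/model names here are only placeholders:

```python
import re
import subprocess

def hit_at_10(model_path, valid_file="valid.txt"):
    # Run StarSpace evaluation and pull the hit@10 value out of its stdout.
    out = subprocess.run(
        ["./starspace", "test", "-model", model_path, "-testFile", valid_file],
        capture_output=True, text=True, check=True,
    ).stdout
    m = re.search(r"hit@10:\s*([0-9.]+)", out)
    return float(m.group(1)) if m else None

candidates = ["model_hinge_0.2_cosine_50", "model_softmax_0.05_dot_10"]  # placeholders
best = max(candidates, key=lambda name: hit_at_10(name) or 0.0)
print("best model by validation hit@10:", best)
```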

jwijffels commented 5 years ago

Ok, thanks for the feedback.