NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

DSSM and data balance #736

Closed rajicon closed 5 years ago

rajicon commented 5 years ago

I have a matching problem with two sets of documents. To train a DSSM model, I created a dataset that takes each matching document pair and adds 5 negative samples (random pairs) per positive pair. This is then split into a training set and a test set. I then used the Classification task.
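Concretely, the sampling step looks roughly like this (a minimal sketch; `build_pairs` and the `text_left`/`text_right`/`label` column names are illustrative, not part of MatchZoo's API):

```python
import random

import pandas as pd

def build_pairs(positive_pairs, right_docs, n_neg=5, seed=42):
    """Each true match plus n_neg random negatives (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for left, right in positive_pairs:
        rows.append({"text_left": left, "text_right": right, "label": 1})
        for _ in range(n_neg):
            neg = rng.choice(right_docs)
            while neg == right:  # avoid drawing the true match as a negative
                neg = rng.choice(right_docs)
            rows.append({"text_left": left, "text_right": neg, "label": 0})
    return pd.DataFrame(rows)
```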

When testing the model on the test set, it performs well, better than TF-IDF:

method | precision | recall | fscore
--- | --- | --- | ---
TFIDF | .8284 | .3698 | .5113
DSSM | .7323 | .6925 | .7119

However, I then evaluated the model on the original set, which is basically every possible pair, labeled 1 for a match and 0 for a non-match. In this set, most pairs are not matches, and DSSM now does much worse than TF-IDF:

method | precision | recall | fscore
--- | --- | --- | ---
TFIDF | .1982 | .3497 | .1900
DSSM | .0464 | .7785 | .0805
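The full evaluation set is built along these lines (again an illustrative sketch, reusing the column layout from above):

```python
from itertools import product

import pandas as pd

def build_full_eval_set(left_docs, right_docs, true_matches):
    """Label every possible (left, right) pair: 1 if a known match, else 0."""
    match_set = set(true_matches)  # set of (left, right) tuples
    rows = [
        {"text_left": l, "text_right": r, "label": int((l, r) in match_set)}
        for l, r in product(left_docs, right_docs)
    ]
    return pd.DataFrame(rows)
```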

I suspect this is due to the imbalance between matches and non-matches, and I was wondering how to handle these results. Should I reduce the number of negative samples so the model learns equally from positive and negative pairs? Should I increase it, so the model learns to generally predict non-matches? Do these results suggest some error in my methodology?
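One thing I am also wondering about: since training used a 1:5 positive/negative ratio, the model's default 0.5 decision threshold is calibrated to that prior, so it may be far too permissive on the much more imbalanced full set, which would fit the high-recall/low-precision pattern above. A minimal sketch of sweeping the threshold (`scores` and `labels` are hypothetical placeholders for the model's match probabilities and the true labels, not MatchZoo output):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical placeholders: replace with real model scores and true labels
# from the imbalanced evaluation set.
rng = np.random.default_rng(0)
labels = (rng.random(10_000) < 0.01).astype(int)            # ~1% positives
scores = np.clip(0.6 * labels + 0.5 * rng.random(10_000), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(labels, scores)

# F1 at each candidate threshold; pick the best one instead of a fixed 0.5.
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = int(np.argmax(f1[:-1]))  # the last P/R point has no threshold
print(f"best threshold = {thresholds[best]:.3f}, f1 = {f1[best]:.3f}")
```

If the best threshold turns out to be far from 0.5, that would point to calibration under the changed class prior rather than a methodology error.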

bwanglzu commented 5 years ago

Hi @rajicon, your model was trained and then evaluated on both the training and validation sets, is that correct?

During training, they should show roughly the same level of performance. I don't understand:

  1. After training, why do you evaluate the model on the original training set again?
  2. If the datasets are the same, the results should be the same as well, since the evaluation metrics are independent of the model.

uduse commented 5 years ago

I hope things are working well for you now. I’ll go ahead and close this issue, but I’m happy to continue further discussion whenever needed.