UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Some checks on training with numpy #18

Closed alejandrojcastaneira closed 5 years ago

alejandrojcastaneira commented 5 years ago

Hi, I'm training an STS model using this code, but on my own domain data:

I'm getting these warnings:

/numpy/lib/function_base.py:2534: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/scipy/stats/_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in greater
  return (a < x) & (x < b)
/scipy/stats/_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in less
  return (a < x) & (x < b)
/scipy/stats/_distn_infrastructure.py:1892: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= _a)
/numpy/lib/function_base.py:2535: RuntimeWarning: invalid value encountered in true_divide

Then the similarity scores computed at every epoch all come out as nan:

Cosine-Similarity : Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Manhattan-Distance: Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Euclidean-Distance: Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Dot-Product-Similarity: Pearson: nan Spearman: nan

The examples with the STSbenchmark work very well! I'm only changing the train and dev set files, but I couldn't train on this data. Could this be related to the vocabulary of the word embeddings? Maybe it doesn't contain some of the words from my corpus.

Best regards

alejandrojcastaneira commented 5 years ago

I believe I found the cause of this. While experimenting with custom dev and test data, I created synthetic examples that all had the cosine similarity score set to the maximum, so the standard deviation of the gold scores on the dev set and the test set was 0.

The formula for Pearson's product-moment correlation coefficient divides the covariance between the model's predicted scores and the gold-standard scores on the dev and test sets by the product of their standard deviations.

Since the gold scores in my dev set and test set had zero variance, their standard deviation was also zero. That's why I got the true_divide warning. Introducing a small variance or using other data solved the problem. Sorry for my misjudgment, this is a great library! Best regards
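
The explanation above can be reproduced in a few lines. The following is a minimal sketch using only NumPy, not the library's evaluator code; the `pearson` helper and the sample scores are made up for illustration. It shows that a constant gold-score array makes the denominator zero and the result nan, while gold scores with some spread give a finite correlation.

```python
import numpy as np

def pearson(pred, gold):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    cov = np.mean((pred - pred.mean()) * (gold - gold.mean()))
    denom = pred.std() * gold.std()
    # With constant gold scores both cov and denom are 0, so this is 0/0 -> nan.
    # The warning is silenced here only to keep the demo output clean.
    with np.errstate(invalid="ignore", divide="ignore"):
        return cov / denom

pred = [0.2, 0.7, 0.9]            # illustrative model similarities
constant_gold = [1.0, 1.0, 1.0]   # every label set to the maximum
varied_gold = [0.1, 0.6, 1.0]     # labels with some spread

r_constant = pearson(pred, constant_gold)
r_varied = pearson(pred, varied_gold)
print("constant gold:", r_constant)
print("varied gold:", r_varied)
```

`np.corrcoef(pred, constant_gold)` hits the same 0/0 division, which is where the `c /= stddev[:, None]` warning in the first post comes from.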

olastor commented 4 years ago

@alejandrojcastaneira Thank you for your analysis. I have the same problem, and I think your explanation saves me a lot of time investigating.

mhsamavatian commented 4 years ago

Same problem. Accurate explanation. Thanks.

bierik commented 4 years ago

@alejandrojcastaneira Thank you for your explanation. I have exactly the same problem. I get a /scipy/stats/stats.py:3508: PearsonRConstantInputWarning: An input array is constant; the correlation coefficent is not defined. which results in a nan value for the Cosine-Similarity Pearson and Spearman. I am using the Opusparcus German train and dev sets to continue training an NLI-trained German model. I tried to understand your answer, but my maths skills are not good enough to follow it. Can you give me some advice on how to get rid of the nan value?
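
One way to check whether a dev or test set triggers this warning is to look at the spread of its gold scores before evaluating. The sketch below is only an illustration under that assumption: `labels_are_constant` is a hypothetical helper, not part of sentence-transformers or SciPy, and the label lists are made up.

```python
import numpy as np

def labels_are_constant(labels, tol=1e-12):
    """Return True if the gold scores have (numerically) zero spread."""
    labels = np.asarray(labels, dtype=float)
    # An empty set has no spread either; check it first to avoid std() on no data.
    return labels.size == 0 or float(labels.std()) <= tol

all_paraphrases = [1.0, 1.0, 1.0, 1.0]   # e.g. a dev set containing only positive pairs
mixed_scores = [0.0, 0.4, 0.8, 1.0]      # a dev set with graded scores

print(labels_are_constant(all_paraphrases))
print(labels_are_constant(mixed_scores))
```

If the check returns True, Pearson and Spearman are undefined for that set, so the dev data needs examples with different gold scores for the correlation to be meaningful.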

Vadbeg commented 4 years ago

Yep! Thanks a lot @alejandrojcastaneira. In my case I solved this problem by sampling the labels from a uniform distribution.

import numpy as np
# Draw a random score in [0.8, 1.0) so the labels are no longer constant
label = np.random.uniform(low=0.8, high=1.0)
input_example = InputExample(texts=texts, label=label)