NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.

RankCrossEntropyLoss produces NaN #772

Closed jibrilfrej closed 5 years ago

jibrilfrej commented 5 years ago

Describe the bug

RankCrossEntropyLoss produces NaN with some models (at least DUET and DRMMTKS).

It does not happen all the time: sometimes the model trains without problems, and other times the loss becomes NaN after a few epochs.

I do not have this problem with RankHingeLoss, but model performance is poor compared to RankCrossEntropyLoss (when it does not fail).

The problem appears MORE frequently on larger training sets (unfortunately I cannot share them).

The problem appears LESS frequently when I set model.params['embedding_trainable'] to False (see the NaN weight check sketched after this list).

The problem appears on BOTH CPU and GPU.

I tried reducing the learning rate, but it did not solve the problem.
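
Since freezing the embeddings makes the failure rarer, the trainable weights seem to be where the NaNs first show up. A minimal check, assuming MatchZoo exposes the compiled Keras model through model.backend (an assumption on my side, not something I verified for every version):

import numpy as np

# After a failed run, report every layer whose weights contain NaN.
# Assumption: `model.backend` is the underlying compiled Keras model.
for layer in model.backend.layers:
    for weights in layer.get_weights():
        if np.isnan(weights).any():
            print(layer.name, 'contains NaN weights')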

Here is a piece of code to reproduce the bug. As I said before, it does not happen every time: I ran the code below 5 times and it produced NaN 4 out of 5 times (around epoch 120).

import matchzoo as mz

# WikiQA as a ranking task, trained with rank cross entropy (2 negatives)
train_pack = mz.datasets.wiki_qa.load_data('train', task='ranking')
task = mz.tasks.Ranking(loss=mz.losses.RankCrossEntropyLoss(num_neg=2))
embedding = mz.datasets.embeddings.load_glove_embedding(dimension=300)

# mz.auto.prepare wires up the model, preprocessor, and generator builder
model, preprocessor, data_generator_builder, embedding_matrix = mz.auto.prepare(
    task=task,
    model_class=mz.models.duet.DUET,
    data_pack=train_pack,
    embedding=embedding,
)

train_preprocessed = preprocessor.transform(train_pack, verbose=0)
train_gen = data_generator_builder.build(train_preprocessed)

# The loss typically becomes NaN around epoch 120
model.fit_generator(train_gen, epochs=200)
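
While debugging, it helps to stop training the moment the loss goes bad, so the failing epoch is easy to pin down. A minimal sketch using Keras' stock TerminateOnNaN callback, assuming MatchZoo's fit_generator forwards extra keyword arguments such as callbacks to the underlying Keras model (I have not verified this for every version):

from keras.callbacks import TerminateOnNaN

# TerminateOnNaN ends training as soon as a batch loss is NaN or inf.
# Assumption: fit_generator passes `callbacks` through to Keras.
model.fit_generator(train_gen, epochs=200, callbacks=[TerminateOnNaN()])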


jibrilfrej commented 5 years ago

Solved by adding an epsilon to the log in the cross-entropy loss:

(/losses/rank_cross_entropy_loss.py line 51)

Original: return -K.mean(K.sum(labels * K.log(K.softmax(logits)), axis=-1))

New: return -K.mean(K.sum(labels * K.log(K.softmax(logits) + np.finfo(float).eps), axis=-1))
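
For context on why this works: when one logit is much larger than the others, the softmax outputs for the remaining candidates underflow to exactly 0.0, K.log returns -inf, and the mean turns the loss (and its gradients) into NaN. A small NumPy sketch of the failure and of the epsilon clamp:

import numpy as np

# With a large enough gap between logits, the smaller softmax outputs
# underflow to exactly 0.0; this happens even sooner in float32, the
# default dtype in Keras.
logits = np.array([1000.0, 0.0, 0.0])
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()

print(probs)                                # [1. 0. 0.] -- exact zeros
print(np.log(probs))                        # [0. -inf -inf] -> NaN loss
print(np.log(probs + np.finfo(float).eps))  # finite everywhere (~ -36)

A common alternative is to compute the log-softmax directly via the log-sum-exp trick, log_softmax(x) = x - logsumexp(x), which sidesteps the underflow instead of clamping it.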

uduse commented 5 years ago

Seems like a reasonable fix. Would you mind opening a PR with this change?

jibrilfrej commented 5 years ago

I just did it (pull request #776)

uduse commented 5 years ago

Great job! I approved it, and it will be merged soon.