Closed: mhk257 closed this issue 4 years ago
Same result with a smaller learning rate:
number of samples per category: [ 0. 18. 24. 19. 12. 20. 9.]
global loss: 13.1576300
metric_loss: 437102.5625000
Iteration 3100: Loss training domains 616322.2
Iteration 3100: Accuracy training domains 29.498146
Unseen Target Validation results: Iteration 3100, Loss: 63254.097656, Accuracy: 0.124023
Current best accuracy 0.47900390625
Hi,
I tried to run your code with TensorFlow 1.8. The source loss diverges after about 100 iterations and keeps increasing; I don't know what's going on.
See below:
masf_art_painting.mbs_128.inner1e-05.outer1e-05.clipNorm2.0.metric1e-05.margin20.0
('number of samples per category:', array([ 0., 19., 23., 20., 13., 16., 14.], dtype=float32))
global loss: 10.5261040
metric_loss: 282112.7500000
Iteration 303: Loss training domains 361542.0
Iteration 303: Accuracy training domains 35.095856
Your immediate response will be of great help!
Thanks,
Hi,
I think the problem might be that the number of samples per category hits 0 for the first class (note the leading zero in your log). Please try to constrain the sampling so that every class has a non-zero count; a sketch of one way to do this follows.
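A minimal sketch of one way to enforce a non-zero count per class, assuming integer class labels; the helper name and the rejection-sampling strategy are illustrative assumptions, not code from this repo:

import numpy as np

def sample_batch_all_classes(labels, batch_size, num_classes, rng=np.random):
    # Redraw the mini-batch until every class appears at least once,
    # so no per-class count is zero. Assumes batch_size >= num_classes
    # and that every class occurs somewhere in `labels`.
    labels = np.asarray(labels)
    while True:
        idx = rng.choice(len(labels), size=batch_size, replace=False)
        counts = np.bincount(labels[idx], minlength=num_classes)
        if counts.min() > 0:
            return idx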
@mhk257 and @carrenD: I got the error
import special_grads
ModuleNotFoundError: No module named 'special_grads'
and
from lib.utils import conv_block, fc, max_pool, lrn, dropout
ModuleNotFoundError: No module named 'lib'
How did you manage to run the repo? I also tried TensorFlow 1.8 with Python 3.6 and 2.7.
Hi, sorry for the typo.
I fixed it by commenting out
# import special_grads  # ModuleNotFoundError: No module named 'special_grads'
and changing
from lib.utils import conv_block, fc, max_pool, lrn, dropout
to
from utils import conv_block, fc, max_pool, lrn, dropout
Please update it in the master branch.
Thanks very much, updated.
@mhk257 let me know if you already fixed it. Thanks
I guess the issue is caused by tf.one_hot in the data_generator.py file.
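For context, here is one plausible failure mode that would match the zero count for the first class in the logs above (my illustration, not verified against data_generator.py): if the labels are 1-indexed but tf.one_hot expects 0-indexed classes, class 0 never receives a count and out-of-range labels silently become all-zero rows.

import tensorflow as tf

labels = tf.constant([1, 2, 7])         # 1-indexed labels for 7 classes
one_hot = tf.one_hot(labels, depth=7)   # label 7 is out of range -> all-zero row
counts = tf.reduce_sum(one_hot, axis=0)

with tf.Session() as sess:              # TF 1.x style, as used in this thread
    print(sess.run(counts))             # [0. 1. 1. 0. 0. 0. 0.] -- class 0 empty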
This thread does not answer satisfactorily why the source loss is diverging. I faced the same issue and fixed it by replacing the utils.xent function with the following definition, which computes the cross-entropy via a log-softmax (logsumexp) so it stays numerically stable with large logits:

import tensorflow as tf

def xent(pred, label):
    # Stable log-softmax: pred - logsumexp(pred) avoids overflow in exp().
    return tf.reduce_mean(-tf.cast(label, tf.float32) * (pred - tf.expand_dims(tf.math.reduce_logsumexp(pred, axis=-1), axis=-1)))
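A quick sanity check of the difference (my own illustrative snippet, TF 1.x session style, reusing the xent defined above): with a large logit, the naive log(softmax) blows up while the stable form stays finite.

import tensorflow as tf

logits = tf.constant([[1000.0, 0.0]])
labels = tf.constant([[0.0, 1.0]])

naive = tf.reduce_mean(-labels * tf.log(tf.nn.softmax(logits)))  # softmax underflows to 0, log(0) -> -inf
stable = xent(logits, labels)                                    # finite (500.0 here)

with tf.Session() as sess:
    print(sess.run([naive, stable]))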