Hello,
I would like to ask the authors why EMANet suffers from the vanishing / exploding gradients inherent in RNNs even though the EM iterations are unrolled for only a small number of steps (in this case 3). Vanilla RNNs with tanh non-linearities can typically handle sequences on the order of 100 time steps, and LSTMs can handle sequences on the order of 1000 time steps.
Since the mIOU peaks at a very small value of T_train, are vanishing / exploding gradients really the reason that the mIOU deteriorates for higher values of T_train (>3)? Have the authors by any chance printed the gradient norms of every layer to check for vanishing or exploding gradients?
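For reference, the per-layer gradient-norm check I have in mind is only a few lines of PyTorch. This is a minimal sketch using a hypothetical toy model (not EMANet itself); the same loop applies to any `nn.Module` after a backward pass:

```python
import torch
import torch.nn as nn

# Toy stand-in model; substitute the actual EMANet module here.
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 1))

# One forward/backward pass on dummy data to populate .grad fields.
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Per-parameter gradient norms: consistently tiny values across layers
# would suggest vanishing gradients, huge values exploding gradients.
grad_norms = {
    name: param.grad.norm().item()
    for name, param in model.named_parameters()
    if param.grad is not None
}
for name, norm in grad_norms.items():
    print(f"{name}: {norm:.3e}")
```

Logging these norms per layer (e.g. to TensorBoard) across training runs with different T_train values would show directly whether the gradients degrade as the unroll length grows.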
Thank you in advance.