The previous code unfortunately passed a test case. When all sequences are shorter than the maximum length, the first dimension of `self.noise` has size 1 at the first time step under the TrimZero algorithm. The (lazy) Dropout's `self.noise` is then, presumably, copied across time steps by singleton expansion, so the error `incorrect size: only supporting singleton expansion (size=1)` is avoided, since the first dimension of `self.noise` is always 1.
Note that since Bayesian GRU with TrimZero should use monotonic sampling (the same dropout samples across a batch) for its dropouts, the performance is unchanged as long as the error is not triggered, which depends on the distribution of sequence lengths.
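The expansion behavior can be illustrated with a minimal sketch. This is written in PyTorch rather than the project's actual Lua Torch code, so the shapes, names, and error text are assumptions for illustration only, not the real `self.noise` handling: a noise mask whose first dimension is 1 expands cleanly across time steps, while any other size triggers the analogous size error.

```python
import torch

batch, hidden, max_len = 4, 8, 10

# Dropout noise sampled once with a singleton first dimension, as at the first
# TrimZero time step when every sequence is shorter than max_len.
noise_singleton = torch.bernoulli(torch.full((1, batch, hidden), 0.5))

# Expansion succeeds: the same mask is effectively copied across all time steps.
expanded = noise_singleton.expand(max_len, batch, hidden)
print(expanded.shape)  # torch.Size([10, 4, 8])

# If the first dimension were some trimmed length other than 1 (e.g. 7),
# expanding to max_len fails, analogous to Torch's
# "incorrect size: only supporting singleton expansion (size=1)".
noise_trimmed = torch.bernoulli(torch.full((7, batch, hidden), 0.5))
try:
    noise_trimmed.expand(max_len, batch, hidden)
except RuntimeError as e:
    print("expand failed:", e)
```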