Thank you, and it's great to hear SRU is useful for you.
Could you please share more details, such as:
how much of an improvement did you observe using zoneout, and for what NLP task(s)?
did you use a binary mask or a float mask for the zoneout implementation (new_c[t] = c[t]*mask + c[t-1]*(1-mask))?
is the shape of the mask per layer (length, batch_size) or (length, batch_size, hidden_size)? i.e., does a mask value apply to an entire vector or to a single dimension of the vector?
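For reference, a minimal PyTorch sketch of both mask variants, using a hypothetical `zoneout` helper (not code from SRU or haste): `per_dim=True` samples one mask value per hidden dimension, while `per_dim=False` broadcasts a single value over the whole hidden vector.

```python
import torch

def zoneout(c_prev, c_new, p=0.1, per_dim=True, training=True):
    """Mix the previous and new recurrent state (Krueger et al., 2016).

    Hypothetical helper: with probability p each unit keeps its previous
    value c_prev instead of adopting the new value c_new.
    """
    if not training:
        # Inference uses the expected value of the stochastic mix.
        return (1 - p) * c_new + p * c_prev
    # per_dim=True  -> one mask value per hidden dimension
    # per_dim=False -> one mask value broadcast over the whole vector
    shape = c_new.shape if per_dim else c_new.shape[:-1] + (1,)
    keep_new = torch.bernoulli(torch.full(shape, 1 - p, device=c_new.device))
    return keep_new * c_new + (1 - keep_new) * c_prev
```

It would be called once per time step inside the recurrence, e.g. `c_t = zoneout(c_prev, c_t, p=0.1, training=self.training)`.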
I do very long, almost unbounded sequence labeling on legal documents. Almost all RNN architectures are prone to overfitting since my training set is small. On my task, the best result using SRU after an exhaustive grid search over the hyperparameters is an F1 of 0.97. The PyTorch LSTM also gets an F1 of 0.97, but it is slower. Using the haste LSTM with zoneout I can reach an F1 of 0.98 (0.97 without).
Hello,
This library is vital to our pipeline; we got a great speedup and a performance improvement compared to LSTM. Thanks @taolei87 and the asappresearch team.
One thing I was experimenting with in the also excellent haste library (https://github.com/lmnt-com/haste) is its Zoneout (https://arxiv.org/abs/1606.01305) support. It really improved all our metrics compared to regular dropout.
Is this something that could make it into SRU?
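For context, here is a rough sketch of where zoneout could slot into a simplified SRU-style recurrence. This is only an illustration under my assumptions (plain per-timestep Python with precomputed gate activations `f_t`, `r_t` and candidate `Wx_t`), not SRU's actual fused CUDA kernel or API:

```python
import torch

def sru_step_with_zoneout(x_t, c_prev, Wx_t, f_t, r_t, zoneout_p, training):
    """One step of a simplified SRU-style recurrence with zoneout added.

    Illustrative sketch only. Wx_t, f_t, r_t are assumed to be the
    precomputed candidate, forget-gate, and reset-gate activations.
    """
    c_t = f_t * c_prev + (1.0 - f_t) * Wx_t          # standard cell update
    if training and zoneout_p > 0:
        # Zoneout: randomly keep the previous cell state per unit.
        keep_prev = torch.bernoulli(torch.full_like(c_t, zoneout_p))
        c_t = keep_prev * c_prev + (1.0 - keep_prev) * c_t
    elif zoneout_p > 0:
        # Inference: use the expected value of the stochastic mix.
        c_t = zoneout_p * c_prev + (1.0 - zoneout_p) * c_t
    h_t = r_t * c_t + (1.0 - r_t) * x_t              # highway-style output
    return h_t, c_t
```

The real change would presumably live inside SRU's elementwise recurrence kernel, analogous to how haste applies zoneout inside its LSTM kernels.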