asappresearch / sru

Training RNNs as Fast as CNNs (https://arxiv.org/abs/1709.02755)
MIT License

Zoneout support #137

Open bratao opened 4 years ago

bratao commented 4 years ago

Hello,

This library is vital to our pipeline; we got a great speedup and a performance improvement compared to LSTM. Thanks @taolei87 and the asappresearch team.

One thing I was experimenting with in the also-great haste library (https://github.com/lmnt-com/haste) is its Zoneout (https://arxiv.org/abs/1606.01305) support. It really improved all our metrics compared to regular dropout.
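
For reference, a minimal usage sketch of what I mean, based on my reading of haste's README (the exact signature may differ, and `zoneout=0.1` is just an illustrative value):

```python
# Hypothetical usage based on haste's README; verify the exact
# signature against https://github.com/lmnt-com/haste.
import torch
import haste_pytorch as haste

rnn = haste.LSTM(input_size=128, hidden_size=256, zoneout=0.1)
rnn.cuda()  # haste's kernels run on GPU

x = torch.rand(50, 32, 128).cuda()  # (length, batch, input_size)
y, state = rnn(x)
```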

Is this something that could make it into SRU?

taoleicn commented 4 years ago

Hi @bratao ,

Thank you and it's great to hear SRU is useful for you.

Could you please share more details, such as:

  1. how much of an improvement did you observe using zoneout? for what NLP task(s)?
  2. did you use a binary mask or a float mask for the zoneout implementation (new_c[t] = c[t] * mask + c[t-1] * (1 - mask))?
  3. is the shape of the mask per layer (length, batch_size) or (length, batch_size, hidden_size)? i.e. does a mask value apply to an entire state vector or to each dimension of the vector? (A sketch of both variants follows this list.)
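
A minimal sketch of the two mask-shape variants in question, assuming a binary Bernoulli mask as in the zoneout paper (illustrative code, not SRU's implementation; `zoneout_step` and its arguments are hypothetical names):

```python
# Illustrative only; not SRU code. Assumes a binary mask, with
# mask = 1 meaning "take the new state c[t]".
import torch

def zoneout_step(c_t, c_prev, p_keep, per_dimension=True):
    """One zoneout update: new_c[t] = c[t] * mask + c[t-1] * (1 - mask).

    c_t, c_prev: (batch_size, hidden_size) current and previous cell states.
    p_keep: probability of keeping the previous state c[t-1].
    """
    if per_dimension:
        # mask shape (batch_size, hidden_size): each dimension of the
        # state vector is zoned out independently.
        mask = torch.bernoulli(torch.full_like(c_t, 1.0 - p_keep))
    else:
        # mask shape (batch_size, 1): one value per vector, so the whole
        # state is either kept or updated (broadcasts over dimensions).
        mask = torch.bernoulli(
            torch.full((c_t.size(0), 1), 1.0 - p_keep)
        ).to(c_t)
    return c_t * mask + c_prev * (1.0 - mask)
```

Drawing this mask independently at every timestep of a sequence yields the two shapes from question 3: (length, batch_size) for the per-vector variant versus (length, batch_size, hidden_size) for the per-dimension one.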
bratao commented 4 years ago

Hi @taolei87 ,

  1. I do super-long, almost infinite sequence labeling on legal documents. Almost all RNN networks are prone to overfitting here because my training set is small. On my task, the best result using SRU after an exhaustive grid search over the hyperparameters is an F1 of 0.97. PyTorch's LSTM gets an F1 of 0.97 too, but it is slower. Using haste's LSTM with zoneout I can get an F1 of 0.98 (0.97 without).

2-3. I do not know; I just used whatever haste does.
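
For what it's worth, a sketch of the standard formulation from the zoneout paper: a stochastic binary mask at training time and the expected-value interpolation at inference time. Whether haste matches this exactly is an assumption on my part, not a statement about its code.

```python
# Standard zoneout as described in https://arxiv.org/abs/1606.01305.
# Whether haste implements exactly this is an assumption; check its source.
import torch

def zoneout(c_t, c_prev, p_keep, training):
    """p_keep is the probability of keeping the previous state c[t-1]."""
    if training:
        # stochastic: each unit independently keeps its previous value
        keep = torch.bernoulli(torch.full_like(c_t, p_keep))
        return c_prev * keep + c_t * (1.0 - keep)
    # deterministic at eval: expected value of the stochastic update
    return p_keep * c_prev + (1.0 - p_keep) * c_t
```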