Use small sequences (100). See page 9 in Graves 2014.
This looks like a good tutorial on backpropagation through time (BPTT), which I think is what Graves uses. The tutorial briefly mentions truncating the gradient, although I think Graves does it differently.
Does Graves do it on "micro batches" of 100 or does he do a sliding window? Reread Graves 2014. Check his book. Check the PhD thesis he cites. Check his code.
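The two candidate schemes above can be expressed with one windowing helper. This is a minimal sketch (my own names, not from Graves) where `stride == seq_len` gives non-overlapping "micro batches" and `stride < seq_len` gives a sliding window:

```python
def windows(data, seq_len=100, stride=100):
    """Yield length-seq_len sub-sequences of data.

    stride == seq_len -> non-overlapping chunks ("micro batches").
    stride < seq_len  -> sliding window (stride == 1 shifts by one step).
    A trailing remainder shorter than seq_len is dropped.
    """
    for i in range(0, len(data) - seq_len + 1, stride):
        yield data[i:i + seq_len]
```

For example, a 250-step sequence yields two non-overlapping chunks of 100, but 151 windows with stride 1.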
Perhaps it's as simple as splitting the data into batches and adding an option to let activations persist between sequences or between batches.
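That option could look like the following sketch (hypothetical names; `step_fn` stands in for whatever trains on one sub-sequence and returns the network's final activations). With `persist_state=True` the activations carry across chunks while gradients stop at chunk boundaries, which is the usual truncated-BPTT setup; with `persist_state=False` the state resets each chunk:

```python
def run_chunks(chunks, step_fn, persist_state=True, init_state=None):
    """Run step_fn(chunk, state) over each chunk in order.

    persist_state=True: the state step_fn returns on one chunk is fed
    into the next, so activations persist across chunk boundaries
    (gradients would still be truncated, since each call is separate).
    persist_state=False: state resets to init_state before every chunk.
    """
    state = init_state
    for chunk in chunks:
        state = step_fn(chunk, state)
        if not persist_state:
            state = init_state
    return state
```

In a real framework the returned state would be detached from the graph between calls so only the activations, not the gradients, cross the boundary.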
Janczak, Identification of Nonlinear Systems Using Neural Networks and Polynomial Models (2005), might also have some useful info.
Does skaee mention this in his code or papers?