Hi Serkan. Are you referring to the data loader for LakhNES or something specific to NES-MDB (this repository)?
If the former, yeah this is definitely a bit of a curious decision. However, believe it or not, this is fairly standard practice for training large LMs, since you can get training signal out of an entire minibatch instead of using zero padding and throwing away some parallelism.
In theory, the model can learn to not attend to information before the separator token between songs.
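
For concreteness, here is a rough sketch of the kind of packing being described; the token ids, separator id, and segment length are made up for illustration and are not taken from this codebase.

```python
# Hypothetical illustration (ids and lengths are made up): pack short
# songs into fixed-length training segments instead of zero-padding.
SEP = 0          # assumed id of the separator token between songs
TGT_LEN = 8      # target segment length (tiny, just for the example)

songs = [[5, 6, 7], [3, 4], [9, 9, 9, 9, 2, 2]]

# Concatenate all songs into one long token stream, separated by SEP.
stream = []
for song in songs:
    stream.extend(song)
    stream.append(SEP)

# Chop the stream into full segments; every position carries training
# signal, whereas per-song zero padding would waste part of each batch.
segments = [stream[i:i + TGT_LEN] for i in range(0, len(stream), TGT_LEN)]
print(segments)
# [[5, 6, 7, 0, 3, 4, 0, 9], [9, 9, 9, 2, 2, 0]]
```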
Thanks for your answer. I have another question regarding the same topic. LMShuffledIterator puts different samples (songs) in different batches, so how does this take advantage of Transformer-XL's memory? In this situation, it looks like the memory and the input sequence are built from different songs. Am I wrong?
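
To make the concern concrete, here is a heavily simplified sketch of how an LMShuffledIterator-style loader fills a fixed number of batch slots from a shuffled list of songs; the function name and parameters are invented for illustration and this is not the actual implementation.

```python
import random

def shuffled_stream_batches(songs, bsz=2, bptt=4, sep=0):
    """Very rough sketch (not the real LMShuffledIterator): each of the
    bsz batch slots is a running buffer that gets refilled from a
    shuffled list of songs and chopped into bptt-long chunks."""
    random.shuffle(songs)
    song_iter = iter(songs)
    buffers = [[] for _ in range(bsz)]
    while True:
        # Top up every slot with whole songs (plus a separator) until it
        # has enough tokens for one more chunk.
        for buf in buffers:
            while len(buf) < bptt:
                try:
                    buf.extend(next(song_iter) + [sep])
                except StopIteration:
                    return
        # Emit one chunk per slot.  Transformer-XL's memory for a slot
        # is that slot's previous chunk, so near a song boundary the
        # memory and the current input can come from different songs.
        yield [buf[:bptt] for buf in buffers]
        for buf in buffers:
            del buf[:bptt]
```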
Hi Chris. Congrats on the great work!
I noticed that when the length of a song is shorter than the target length, multiple songs are concatenated into a single sample. I'm aware that there are separator tokens between songs, but don't you think this is somewhat problematic, since the attention module will process two very different songs in a single pass?
As a side question, is there a particular reason why you haven't used PyTorch's data loader?
Thanks in advance.
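
For reference, a PyTorch-native version of the same packing might look roughly like the sketch below, using an `IterableDataset` so `DataLoader` handles the batching; `PackedSongDataset` and its parameters are hypothetical, not from the repo.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class PackedSongDataset(IterableDataset):
    """Illustrative only: stream fixed-length segments of concatenated
    songs so that DataLoader can do the batching."""

    def __init__(self, songs, tgt_len, sep_id=0):
        self.songs = songs        # list of lists of token ids
        self.tgt_len = tgt_len
        self.sep_id = sep_id

    def __iter__(self):
        buf = []
        for song in self.songs:
            buf.extend(song)
            buf.append(self.sep_id)
            while len(buf) >= self.tgt_len:
                seg, buf = buf[:self.tgt_len], buf[self.tgt_len:]
                yield torch.tensor(seg, dtype=torch.long)

songs = [[5, 6, 7], [3, 4], [9, 9, 9, 9, 2, 2]]
loader = DataLoader(PackedSongDataset(songs, tgt_len=4), batch_size=2)
for batch in loader:
    print(batch.shape)  # torch.Size([2, 4]) then torch.Size([1, 4])
```

One caveat: nothing here keeps a given batch slot's consecutive segments in song order across steps, which is what Transformer-XL's segment-level memory relies on, so a custom iterator would likely still be needed for that.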