Hi!
First and foremost, thanks a lot for making xLSTM open-source! This is fantastic!
I want to use xLSTM for next-token prediction, especially on symbolic music datasets.
After reading the paper, I think I am ready to go. I want to train an xLSTM the way GPTs are trained: the full sequence is the input, and the same sequence shifted one position to the left is the target. That training is highly parallel thanks to the causal mask in the multi-head attention.
Now I wonder: would I train xLSTM on similar input/target pairs, i.e., a token sequence in and the shifted token sequence out?
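For concreteness, here is a minimal sketch of the shifted-sequence training I have in mind. The model here is just a placeholder that maps `(batch, seq_len)` token IDs to `(batch, seq_len, vocab_size)` logits; it stands in for an xLSTM language model and is not the actual xlstm API, and the vocabulary/sequence sizes are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 512  # placeholder vocabulary size for a symbolic-music tokenizer

# Stand-in for an xLSTM language model: anything that maps
# (batch, seq_len) token IDs to (batch, seq_len, vocab_size) logits.
model = nn.Sequential(nn.Embedding(VOCAB_SIZE, 64), nn.Linear(64, VOCAB_SIZE))

# Dummy batch of token IDs: (batch, seq_len + 1)
tokens = torch.randint(0, VOCAB_SIZE, (8, 257))

# Teacher forcing with a one-token shift: the model reads tokens[:, :-1]
# and is trained to predict tokens[:, 1:] at every position in parallel.
inputs = tokens[:, :-1]   # (batch, seq_len)
targets = tokens[:, 1:]   # (batch, seq_len)

logits = model(inputs)    # (batch, seq_len, vocab_size)

loss = F.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE),  # (batch * seq_len, vocab_size)
    targets.reshape(-1),             # (batch * seq_len,)
)
loss.backward()
```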
I got the impression from the paper that this parallel training is possible when using only mLSTM blocks, but that it would no longer work once an sLSTM block is introduced.
Is that so?