Preprocessing of Dataset to feed into LSTM

Can you please explain the procedure, or the different steps, for preprocessing data before feeding it to an LSTM? I am working on the paper by Zhuo Chen, "Speaker-Independent Speech Separation With Deep Attractor Network", but I am not able to create batches because each audio file has a different number of frames. So how do you handle variable-length input to an LSTM? I know techniques like sequence padding, but I don't think that would be effective because the difference in the number of frames between files is large.
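To make the padding concern concrete, here is a minimal sketch of one common way to limit wasted padding when lengths differ widely: sort utterances by frame count, batch neighbours together, and pad each batch only to its own longest utterance, keeping a mask so padded frames can be excluded from the loss. This is an assumption for illustration, not the method from the paper or from this repository; the names `bucket_batches` and `pad_batch` and the toy data are hypothetical, and every utterance is assumed to be non-empty.

```python
from typing import List, Tuple

Frame = List[float]      # one feature vector (e.g. one spectrogram frame)
Sequence = List[Frame]   # one utterance: a list of frames


def bucket_batches(seqs: List[Sequence], batch_size: int) -> List[List[int]]:
    """Group sequence indices so each batch holds utterances of similar length."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pad_batch(seqs: List[Sequence], pad_value: float = 0.0
              ) -> Tuple[List[Sequence], List[List[int]], List[int]]:
    """Pad every sequence to the longest one in this batch.

    Returns the padded batch, a 0/1 mask marking real frames, and true lengths.
    Assumes every sequence has at least one frame.
    """
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    feat_dim = len(seqs[0][0])
    padded, mask = [], []
    for s, n in zip(seqs, lengths):
        pad = [[pad_value] * feat_dim for _ in range(max_len - n)]
        padded.append([list(f) for f in s] + pad)
        mask.append([1] * n + [0] * (max_len - n))
    return padded, mask, lengths


# Toy data: six "utterances" with 2-dimensional frames and very different lengths.
frame_counts = [3, 50, 4, 48, 5, 52]
utterances = [[[float(t), 1.0] for t in range(n)] for n in frame_counts]

batches = bucket_batches(utterances, batch_size=3)
for idx in batches:
    padded, mask, lengths = pad_batch([utterances[i] for i in idx])
    print(idx, lengths, "padded to", len(padded[0]))
```

With this grouping the short utterances are padded to 5 frames and the long ones to 52, instead of padding everything to 52. Another option, also only a suggestion, is to cut every utterance into fixed-length chunks so that no padding is needed within a batch.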

aishoot / LSTM_PIT_Speech_Separation
