Today I found that a model I trained on different music generated what sounded like white noise.
My problem appeared to be due to some of the converted wav files (generated in datasets/YourMusicLibrary/wave/) being mono 32-bit PCM audio at 8 kHz, whereas the GRUV conversion functions assume mono 16-bit PCM audio at 8 kHz.
If you find some of your wav files have the wrong bit depth, you can convert them with sox, e.g.:
sox oldfile.wav -b 16 newfile.wav
This might be the cause of the second issue you mentioned.
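To spot the offending files before retraining, a quick scan of the converted directory works; here is a minimal sketch using scipy (the directory path matches the one above, everything else is illustrative):

import os
from scipy.io import wavfile

wav_dir = 'datasets/YourMusicLibrary/wave/'
for name in sorted(os.listdir(wav_dir)):
    if not name.endswith('.wav'):
        continue
    rate, data = wavfile.read(os.path.join(wav_dir, name))
    # GRUV expects mono 16-bit PCM, i.e. a one-dimensional int16 array
    print("%s: %d Hz, dtype=%s, shape=%s" % (name, rate, data.dtype, data.shape))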
The first issue is probably due to over-fitting: your trained model fits the training data well but does not generalize to the validation data. Ideally you want the validation loss to start decreasing during the earlier epochs. Some people have reported that for LSTM networks the validation loss can move up and down unpredictably during training before the optimal minimum is reached.
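One common safeguard is to watch the validation loss and stop training once it stops improving; a rough sketch with Keras' EarlyStopping callback (model, X_train, y_train stand in for whatever your training script builds, and the epoch argument is called nb_epoch in older Keras versions, epochs in newer ones):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=10)  # tolerate a few noisy epochs
model.fit(X_train, y_train,
          validation_split=0.1,   # hold out 10% of the training data for validation
          nb_epoch=2000,
          callbacks=[early_stop])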
@gb96 Thanks for your reply. Have you trained a model which is capable of producing meaningful sound? I re-implemented the code and forgot to normalize the raw audio data. That might be the reason for these two issues.
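For anyone hitting the same problem, the missing step is just the usual centering and scaling of the frequency-domain training tensor; a minimal sketch (the tensor shape and variable names here are illustrative, not GRUV's exact ones), whose inverse is what the generate_from_seed code further down applies to its output:

import numpy as np

# Stand-in for the real tensor of FFT blocks: (n_examples, seq_len, block_size)
X = np.random.randn(10, 40, 2048).astype(np.float32)
data_mean = np.mean(X, axis=(0, 1))
data_variance = np.var(X, axis=(0, 1)) + 1e-8   # epsilon avoids division by zero
X_normalized = (X - data_mean) / data_variance  # generation later multiplies by the variance and adds the mean back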
@nixingyang I have trained models that produce sound (e.g. https://soundcloud.com/gb96/stairway-to-gruv-hd512-epoch48000-loss067-seed3x3).
Have you tried running the audio_unit_test or equivalent? (see https://github.com/MattVitelli/GRUV/blob/master/data_utils/parse_files.py#L190 )
That test verifies the methods for loading/saving sound files, converting between wave and NumPy formats, and converting between time-domain and frequency-domain representations (via the Fast Fourier Transform and its inverse).
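For reference, a stripped-down version of that round-trip check looks roughly like this (a sketch of the idea only, not GRUV's actual test code):

import numpy as np

signal = np.random.randn(44100).astype(np.float32)   # one second of noise at 44.1 kHz
spectrum = np.fft.fft(signal)                         # time domain -> frequency domain
recovered = np.real(np.fft.ifft(spectrum))            # frequency domain -> time domain
assert np.allclose(signal, recovered, atol=1e-5), "FFT round trip is not lossless"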
I have defined a function similar to audio_unit_test and I can confirm that the transformation process is lossless. The audio you shared contains meaningful sound at the beginning; however, the model simply repeats useless sound after that. My prediction does not contain meaningful sound at all. Did you modify the generate_from_seed function, and did you train your model solely on 65 seconds of audio?
Looks like I have made some significant modifications to the generate_from_seed function. The main idea of my changes is to keep a fixed seed-sequence length: new predicted values are appended to the end and the oldest values are deleted from the beginning to maintain a constant length.
import numpy as np

# Extrapolates from a given seed sequence
def generate_from_seed(model, seed, sequence_length, data_variance, data_mean):
    seedSeq = seed.copy()
    output = []
    # The generation algorithm is simple:
    # Step 1 - Given A = [X_0, X_1, ..., X_n], generate X_n+1
    # Step 2 - Append X_n+1 to A and drop X_0, so the seed keeps a constant length
    # Step 3 - Repeat sequence_length times
    for it in xrange(sequence_length):
        seedSeqNew = model.predict(seedSeq)  # Step 1. Generate X_n+1
        # Step 2. Append the last predicted frame to the output
        newSeq = seedSeqNew[0][seedSeqNew.shape[1] - 1]
        output.append(newSeq.copy())
        # Construct the new seedSeq: append the prediction and drop the oldest frame
        newSeq = np.reshape(newSeq, (1, 1, newSeq.shape[0]))
        seedSeq = np.concatenate((seedSeq, newSeq), axis=1)
        seedSeq = np.delete(seedSeq, 0, 1)
    # Finally, post-process the generated sequence so that we have valid frequencies
    # We're essentially just undoing the data centering process
    for i in xrange(len(output)):
        output[i] *= data_variance
        output[i] += data_mean
    return output
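It gets called roughly like this (all of these names are placeholders for whatever your own training script produces):

# seed: array of shape (1, seed_length, block_size) taken from the training data
# X_var / X_mean: the statistics used to normalize the training tensor
generated_blocks = generate_from_seed(model, seed, sequence_length=100,
                                      data_variance=X_var, data_mean=X_mean)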
To answer your question about the training data: I used the first 65 seconds of audio from each channel of a stereo source, for a total of 130 seconds. I did that because the source music had quite distinct sounds in each channel (e.g. guitar notes in one and vocals in the other), and I figured it would be easier to train an LSTM network on the separate sounds rather than on the combined mono version.
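For anyone wanting to reproduce that split, sox can extract each channel and trim it to the first 65 seconds (file names here are just placeholders):

sox stereo_source.wav left_channel.wav remix 1 trim 0 65
sox stereo_source.wav right_channel.wav remix 2 trim 0 65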
Your modification of generate_from_seed is reasonable. From my point of view, the algorithm devised in GRUV is not capable of handling real-world audio signals. Google has released WaveNet, which is probably the state of the art.
Hi,
As the authors used copyrighted songs (Madeon and David Bowie) in the original project, I fed the neural network some other sound datasets instead. I wonder whether anyone has encountered similar issues to those shown below.
BR.