chrisdonahue / wavegan

WaveGAN: Learn to synthesize raw audio with generative adversarial networks
MIT License

Speed and Training Duration Issue for Piano and Drums Datasets - Observation #40

Closed saraalemadi closed 5 years ago

saraalemadi commented 5 years ago

Hi Chris,

I have been testing WaveGAN on the provided datasets, namely Piano and Drums. I have noticed that although the piano dataset has longer audio clips of inconsistent length (specs shown below), training runs very fast on my Titan V setup, reaching 200k training steps in 2 days. However, for the drums dataset, where the audio files are all 1 second long (specs shown below), it takes literally days to fill the buffer and train (2 days for only 16 steps). Hence I was wondering whether this is an issue in the v2 version of the code, and how it can be resolved?

Thanks, Sara

[Screenshots: file specs for the piano and drums datasets]
chrisdonahue commented 5 years ago

Sorry, this information really should be in the README.

The WaveGAN training script extracts random "slices" from the dataset audio for training. However, the ideal slicing behavior for datasets of long audio files (e.g. music) is very different from that of datasets with short audio files (e.g. drum one shots). The WaveGAN training script does not know the length of all of the audio files specified during training, so you have to configure the slicing behavior manually.

If you add the arg --data_first_slice to your drum training script, I believe the issue should be resolved. This instructs the data loader to only use the leftmost "slice" from each audio file (appropriate for drum one-shots), and pads the end accordingly if there is not enough data for a complete slice.
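For reference, a minimal sketch of what the invocation might look like (assuming the v2 train_wavegan.py entry point; the output and data paths here are placeholders):

```
# Sketch of a drum training run; ./train_drums and ./data/drums are placeholders.
# --data_first_slice tells the loader to take only the first slice of each file.
python train_wavegan.py train ./train_drums \
  --data_dir ./data/drums \
  --data_first_slice
```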

polisen commented 5 years ago

What constitutes a slice? Is it an arbitrary number of samples set by you, or possibly set automatically by the script? My problem has been experimenting with drum samples, each 33 kB (mono, 16-bit, 16 kHz, 1000 ms), and finding that even though almost half of them fill the whole second (such as a longer ride-like hit, versus a short ~10 ms rimshot), the model prefers generating shorter samples. Do you think that could be resolved by training exclusively on "longer" 1000 ms samples without --data_first_slice?

saraalemadi commented 5 years ago

Hi @chrisdonahue,

Thanks for the reply. I tried using --data_first_slice, but it had no effect on either the speed or the duration. What I ended up doing is feeding in a single very long audio file of dataset x (similar to the drums) instead of preprocessing it into chunks beforehand. This worked perfectly in terms of speeding up the process (a similar scenario to the piano dataset).
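(For anyone trying the same workaround, one possible way to build such a file is to concatenate the one-shot clips beforehand, e.g. with sox; a sketch assuming sox is installed and all clips share the same sample rate and channel count:)

```
# Concatenate every clip in ./drums into one long file; paths are placeholders.
# Given multiple inputs and one output, sox joins the inputs end to end.
sox ./drums/*.wav ./all_drums.wav
```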

-Sara

chrisdonahue commented 5 years ago

@polisen If you find the length insufficient, you could increase the size of the model outputs with --data_slice_len=32768 or --data_slice_len=65536. These would result in ~2 and ~4 second samples respectively, although they greatly increase the model size and training time. I would recommend still using --data_first_slice even if you change the model size.
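As a sketch (same placeholder paths as above, assuming the v2 train_wavegan.py entry point):

```
# ~2 s outputs at 16 kHz: 32768 / 16000 ≈ 2.05 s (65536 / 16000 ≈ 4.1 s).
python train_wavegan.py train ./train_drums \
  --data_dir ./data/drums \
  --data_slice_len 32768 \
  --data_first_slice
```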

@saraalemadi Glad to hear that you resolved it. Not sure why --data_first_slice wasn't working. If your audio samples are shorter than the training window (i.e. less than 16384 samples), you also need to use --data_pad_end along with --data_first_slice. The problem with the approach you mentioned is that the model might be seeing snippets of two drum sounds in a single training example, though this might not be a problem in practice.
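Put together, a run on short one-shots would look something like this (again a sketch with placeholder paths):

```
# For one-shots shorter than the 16384-sample training window, pad each
# first slice to the full window length with --data_pad_end.
python train_wavegan.py train ./train_drums \
  --data_dir ./data/drums \
  --data_first_slice \
  --data_pad_end
```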

seimipark commented 5 years ago

Hi @chrisdonahue,

Approximately how long should it take to train on the provided piano dataset? Thanks!

-Seimi

chrisdonahue commented 5 years ago

@seimipark It should start producing pretty reasonable results after only 10k steps (a few hours on GPU). The model from the paper was trained for 200k steps, which took a little over 2 days.