chrisdonahue / wavegan

WaveGAN: Learn to synthesize raw audio with generative adversarial networks
MIT License

Not an issue, just sharing results of my exploration. #52

Open go-dustin opened 5 years ago

go-dustin commented 5 years ago

I trained a model using the NIN Ghosts album. I was aiming for an aggressive wall-of-noise sound when I chose Ghosts, and I think the model learned exactly what I had hoped it would. This is an interesting approach to sound design; it's almost like sampling (sound selection), but the outcome is more chaotic. There are some sounds I would never have thought up and others that would have taken a very long time to program. https://archive.org/details/nineinchnails_ghosts_I_IV

The raw audio wasn't great, but after some sculpting using compressors, reverb, EQ, distortion, rhythmic tools (gates, filters, etc.), and synthesis, I got some interesting sounds. For this demo, I'm using 5 sounds that are either 1 or 4 seconds long (I trained two different models). I rarely have more than 2 sounds playing at once. The kick drum was created using a VST; that's not coming from the model. https://soundcloud.com/dustin-williams-25/gan-ghosts-sound-design-demo

Here are the raw, unprocessed WAVs. I'm going to make some changes and retrain a new model. I want to see what I can do to improve the sound quality. https://drive.google.com/open?id=1PqoxnyIFmutJsyD6KGu_GqIebP4yQkGV

go-dustin commented 5 years ago

When I did my first test it didn't occur to me that the model was less than 1% trained. Still impressive results from a very early prototype. :D

haideraltahan commented 5 years ago

Nice work! How long did it take you to train, and on what hardware? @go-dustin I am trying to train on 300 one-second speech recordings, but the process seems very slow on a GTX 1070.

go-dustin commented 5 years ago

Thanks! To be honest, it was more about the sound design than the source material. I haven't gotten the raw waves to the quality level I want yet.

I used a snapshot that was at 3k cycles. This was running on a Tesla K80 in Google Cloud. My times won't be comparable to yours since that depends on the size of the training data and the GPU (cores/memory). I used about 1.6 hours' worth of data. According to this page, the difference in speed between a K80 and a 1070 is substantial: 0.33 TFLOPS vs. 1.87+ TFLOPS.

I tried doing some speech training as well. I don't think you'll get anything coherent out of this. I personally like the sounds I was getting and plan to use them. I just need to figure out how to increase the sample rate. The 16kHz rate isn't very useful to me since it doesn't have the crunchy sound you typically get from downsampling.

chrisdonahue commented 5 years ago

Hey dude this is really awesome. Do you mind if I use this in future presentations as an example of humans employing generative models to assist in music production?

Have you tried training on 44.1kHz clips? You'll only be able to get ~1.48s clips using the maximum length (65536), but it might be more useful for you. If you know what length you would like to create, I could help you modify the code to accomplish this.
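For a quick sense of the trade-off, here is a minimal Python sketch of the arithmetic (65536 is the maximum slice length mentioned above; the sample rates are just examples):

```python
# Clip duration produced by WaveGAN's maximum slice length at a few sample rates.
MAX_SLICE_LEN = 65536  # maximum value of --data_slice_len

for sample_rate in (16000, 32000, 44100):
    seconds = MAX_SLICE_LEN / sample_rate
    print(f"{sample_rate} Hz -> {seconds:.2f} s per clip")

# 16000 Hz -> 4.10 s per clip
# 32000 Hz -> 2.05 s per clip
# 44100 Hz -> 1.49 s per clip
```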

libby-h commented 4 years ago

Hi @go-dustin and @chrisdonahue, this is super cool. To add to this, here's what I've been doing.

I've been playing around with this model at 44kHz, trained on a dataset of 4-second snippets of Gregorian chants, 234 MB in total (2,413 items). I started training it almost 2 days ago on an RTX Titan. Here are my samples right now: https://drive.google.com/open?id=11Ycww5u_L4cT4vfqHdWAZeT-ibT7wcva

Next I'm planning to alter the parameters of the model on a small amount of data and benchmark which model overfits best. Then I'll go back to training/testing/validating on a large dataset of many different chants from across the world (making sure it doesn't overfit).

chrisdonahue commented 4 years ago

Yeah!! These rock, @libby-h. Always love hearing audio samples.

Can you tell me what your data looks like (e.g. number of files, length of each)? I can maybe suggest some different data loader configurations that might work better. Configuring these parameters tends to be a little opaque (my bad).

How do you plan on measuring which model(s) are overfitting? I don't know offhand of an easy way to measure this with a GAN. Let me know if you have ideas there.

libby-h commented 4 years ago

Hey @chrisdonahue great to hear from you!

The data file I'm working with at the moment is this one: https://drive.google.com/open?id=124YJhlZoQQfO959J0dWnK9vOX8zP6y-x (1,413 items totalling 137.0 MB, 4 seconds each). I'm currently working on building the final, larger dataset of many different types of chant, which will be considerably bigger. If I can gain an intuition now by training/testing on this linked dataset, then hopefully I'll understand how to move to the larger one. Any tips you can give for this dataset, or general info on the data loader configs, would be much welcomed! (I can share my progress in return if that's interesting.) I would love to pick up some of the voices from the linked dataset; I haven't been able to yet!

By the way, did you ever modify the code for go-dustin to create longer clips at 44kHz? I would love to get hold of that if it's possible (hopefully my GPU will handle a bigger model).

For overfitting, I had two ideas. Bear in mind, though, that I'm an artist working with AI and not an AI dev, so maybe these won't work. The first idea was to train the model for a long enough time on a small-ish dataset of two sounds to force the GAN to replicate only one of the sounds from the dataset. Then I'd assume it had overfit. Obviously 'long enough time' would need to be discovered.

The other idea was to train on a small-ish dataset of similar sounds (my 'train' data), and then to continue training on another dataset of the same size containing similar but not identical sounds (my 'test' data). I'd then watch the convergence params to see if they start going back up considerably (or to a lesser extent) when I'm training on the 'test' data. I'd re-run this a few times to get a sense of what the model is responding to for different training durations. The models where the convergence params went up the most I'd assume to be the most overfit.

In any case, any tips for working with this model would be super appreciated! In the end, I'd love to be able to navigate the latent space live and have strange new sounds morph from one type to another as I go. Dream scenario!

Thanks!

chrisdonahue commented 4 years ago

Hi @libby-h

The linked dataset will work fine with the default parameters of the training script. You might consider converting all of them to WAVs and using --data_fast_wav, but I'm not sure how much of a speed difference this will make.

I did not get around to modifying the code to handle longer clips. One thing you can try is lowering the sample rate a bit and using the max length of --data_slice_len 65536. With a sample rate of ~32k, that will at least get you two seconds; the GAN training will probably result in more distortion anyway, so the downsampling might not hurt too much.
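Combining those two suggestions, a hypothetical invocation might look like 'python train_wavegan.py train ./train_32k --data_dir ./chant_wavs --data_fast_wav --data_slice_len 65536 --data_sample_rate 32000' (the ./train_32k and ./chant_wavs paths are just placeholders, and --data_fast_wav assumes the clips have already been converted to WAV).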

Ah I see what you're saying. I haven't tried the procedure you're suggesting. I imagine that training a GAN with the most training data possible will usually produce the best results overall rather than initially overfitting to a smaller subset.

Morphing through the latent space can be a bit tricky since the clips begin and end abruptly. One thing you could do is identify looping portions (e.g., sustained moments) of the training data and train on that. Then, the GAN should also learn to produce looping segments, and you can hopefully smoothly fade between latent vectors.

Another thing to do is to turn the clips from the GAN into grains with envelopes and overlap them. I haven't tried this, but for the chant music you're working with it could be a cool effect.
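If it helps, here's a rough numpy sketch of that grain idea, just as a starting point: it assumes the generated clips are already loaded as mono float32 arrays (e.g. via soundfile), and the grain/hop sizes are arbitrary.

```python
import numpy as np

def granulate(clips, grain_len=8192, hop=2048, sr=32000, out_len_s=10.0, seed=0):
    """Chop GAN-generated clips into Hann-windowed grains and overlap-add them
    at a fixed hop, building a longer, smoothly evolving texture.
    Each clip must be at least grain_len samples long."""
    rng = np.random.default_rng(seed)
    out = np.zeros(int(out_len_s * sr), dtype=np.float32)
    window = np.hanning(grain_len).astype(np.float32)
    pos = 0
    while pos + grain_len <= len(out):
        clip = clips[rng.integers(len(clips))]           # pick a random generated clip
        start = rng.integers(0, max(1, len(clip) - grain_len))
        grain = clip[start:start + grain_len] * window   # envelope the grain
        out[pos:pos + grain_len] += grain                # overlap-add into the output
        pos += hop
    return out / (np.max(np.abs(out)) + 1e-9)            # normalize to avoid clipping
```

You could then write the result out with something like soundfile.write('texture.wav', out, sr) and layer it however you like.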

Good luck! Looking forward to hearing more results :)

libby-h commented 4 years ago

@chrisdonahue Thanks so much for this.

I just realised the clips I sent, generated by WaveGAN, were actually still 16k; I must have done something wrong when changing the number of samples. Will try again with 32k as you say.

Thanks for the tips with sustained moments and grains+envelopes. Lots to play with :)

libby-h commented 4 years ago

@chrisdonahue Hmm, I just set the model off again with 'python train_wavegan.py train ./train_44k_test --data_dir ./gregorian_chant_only --data_first_slice --data_slice_len 65536 --data_sample_rate 44100' on the dataset I sent previously, assuming that it would create outputs at 44kHz. But the preview output is generating samples at 16kHz at 256kbps. Where am I going wrong with the parser arguments?

libby-h commented 4 years ago

@chrisdonahue sorted it. Also read your paper which was super useful in general. Will keep you updated with how it goes.

libby-h commented 4 years ago

Hi @chrisdonahue, sharing more results: https://drive.google.com/open?id=1MFQEvyPTjLRgMzrmzsUYyrEPCKjxwXFs 32k sample rate, after around 12,800 iterations, using the default loaders. Dataset of 10,795 items, totalling 1.3 GB (all 4-second clips of Gregorian chanting).

It's really nice how I can hear different voices coming through in the generated clips now. Very haunting. I'm pleased so far. I read in your paper that on 5.3 hours of data (numbers 0-9) you trained for 2000k iterations, so I'll keep mine going for longer too and see what comes out next week. It's a larger dataset and a smaller GPU, so it's taking longer.

cinningbao commented 3 years ago

@mattjwarren and I have been training an engine with several thousand drum machine sounds, with a view to building an interface for the engine which 'makes sense' and provides a few ways to traverse the data, effectively generating audio 'morphs' from it. The interface could also be used on GAN engines to morph pictures. Still in the early stages, and the training might go through a few iterations to improve the quality.

A few drum morphs and a beat constructed from a few other morphs are here: https://drive.google.com/drive/folders/1ETL7FZe-desY2ugQ9MSdNu7cj8Id2HdI?usp=sharing