chrisdonahue / wavegan

WaveGAN: Learn to synthesize raw audio with generative adversarial networks

How to set up? #51

Closed: go-dustin closed this 5 years ago

go-dustin commented 5 years ago

This is a really interesting project! I'm learning how to use TensorFlow from a data engineer's perspective. I'm mostly focused on scaling up training and on how to deploy models to Google ML Engine for serving. So please forgive me if I'm missing anything obvious.

I've been trying to run it for the past few days, but the results have been mixed. My goal isn't to produce anything realistic; quite the opposite. I want machine-generated sounds that are unique to the model. I really like challenging, experimental sounds. Ultimately I want to produce some waveforms that I can load into a wavetable synth. I'll use the synth engine to mold them into something more refined (pads, drones, FX hits, etc.).

On the first attempt, I used the default 1 sec length and that gave some interesting results. I fed it 1400 random songs (thanks, Internet Archive!) and after a few thousand (epochs?) it started producing proto-sounds. So I decided to bump the length up to 4 secs (then back down to 2). I initially used 3000 songs, and that experiment didn't seem to go well: I let the training run for 1600+ cycles, but the model produced little more than a steady tone, nothing interesting.

On my second attempt, I used 32 songs that are more abstract and atmospheric: tonal and a little rhythmic, but mostly just interesting sounds. After about 800 (epochs/cycles?), I tested the model and it produced the same steady tone as before. I'm wondering if perhaps I didn't let it train long enough before checking the model. I'm a bit unclear on how best to use this script, so I wanted to kill it as quickly as possible and fix my config. That might be a mistake due to a lack of understanding.

```
python wavegan/train_wavegan.py train ./train \
  --data_dir /home/jupyter/ghosts \
  --data_normalize \
  --wavegan_genr_pp \
  --data_slice_len 65536 \
  --data_overlap_ratio .2 \
  --train_save_secs 300 \
  --wavegan_batchnorm
```

I'm running a third attempt now. Same 32 songs, but only 2 seconds this time:

```
python wavegan/train_wavegan.py train ./train \
  --data_dir /home/jupyter/ghosts \
  --data_normalize \
  --data_slice_len 32768 \
  --data_overlap_ratio .2 \
  --train_save_secs 300
```

I have a few questions:

In your example you're using a much more focused set (person talking, drum hits, pianos). What are your thoughts on processing long, complex sounds like a song, with the assumption that the model doesn't have to produce anything that would pass as human-made?

In your example data you have directories for test/train/valid. When I try to create that structure, it gives me error messages and crashes. As far as I can tell, it wants data_dir to contain wav/mp3 files. So does the script break the long wav/mp3 files into slices and then split those into train/test/valid sets in memory?

How many epochs/cycles does this script run for? I let it run for a couple of days on my first run and I killed it when I thought I had confirmed it worked.

I read in one of the other issues: "you will likely need to reduce the value of dim_mul or train_batch_size to ensure that the model still fits into memory." The Google DataLab VM instance I'm using has 8 CPUs/30 GB RAM/1x K80 GPU, and I'm not getting any complaints/errors about memory. Is it OK to assume that this configuration works, even when generating longer sounds (2/4 seconds)?

The last issue is a technical one. It looks like this script is hard-coded to CUDA 9. Any idea what I'd need to do to get it working on the latest drivers? In DataLab I lose the ability to access TensorBoard when I use an older version of CUDA.

go-dustin commented 5 years ago

After reading the paper, I see that it will take a while to run. At the rate I'm going, probably about two weeks.

chrisdonahue commented 5 years ago

> In your example you're using a much more focused set (person talking, drum hits, pianos). What are your thoughts on processing long, complex sounds like a song, with the assumption that the model doesn't have to produce anything that would pass as human-made?

It is generally much more challenging. I have some examples of fairly homogeneous classical/jazz piano music on the examples website: http://chrisdonahue.com/wavegan_examples/ . However, content with more timbral variation will be harder still. As a very weak analogy, it was much harder to get GANs working on ImageNet than on MNIST.


> In your example data you have directories for test/train/valid. When I try to create that structure, it gives me error messages and crashes. As far as I can tell, it wants data_dir to contain wav/mp3 files. So does the script break the long wav/mp3 files into slices and then split those into train/test/valid sets in memory?

It will split longer files into slices if you configure it to do so. See the data considerations section: https://github.com/chrisdonahue/wavegan#data-considerations
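
For example, a sketch reusing the slicing flags from your own commands (the dataset path is a placeholder, and the seconds-per-slice arithmetic assumes the 16 kHz default sample rate):

```
# Cut long files into ~4 s slices (65536 samples at 16 kHz; 32768 is ~2 s,
# 16384 is ~1 s) with 20% overlap between consecutive slices.
python wavegan/train_wavegan.py train ./train \
  --data_dir /path/to/long_wavs \
  --data_slice_len 65536 \
  --data_overlap_ratio 0.2
```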

> How many epochs/cycles does this script run for? I let it run for a couple of days on my first run and I killed it when I thought I had confirmed it worked.

It will run forever. I usually ended up killing it after a fixed number of steps.
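
If you want to automate the cutoff, here's a rough sketch (this assumes TensorFlow's usual `model.ckpt-<step>` checkpoint files in your train dir; the target step count is arbitrary):

```
# Poll the train dir and kill training once the newest checkpoint
# passes a target step count. Adjust TARGET_STEPS and paths to taste.
TARGET_STEPS=50000
while true; do
  STEP=$(ls train/model.ckpt-*.index 2>/dev/null \
    | sed 's/.*ckpt-\([0-9]*\).*/\1/' | sort -n | tail -1)
  if [ -n "$STEP" ] && [ "$STEP" -ge "$TARGET_STEPS" ]; then
    pkill -f train_wavegan.py
    break
  fi
  sleep 60
done
```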

> I read in one of the other issues: "you will likely need to reduce the value of dim_mul or train_batch_size to ensure that the model still fits into memory." The Google DataLab VM instance I'm using has 8 CPUs/30 GB RAM/1x K80 GPU, and I'm not getting any complaints/errors about memory. Is it OK to assume that this configuration works, even when generating longer sounds (2/4 seconds)?

Yes, this is fine as long as it doesn't crash.
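
If you want to double-check headroom, standard NVIDIA tooling (nothing specific to this repo) shows live GPU memory while training runs:

```
# Watch GPU memory/utilization from another shell. Note that TF 1.x
# reserves most GPU memory up front by default, so usage will look
# near-full either way; the absence of OOM errors is the real signal.
watch -n 5 nvidia-smi
```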

> The last issue is a technical one. It looks like this script is hard-coded to CUDA 9. Any idea what I'd need to do to get it working on the latest drivers? In DataLab I lose the ability to access TensorBoard when I use an older version of CUDA.

I don't think it's hard-coded to CUDA 9 in any particular way. It should work with TensorFlow 1.14.0, which uses CUDA 10 by default, though it may spit out a ton of deprecation warnings.
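
To sanity-check which build you end up with, something like this (standard TF 1.x APIs) should work:

```
# Print the TF version and whether it links CUDA / actually sees a GPU.
python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_built_with_cuda()); print(tf.test.is_gpu_available())"
```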