facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Watermark model slow training (cross-posted from facebookresearch/audioseal) #484

Open christianc102 opened 3 months ago

christianc102 commented 3 months ago

Hi!

(This was cross-posted at facebookresearch/audioseal, but I wanted to also post it here for visibility--thanks!)

Thanks so much for the helpful training code and documentation. Apologies in advance for the naive question--I'm pretty new to machine learning.

I'm trying to train my own watermarking model at 48kHz with my own dataset on an H100 node with 8 GPUs (H100 80GB HBM3) on a remote SLURM cluster, but as I scale the batch size the training speed appears to drop proportionally. There also appears to be an unexpected behavior where I specify dataset.batch_size=k but the submitted config (logged by wandb) shows dataset.batch_size=k/8.

As an example, I ran experiments setting dataset.batch_size=8, which became dataset.batch_size=1, yielding a max training speed of about 1.67 steps/second with GPU utilization averaging around 25%. When I set dataset.batch_size=128 (which became dataset.batch_size=16), training speed dropped to around 0.3 steps/second. Based on these results, it seems to me that parallelization isn't working the way it should?

I've tried preprocessing my dataset to one-second clips and removing some of the augmentations (even running an experiment with only noise augmentations) to try to increase GPU utilization, but nothing I've tried has improved the training speed.

Is this to be expected? Roughly how long did the original AudioSeal model take to train, using what amount of compute?

Thank you so much!

hadyelsahar commented 3 months ago

Hi! Can you paste your run command here so I can make sure you are running it correctly?

As an example, I ran experiments setting dataset.batch_size=8, which became dataset.batch_size=1, yielding a max training speed of about 1.67 steps/second with GPU utilization averaging around 25%. When I set dataset.batch_size=128 (which became dataset.batch_size=16), training speed dropped to around 0.3 steps/second. Based on these results, it seems to me that parallelization isn't working the way it should?

This seems normal to me. The batch_size you pass as an argument is the effective batch size; it is divided internally across all GPUs. If I understand correctly, it is normal for steps/sec to drop when you increase the batch size, because each step now has more samples to compute.
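With the numbers reported above, the throughput in samples per second actually goes up with the larger batch, which is what matters for training time per amount of data seen. A quick back-of-the-envelope check in plain Python (illustrative only, numbers taken from the message above):

world_size = 8  # GPUs on the node

for effective_bsz, steps_per_sec in [(8, 1.67), (128, 0.3)]:
    per_gpu_bsz = effective_bsz // world_size        # what wandb logs as dataset.batch_size
    samples_per_sec = effective_bsz * steps_per_sec  # audio clips processed per second
    print(f"batch_size={effective_bsz}: {per_gpu_bsz} per GPU, {samples_per_sec:.1f} samples/s")

# batch_size=8:   1 per GPU, 13.4 samples/s
# batch_size=128: 16 per GPU, 38.4 samples/s -> the larger batch processes roughly 3x more audio per second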

Have you tried plotting convergence curves for the different batch sizes?

Roughly how long did the original AudioSeal model take to train

The original training took 3-10 days to obtain good results on a 4-GPU machine, but after 20-40 hours you could already see it converging.

Comedian1926 commented 3 months ago

@hadyelsahar Hello, thank you very much for your work. Are there more details about the training? The 400k-hour VoxPopuli dataset is too large for me; I hope to verify the watermarking effect on a smaller dataset. In fact, I have trained for about 10 epochs on a 200-hour dataset, but there is no effect. So I would like to know the minimum effective dataset size in terms of hours. Thank you again.

hadyelsahar commented 3 months ago

but there is no effect.

It would help a lot if you could share your evaluation metrics; you can find them in the Dora log directory, in ./history.json.

The 400k-hour VoxPopuli dataset is too large for me.

Note that in AudioCraft an epoch is just a predefined number of steps, not a pass over the whole training set; we set the default to 2000 steps. So the size of your training data basically doesn't affect the time taken per epoch, it only affects the pool of samples that your training draws from.

https://github.com/facebookresearch/audiocraft/blob/adf0b04a4452f171970028fcf80f101dd5e26e19/config/solver/watermark/default.yaml#L193

We don't use the full 400k hours of VoxPopuli; we select 5k hours, with which you can reach good performance in about 80-100 epochs. We let our runs go until 200-300 epochs.
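As a back-of-the-envelope illustration of what an epoch covers (the 2000 steps per epoch is the default linked above; the effective batch size and one-second excerpts are assumptions here -- substitute the values from your own config):

updates_per_epoch = 2000    # default from config/solver/watermark/default.yaml
effective_batch_size = 32   # assumption -- use your own dataset.batch_size
segment_seconds = 1.0       # assumption -- use your own segment duration

hours_per_epoch = updates_per_epoch * effective_batch_size * segment_seconds / 3600
print(f"~{hours_per_epoch:.1f} h of audio per epoch")     # ~17.8 h
print(f"~{hours_per_epoch * 200:.0f} h over 200 epochs")  # ~3556 h, drawn from the 5k-hour pool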

pierrefdz commented 3 months ago

I think the training could indeed be made a bit more efficient, but we have not focused on it that much...

Hello, thank you very much for your work. Are there more details about the training? The 400k-hour VoxPopuli dataset is too large for me; I hope to verify the watermarking effect on a smaller dataset. In fact, I have trained for about 10 epochs on a 200-hour dataset, but there is no effect. So I would like to know the minimum effective dataset size in terms of hours. Thank you again.

@Comedian1926, if you want to study watermark training at a smaller scale, what you can do is focus on a subset of the augmentations and remove the compression ones -- for those, we need to transfer to the CPU, save in the new format, load, and transfer back to the GPU, so they take a lot of time.

What we observed during training is that the detection (and localization) accuracy increases very fast, within 10 epochs or even fewer. For the rest of the epochs, all metrics improve at a steady rate (notably the audio quality metrics). Here is an example of some of the validation metrics (each point is 10 epochs, since we computed validation metrics every 10 epochs -- so 20 means 200 epochs): [plot of validation metrics over training]

Comedian1926 commented 3 months ago

@pierrefdz @hadyelsahar Thank you very much for your reply; it is very useful to me. In my previous training, the d_loss mostly stayed between 1.98 and 2, and I feel it did not converge. I am currently restarting the training and will share the logs with you, hoping to succeed. Thank you again for your work~

Comedian1926 commented 3 months ago

@hadyelsahar @pierrefdz Hello, I've trained another model, but it still doesn't seem to be converging. My hardware is 2x RTX 3090, and the training data is the VoxPopuli 10k English subset. I've also experimented with adjusting the learning rate and batch size on a single card, but it didn't yield satisfactory results. Here are the hyperparameters and logs for the training: history.json, hyperparams.json, spec_7 (2).pdf. I appreciate any advice you can offer. Thank you in advance.

zjcqn commented 3 weeks ago

I found that the PESQ computation in audiocraft/solvers/watermark.py is very time-consuming, so I skipped it.
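For reference, here is a minimal sketch (not AudioCraft's actual metric code) of how such an expensive metric can be gated behind a flag and restricted to a few clips instead of being removed entirely, assuming the pesq PyPI package and torchaudio for resampling:

from typing import Optional

import torch
import torchaudio
from pesq import pesq  # ITU-T P.862; only supports 8 kHz ('nb') and 16 kHz ('wb') input

def maybe_pesq(ref: torch.Tensor, deg: torch.Tensor, sample_rate: int,
               enabled: bool = True, max_items: int = 4) -> Optional[float]:
    """Mean PESQ over at most `max_items` clips of shape [B, T], or None if disabled."""
    if not enabled:
        return None
    ref, deg = ref.detach().cpu(), deg.detach().cpu()
    if sample_rate != 16000:  # PESQ needs 16 kHz input, so 48 kHz audio must be resampled
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        ref, deg = resampler(ref), resampler(deg)
    scores = [pesq(16000, r.numpy(), d.numpy(), "wb")
              for r, d in zip(ref[:max_items], deg[:max_items])]
    return sum(scores) / len(scores)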

zjcqn commented 2 weeks ago

@hadyelsahar @pierrefdz Hello, I've trained another model, but it still doesn't seem to be converging. My hardware is 2x RTX 3090, and the training data is the VoxPopuli 10k English subset. I've also experimented with adjusting the learning rate and batch size on a single card, but it didn't yield satisfactory results. Here are the hyperparameters and logs for the training: history.json, hyperparams.json, spec_7 (2).pdf. I appreciate any advice you can offer. Thank you in advance.

I have encountered a similar issue where my training results are not converging. Specifically, d_loss remains close to 2.0, and wm_mc_identity is around 0.693, indicating an accuracy of only 0.5 and rendering the detector completely ineffective. Even removing all augmentations does not resolve the problem.
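For context, 0.693 is ln(2): the binary cross-entropy of a detector that predicts 0.5 for every sample, i.e. chance level, so a loss stuck there means the detector has learned nothing yet. A quick check in plain Python (unrelated to AudioCraft internals):

import math

p = 0.5  # a detector that outputs 0.5 everywhere
bce = -(0.5 * math.log(p) + 0.5 * math.log(1 - p))  # balanced watermarked / clean samples
print(bce, math.log(2))  # both ~0.6931 -> chance-level detection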

Has anyone found a suitable solution? I would be extremely grateful for any useful suggestions.

pierrefdz commented 2 weeks ago

I'd suggest first trying to make things work without any perceptual losses, and seeing if you manage to make the bit accuracy and the detection go up. Something like:

# all the defaults from compression
losses:
  adv: 0.0
  feat: 0.0
  l1: 0.0
  mel: 0.0
  msspec: 0.0
  sisnr: 0.0
  wm_detection: 1.0 # loss for first 2 bits cannot be 0
  wm_mb: 1.0  # loss for the rest of the bits (wm message)
  tf_loudnessratio: 0.0

Then add the rest little by little and adapt the optimization parameters to ensure that the training is able to start. (Sometimes the training stays frozen depending on the hyperparameters; if you see that it does not take off, you can cut the run very quickly.)
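As a rough sketch of how one might automate that early cut, reading the history.json mentioned earlier in this thread: the stage and metric key names below ("valid", "wm_mb_bit_acc") are assumptions, so check the keys actually present in your own file.

import json

def has_taken_off(history_path: str = "./history.json", min_epochs: int = 10,
                  chance: float = 0.5, margin: float = 0.05) -> bool:
    """Return False if, after `min_epochs` epochs, bit accuracy is still near chance."""
    with open(history_path) as f:
        history = json.load(f)  # assumed: one entry per finished epoch
    if len(history) < min_epochs:
        return True  # too early to judge
    valid = history[-1].get("valid", {})  # assumed stage name
    bit_acc = valid.get("wm_mb_bit_acc")  # hypothetical key -- check your own history.json
    return bit_acc is None or bit_acc > chance + margin

if not has_taken_off():
    print("Bit accuracy still at chance -- consider cutting this run and changing hyperparameters.")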

zjcqn commented 2 weeks ago

I'd suggest first trying to make things work without any perceptual losses, and seeing if you manage to make the bit accuracy and the detection go up. Something like:

# all the defaults from compression
losses:
  adv: 0.0
  feat: 0.0
  l1: 0.0
  mel: 0.0
  msspec: 0.0
  sisnr: 0.0
  wm_detection: 1.0 # loss for first 2 bits cannot be 0
  wm_mb: 1.0  # loss for the rest of the bits (wm message)
  tf_loudnessratio: 0.0

Then add the rest little by little and adapt the optimization parameters to ensure that the training is able to start. (Sometimes the training stays frozen depending on the hyperparameters; if you see that it does not take off, you can cut the run very quickly.)

Thank you for your detailed reply. Your advice was very useful in diagnosing the issue, which now seems to be resolved. I suspect that the main problem was insufficient training time, as wm_mb_identity only started to show a noticeable decline after 25 epochs. I also made a small modification, setting the temperature of the wm_mb loss to 1, which led to a substantial improvement in the convergence rate. Watermark detection is now yielding good results, and the multi-bit message decoding continues to improve. I am very grateful for your assistance.
