JusperLee / Apollo

Music restoration method that converts lossy MP3-compressed music to lossless music.

Questions and suggestions about training #6

Open · leapway opened this issue 1 week ago

leapway commented 1 week ago

Thank you for bringing us this great audio restoration project. I'm training a vocal stem enhancement model on a single RTX 3090 in bfloat16 precision (which roughly halved the VRAM usage). In the paper you used 8 GPUs to train the model. How long did it take to train the model? Will I have to train roughly 8 times longer than you did because I'm using a single GPU? Also, I think you forgot to remove your wandb API key from train.py.
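
A minimal sketch of one way to keep the key out of the file, reading it from the environment variable that the wandb client itself checks (not a claim about how train.py is currently written):

```python
# Sketch: pull the API key from the environment instead of hardcoding it in
# train.py. WANDB_API_KEY is the variable the wandb client looks for anyway.
import os
import wandb

wandb.login(key=os.environ["WANDB_API_KEY"])
```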

I'd also recommend adding torch.cuda.empty_cache() in audio_litmodule.py, right after self.validation_step_outputs.clear(): after validation, VRAM usage stayed higher than it was during the first epoch.
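
Something like this, assuming the cleanup lives in the standard PyTorch Lightning hook (the class name and surrounding logic are placeholders; the exact method in audio_litmodule.py may differ):

```python
# Sketch of the suggested change, not the actual contents of audio_litmodule.py.
import torch
import pytorch_lightning as pl

class AudioLitModule(pl.LightningModule):
    def on_validation_epoch_end(self):
        # ... aggregate and log validation metrics here ...
        self.validation_step_outputs.clear()  # drop the stored step outputs
        torch.cuda.empty_cache()              # release cached blocks so training resumes clean
```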

sipie800 commented 1 week ago

Thank you for the work. It's great and it fills a real gap.

I want to restore some Chinese songs. As far as I know, Apollo's training data consists of English song stems and mixes. Would the model be biased against East Asian songs? I tried inference and could hear a slight bias in the output audio. Generally, Western mixing production is more powerful and dynamic than Asian production, so the repaired Chinese songs end up with extra energy in their rhythm parts. Maybe the restoration could be gentler for East Asian songs.

I'm downloading the original training data and will study it. My question: are stems necessary in the training data if the goal is just repairing final mixed songs? Can I train on mixtures alone?

As for hyperparameters, could you offer some advice on fine-tuning from your checkpoint? It's fine-tuning, but possibly a deep one. Any advice on training recipes would be appreciated. I may also develop more augmentations for song restoration, but I have little experience training audio models.

JusperLee commented 1 week ago

> Thank you for bringing us this great audio restoration project. I'm training a vocal stem enhancement model on a single RTX 3090 in bfloat16 precision (which roughly halved the VRAM usage). In the paper you used 8 GPUs to train the model. How long did it take to train the model? Will I have to train roughly 8 times longer than you did because I'm using a single GPU? Also, I think you forgot to remove your wandb API key from train.py.
>
> I'd also recommend adding torch.cuda.empty_cache() in audio_litmodule.py, right after self.validation_step_outputs.clear(): after validation, VRAM usage stayed higher than it was during the first epoch.

Thank you for your feedback! The training of this model took one week, and we trained for a total of 200 epochs. You’re right, I forgot to remove the wandb API key, and I have now taken care of it. Additionally, I’ve implemented your suggestion to add torch.cuda.empty_cache() after self.validation_step_outputs.clear() to reduce VRAM usage after validation. I appreciate your input and hope these adjustments will help improve your training process!

JusperLee commented 1 week ago

> Thank you for the work. It's great and it fills a real gap.
>
> I want to restore some Chinese songs. As far as I know, Apollo's training data consists of English song stems and mixes. Would the model be biased against East Asian songs? I tried inference and could hear a slight bias in the output audio. Generally, Western mixing production is more powerful and dynamic than Asian production, so the repaired Chinese songs end up with extra energy in their rhythm parts. Maybe the restoration could be gentler for East Asian songs.
>
> I'm downloading the original training data and will study it. My question: are stems necessary in the training data if the goal is just repairing final mixed songs? Can I train on mixtures alone?
>
> As for hyperparameters, could you offer some advice on fine-tuning from your checkpoint? It's fine-tuning, but possibly a deep one. Any advice on training recipes would be appreciated. I may also develop more augmentations for song restoration, but I have little experience training audio models.

Thank you for your insights! There is likely some bias in the model due to stylistic differences between datasets, which can lead to different learned features. If your goal is repairing mixed audio, you don't necessarily need stems in the training data; I used them mainly to give the model some generalization ability and to expand the training data.
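
If you do train on mixtures alone, one simple way to build (lossy, lossless) pairs is to round-trip each lossless mixture through an MP3 encoder. A sketch assuming ffmpeg is on PATH; this is illustrative, not the repo's actual data pipeline:

```python
# Sketch: derive a lossy training input from a lossless mixture by encoding
# to MP3 and decoding back to WAV. The bitrate choice is illustrative.
import subprocess

def make_lossy_copy(lossless_wav: str, lossy_wav: str, bitrate: str = "128k") -> None:
    mp3_path = lossy_wav + ".mp3"
    # Encode to MP3 at the target bitrate...
    subprocess.run(["ffmpeg", "-y", "-i", lossless_wav, "-b:a", bitrate, mp3_path], check=True)
    # ...then decode back to WAV so both files share the same sample format.
    subprocess.run(["ffmpeg", "-y", "-i", mp3_path, lossy_wav], check=True)

make_lossy_copy("mixture.wav", "mixture_lossy.wav")
```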

I recommend fine-tuning directly from the checkpoint, which will make the model more applicable to your new data without requiring extensive additional resources. However, pay close attention to the learning rate; I believe setting it to one-tenth of the original would be optimal.
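
In code terms, the setup could look like this hypothetical sketch (build_model(), the checkpoint path, and the base learning rate are placeholders, not Apollo's actual API):

```python
# Hypothetical fine-tuning setup: load the released weights, then optimize at
# one-tenth of the pre-training learning rate.
import torch

model = build_model()  # stand-in for however you construct the Apollo network

state = torch.load("apollo.ckpt", map_location="cpu")
# Lightning checkpoints keep weights under "state_dict"; adjust if yours differ.
model.load_state_dict(state.get("state_dict", state))

base_lr = 1e-3  # assumed pre-training learning rate; check the repo's config
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr / 10)
```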

Feel free to experiment with augmentations for song repairing as you progress!