Open leapway opened 1 month ago
Thank you for this work. It's great and it fills a gap.
I want to restore some Chinese songs. As far as I know, Apollo's training data consists of English song stems and mixes. Would the model be biased when applied to East Asian songs? I tried inference and heard a slight bias in the output: Western mixes are generally more powerful and dynamic than Asian ones, and the repaired Chinese songs come out with more energy in their rhythm parts. Maybe the repair could be gentler for East Asian songs.
I'm downloading the original training data and will study it. My question is: are the stems necessary in the training data if the goal is just to repair final mixes? Can I train on mixtures alone?
As for hyperparameters, could you give some advice on fine-tuning from your checkpoint? It would be fine-tuning, but possibly a fairly deep one. I'd appreciate any advice on training recipes. I may also develop more augmentations for song repair, but I have little experience training audio models.
Thank you for bringing us this great audio restoration project. I'm training a vocal stem enhancement model on a single RTX 3090 in bfloat16 precision (which cut VRAM usage in half). In the paper you used 8 GPUs to train the model. How long did the training take? Will I have to train 8 times longer than you did because I'm using a single GPU? Also, I think you forgot to remove your wandb API key from train.py.
I also recommend adding torch.cuda.empty_cache() in audio_litmodule.py, after self.validation_step_outputs.clear(), because after validation the VRAM usage was higher than during the first epoch.
Thank you for your feedback! The training of this model took one week, and we trained for a total of 200 epochs. You're right, I forgot to remove the wandb API key, and I have now taken care of it. Additionally, I've implemented your suggestion to add torch.cuda.empty_cache() after self.validation_step_outputs.clear() to reduce VRAM usage after validation. I appreciate your input and hope these adjustments will help improve your training process!
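For readers who want to see where that call goes, here is a minimal sketch. It assumes the module in audio_litmodule.py collects validation outputs in self.validation_step_outputs and clears them in the on_validation_epoch_end hook. The class name AudioModuleSketch is hypothetical, and it subclasses torch.nn.Module only so that the snippet runs without Lightning installed; it is not the repository's actual code.

```python
import torch


class AudioModuleSketch(torch.nn.Module):
    """Hypothetical stand-in for the module defined in audio_litmodule.py."""

    def __init__(self):
        super().__init__()
        # Outputs accumulated by validation_step during one validation run.
        self.validation_step_outputs = []

    def on_validation_epoch_end(self):
        # ... aggregate and log validation metrics here ...

        # Drop the references so the tensors can be freed.
        self.validation_step_outputs.clear()

        # Ask the caching allocator to release unused cached blocks, so the
        # memory reserved during validation is not held while training resumes.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```

Note that torch.cuda.empty_cache() only releases cached memory that no live tensor occupies, which is why it is placed after the clear() call.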
Thank you for your insights! There is likely some bias in the model due to stylistic differences, which can lead to different learned features. If you're running inference on mixed audio, you don't necessarily need the stems; I included them so that the model would have some generalization ability and to expand the training data.
I recommend fine-tuning directly from the checkpoint, which will make the model more applicable to your new data without requiring extensive additional resources. However, pay close attention to the learning rate; I believe setting it to one-tenth of the original would be optimal.
Feel free to experiment with augmentations for song repairing as you progress!
Hi, I'm starting to fine-tune Apollo on my own WAV files: 80 files totaling 300 minutes of music mixtures. Can you provide utilities for data preprocessing? In your repo, the data were processed into an HDF5 file before training. Alternatively, can you give some tips on loading these WAV files directly with your implementation of MusdbMoisesdbDataset?
If they can't be used as-is, please point out the unavoidable preprocessing steps. I believe source activity detection is not necessary for me, but rescaling and downsampling are vital. Could you provide a starting-point script or something similar?
Also, you mentioned the learning rate should be one-tenth. Is that for both G and D?
Thanks.
And does the released checkpoint include the discriminator? If not, would it be odd to initialize the model with only the generator?
Thank you for your questions and insights!
Data Preprocessing: In my setup, I used a Voice Activity Detection (VAD) algorithm, with energy-based thresholds, to split the audio into 6-second segments, which were then saved to an HDF5 file. You are free to use other formats as long as your dataset outputs both the original audio and the codec-compressed audio. If you'd like, I can provide a script for this preprocessing, but it should be straightforward if you're already familiar with handling WAV files.
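The author's script is not shown in this thread, so the following is only a rough sketch of the segmentation idea described above: cut the audio into 6-second chunks and keep those whose energy exceeds a threshold. The function names, the RMS energy measure, and the threshold value of 0.01 are all assumptions for illustration. It works on a plain list of mono float samples; loading WAV files, resampling, producing the codec-compressed counterpart, and writing HDF5 are left out.

```python
import math

SEGMENT_SECONDS = 6  # segment length mentioned in the thread


def rms(samples):
    """Root-mean-square energy of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def split_active_segments(samples, sample_rate, threshold=0.01):
    """Cut mono audio into non-overlapping 6 s chunks and keep the active ones.

    A chunk counts as active when its RMS energy is at least `threshold`
    (an assumed value; tune it for your material). A trailing chunk shorter
    than 6 s is dropped.
    """
    seg_len = SEGMENT_SECONDS * sample_rate
    segments = []
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        chunk = samples[start:start + seg_len]
        if rms(chunk) >= threshold:
            segments.append(chunk)
    return segments
```

Each kept segment would then be paired with a codec-compressed version of itself, since the dataset has to return both.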
Using Mixtures for Training: Yes, you can train using mixtures directly. The stems are not strictly necessary if your goal is just to repair mixed songs. The model should generalize sufficiently to perform well with mixed audio.
Hyperparameters: Here is the configuration for the learning rates of the generator (G) and discriminator (D) that I used:
optimizer_g:
  _target_: torch.optim.AdamW
  lr: 0.001
  weight_decay: 0.01
optimizer_d:
  _target_: torch.optim.AdamW
  lr: 0.0001
  weight_decay: 0.01
  betas: [0.5, 0.99]
If you're fine-tuning, I recommend starting with a learning rate that is one-tenth of these values. Yes, this adjustment applies to both G and D.
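As a small illustration of the one-tenth rule, the sketch below scales both learning rates from the configuration quoted above. The helper name finetune_optimizers is hypothetical, the config is written as a plain Python dict rather than loaded from the repository's YAML, and placing betas under optimizer_d is an assumption based on the order of the quoted lines.

```python
import copy

# Values quoted in the thread for training from scratch.
# Assumption: `betas` belongs to optimizer_d.
BASE_OPTIMIZERS = {
    "optimizer_g": {
        "_target_": "torch.optim.AdamW",
        "lr": 0.001,
        "weight_decay": 0.01,
    },
    "optimizer_d": {
        "_target_": "torch.optim.AdamW",
        "lr": 0.0001,
        "weight_decay": 0.01,
        "betas": [0.5, 0.99],
    },
}


def finetune_optimizers(base, lr_scale=0.1):
    """Return a copy of `base` with every optimizer's lr multiplied by lr_scale."""
    cfg = copy.deepcopy(base)  # leave the original config untouched
    for optimizer in cfg.values():
        optimizer["lr"] = optimizer["lr"] * lr_scale
    return cfg


finetune_cfg = finetune_optimizers(BASE_OPTIMIZERS)
```

With the default scale this gives roughly 1e-4 for G and 1e-5 for D, leaving weight decay and betas unchanged.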
Let me know if you need further assistance!
Here are some key topics on my mind:
Do you have any insight into scaling in this paradigm? Judging by the checkpoint size, it's a small model. The two datasets you used contain about 30 hours of audio, and you trained for 200 epochs with early stopping. I might get 30 hours or more of audio for my task.
If I want higher accuracy and fidelity in the repaired result, which hyperparameter should I pay the most attention to? Would it be the hidden dimension (feature_dim in your config)? My guess is that layer depth matters less than the hidden dimension, because the repair is about fine audio texture in the highest frequencies.
Would you say the default dimension of 256 is enough for high-accuracy repair? Or would a higher dimension cause VRAM issues? I only have 24GB cards right now.
In practice, audio degradation is not just a single pass of lossy MP3 encoding. For example, old MP3 files on the internet may have been re-encoded several times as they circulated. I might try simulating that in data augmentation. There is also degraded audio from vinyl records or tapes, which is much harder to simulate, though a rough simulation is possible. Have you considered these cases, or do you have any ideas for handling them without precise simulation? If it works, it would open up a wide range of applications.
Thank you for raising these insightful points!
Scaling the Model: You could certainly scale up the model to match the larger dataset you have. Increasing the model size may help to capture more complex details in the audio, especially if your dataset grows substantially beyond the current 30 hours.
Hyperparameter Importance: I believe both depth and width are crucial for improving fidelity and accuracy. You might want to experiment with the hidden dimensions as well as the number of layers. The Roformer architecture, for example, balances both depth and width effectively. Paying attention to its scaling strategy might be insightful.
Hidden Dim Considerations: In my experience, the default hidden dimension of 256 works well for high-accuracy audio repair tasks. However, increasing it can enhance capacity at the cost of higher VRAM consumption. With a 24GB card, there is some room to experiment with larger dimensions, though you'll need to balance it carefully to avoid running into memory issues.
Multiple Encoding Loss & Generalization: You bring up an excellent point about multiple encodings over time. We didn't explicitly consider this in our model, but it's likely beneficial to simulate these effects in data augmentation to improve generalization. Simulating repeated lossy encoding or artifacts like those from vinyl records or tapes could help the model learn to deal with a wider range of degradation, which would be quite valuable in real-world applications.
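One possible way to approximate repeated lossy encoding is to chain several WAV to MP3 to WAV round trips at random bitrates. The sketch below only builds the ffmpeg command lines and does not run them. The function name reencode_commands, the file naming scheme, and the bitrate choices are assumptions for illustration; it also assumes an ffmpeg build with the libmp3lame encoder. It is not part of the Apollo repository.

```python
import random


def reencode_commands(src_wav, workdir, rounds=3,
                      bitrates=("64k", "96k", "128k", "192k"), seed=None):
    """Build ffmpeg argv lists for `rounds` WAV -> MP3 -> WAV cycles.

    Returns (commands, final_wav_path). Each cycle encodes the previous
    generation to MP3 at a randomly chosen bitrate, then decodes it back
    to WAV, so compression artifacts accumulate across generations.
    """
    rng = random.Random(seed)
    commands = []
    current = src_wav
    for i in range(rounds):
        mp3_path = f"{workdir}/gen{i}.mp3"
        wav_path = f"{workdir}/gen{i}.wav"
        bitrate = rng.choice(bitrates)
        # Lossy encode of the current generation.
        commands.append(["ffmpeg", "-y", "-i", current,
                         "-codec:a", "libmp3lame", "-b:a", bitrate, mp3_path])
        # Decode back to WAV so the next cycle re-encodes degraded audio.
        commands.append(["ffmpeg", "-y", "-i", mp3_path, wav_path])
        current = wav_path
    return commands, current
```

To apply it, each command can be executed in order with subprocess.run(cmd, check=True); the final WAV then serves as the degraded input and the original file as the target.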
Hope this helps! Let me know if you have further questions or if you'd like to discuss specific implementation details.