Jourdelune opened this issue 4 months ago
If you don't want to make any big changes to the dataloader, I suggest providing only the path to the folder of the output audio in the dataset config. Then manually create a JSON file containing a dict that maps the filename (rel_path in get_custom_metadata) of each output audio file to the absolute path of its input audio file, load the input audio file in get_custom_metadata, and use e.g. CLAP to extract an audio embedding (you might need to modify the conditioner). The rest is pretty much the same as text2audio.
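Something along these lines (a rough, untested sketch; the JSON path, the "relpath" key, and the laion_clap calls are assumptions you'd need to adapt to your setup):

```python
import json

import laion_clap  # pip install laion_clap

# JSON file you create by hand: output filename (relative path) -> input file absolute path
with open("/path/to/input_map.json") as f:
    INPUT_MAP = json.load(f)  # e.g. {"audio1.wav": "/data/input_audio/audio1.wav", ...}

CLAP_MODEL = laion_clap.CLAP_Module(enable_fusion=False)
CLAP_MODEL.load_ckpt()  # downloads the default pretrained checkpoint


def get_custom_metadata(info, audio):
    # Look up the matching input file for this output file
    # (assuming the info dict exposes the relative path under a key like "relpath").
    input_path = INPUT_MAP[info["relpath"]]
    # (1, 512) embedding of the input audio; a custom conditioner would have to consume this.
    input_embedding = CLAP_MODEL.get_audio_embedding_from_filelist(x=[input_path], use_tensor=True)
    return {"input_audio_embedding": input_embedding}
```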
To add a little to this: the embedding functionality can currently handle text or numeric data, but it cannot yet embed other kinds of data such as audio.
The conditioning data processed by get_custom_metadata can be thought of as any data that describes or relates to features of the audio. It could be a pitch value, a text description of the content, or how many seconds into a file the section in question starts. Once the model is trained, conditioning data is used to generate new audio: e.g. give the model a text description and it will output audio that matches that text, based on the dataset it was trained on.
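For example, conditioning values are returned from get_custom_metadata as a plain dict, roughly like this (the values here are purely illustrative):

```python
def get_custom_metadata(info, audio):
    # Conditioning values for this file, keyed by the conditioner ids in the model config.
    return {
        "prompt": "a short solo piano phrase",  # text conditioning
        "seconds_start": 0,                     # numeric conditioning
    }
```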
It's a bit unclear what you hope to achieve by using an audio file as conditioning data (style transfer? that is done in another way), and how one audio file would be used to 'describe' features of another.
CLAP (Contrastive Language-Audio Pretraining) can be used to generate descriptions for unlabeled audio, which can be helpful for some tasks. However, its effectiveness depends on the data it was trained on.
As @BingliangLi mentions, you could also use CLAP to get embeddings from your audio files. These embeddings can be thought of as an abstract representation of the audio data and are returned as a numpy.ndarray. Using these embeddings as conditioning data involves handling a lot of information, and would mean either modifying this repo significantly or using a hacky approach, e.g. treating each value in the embedding array as a separate conditioning value, which isn't ideal since each value would be handled individually. You would also need to determine appropriate values for parameters like embed_dim and cond_token_dim to actually get this much data used in a meaningful way.
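To give a sense of the shape bookkeeping involved, here is a minimal PyTorch sketch (not the repo's actual conditioner code; the dimensions are assumptions): a single CLAP embedding would have to be projected to cond_token_dim and given a token/sequence dimension before it could be used for cross-attention conditioning.

```python
import torch
import torch.nn as nn

clap_dim = 512        # dimensionality of a CLAP audio embedding
cond_token_dim = 768  # would have to match cond_token_dim in the model config

project = nn.Linear(clap_dim, cond_token_dim)

clap_embedding = torch.randn(1, clap_dim)           # stand-in for a real CLAP embedding
cond_tokens = project(clap_embedding).unsqueeze(1)  # shape: (batch, 1, cond_token_dim)
cond_mask = torch.ones(1, 1, dtype=torch.bool)      # attention mask for the single token
```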
Overall, the relationship between the input audio file and the output audio file you want to use for conditioning is not clear. If you describe what you want to do, rather than the issue you've encountered with your current approach, you'll probably get better advice.
Thank you for your comment. I'm trying to create a model that improves the quality of audio (music, voice, etc.). My input is a low-quality sound, and the output is the same sound in high quality. I will see if I can use CLAP, but the latent space needs to contain all the information the decoder needs to generate the audio.
I don't think CLAP is fit for this job. Maybe you should check out AudioLDM; their model can perform audio super-resolution, and you may find some inspiration there: https://audioldm.github.io/
Thank you for the recommendation, I will look into it! I hope one day stable-audio-tools will also be able to support this type of task ^^
Hey, I want to train a model on an audio-to-audio task (a wav file provided as the input and another wav file as the output). Do you have any idea how to configure this with the dataset configuration? I can provide the path of one audio directory, but how can I provide two different directories (the filenames are the same between the input and output directories)? Any help is welcome!
My directory configuration is:
input_audio/
output_audio/
where input_audio/audio1.wav is the input and output_audio/audio1.wav the target.
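For reference, the pairing I need between the two directories can be enumerated like this (a small sketch; paths are illustrative):

```python
from pathlib import Path

input_dir = Path("/data/input_audio")
output_dir = Path("/data/output_audio")

# Filenames match between the two directories, so each output file has one matching input file.
pairs = {
    str(input_dir / wav.name): str(wav)  # input file -> target file
    for wav in sorted(output_dir.glob("*.wav"))
}
```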