Stability-AI / stable-audio-tools

Generative models for conditional audio generation
MIT License

Training on CLAP Embedding #156

Open javanasse opened 3 weeks ago

javanasse commented 3 weeks ago

I am trying to train the model on CLAP embeddings, as described in the docs. I have updated `model_config.json` to:

```json
"model": {
    ...
    "conditioning": {
        ...
        "configs": [
            {
                "id": "prompt",
                "type": "clap_text",
                "config": {
                    "clap_ckpt_path": "/path/to/music_audioset_epoch_15_esc_90.14.pt",
                    "audio_model_type": "HTSAT-base",
                    "enable_fusion": true,
                    "use_text_features": true,
                    "feature_layer_ix": -1
                }
            },
            ...
        ]
    }
}
```

But how should I update the `get_custom_metadata` method? Do I need to pass precomputed CLAP embeddings to the model? The `"prompt"` value in the dictionary returned by `get_custom_metadata` is required to be a string. Any help would be much appreciated; the docs are a bit vague in this area.
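For what it's worth, my current understanding (which may be wrong) is that with a `clap_text` conditioner the CLAP embedding is computed from the prompt string at train time, so `get_custom_metadata` would still just return text. A minimal sketch, assuming the `info` dict passed in carries a `"relpath"` key and deriving the prompt from the filename purely for illustration:

```python
# Hypothetical custom_metadata module. Assumption: the "clap_text" conditioner
# embeds the raw prompt string itself, so no precomputed CLAP embedding is
# needed here -- we only return text.
import os


def get_custom_metadata(info, audio):
    # Derive a text prompt from the audio file's name (illustrative only;
    # a real dataset would likely read captions from a sidecar file).
    prompt = os.path.splitext(os.path.basename(info["relpath"]))[0]
    return {"prompt": prompt}
```

Does that match how the conditioner is meant to be used, or is there a separate path for feeding embeddings in directly?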