About Magnet's performance

RevolGMPHL commented 5 months ago

Why is the performance so poor? Is there a bug, or is the model itself simply this weak?

CyberTimon commented 5 months ago

Do you mean sound quality or speed? If you mean sound quality, yes, I was expecting better too.

lonzi commented 5 months ago

You can try changing the span arrangement from 'nonoverlap' to 'stride1', which should improve audio quality. It is not the default because it was introduced only recently.
The models are sensitive to the generation parameters, so it is worth experimenting with them, especially max_cfg_coef, the decoding steps, and temperature.
There is some degradation in quality in the released checkpoints (trained on 16K hours of data) compared to the models reported in our paper (trained on 20K hours of data). You can hear samples from the original models at https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/.
In general, quality depends heavily on the training datasets, and we encourage the community to experiment with training MAGNeT on new datasets.
We plan to release the Hybrid-MAGNeT model and the code for model rescoring in the near future (see our paper for explanations of both), which should make the results more stable.
Non-autoregressive transformers are still under-researched compared to autoregressive transformer decoders. We encourage the community to devise improvements for both the method and the open-sourced code.
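
One way to explore the parameters mentioned above is a small grid sweep. The sketch below only builds the parameter combinations; the candidate values are illustrative assumptions rather than recommendations, and the key names (`decoding_steps` in particular) are assumed to match what `MAGNeT.set_generation_params` accepts in your installed audiocraft version.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

# Illustrative candidate values (assumptions, not tuned recommendations).
TEMPERATURES: List[float] = [2.5, 3.0, 3.5]
MAX_CFG_COEFS: List[float] = [5.0, 10.0, 20.0]
DECODING_STEPS: List[List[int]] = [[20, 10, 10, 10], [40, 10, 10, 10]]


def sweep(
    temperatures: List[float] = TEMPERATURES,
    max_cfg_coefs: List[float] = MAX_CFG_COEFS,
    decoding_steps: List[List[int]] = DECODING_STEPS,
) -> Iterator[Dict[str, Any]]:
    """Yield one keyword dict per combination of the three parameters."""
    for temperature, max_cfg, steps in product(
        temperatures, max_cfg_coefs, decoding_steps
    ):
        yield {
            "span_arrangement": "stride1",  # the non-default arrangement suggested above
            "temperature": temperature,
            "max_cfg_coef": max_cfg,
            "decoding_steps": list(steps),  # copy so callers can mutate safely
        }


if __name__ == "__main__":
    for config in sweep():
        print(config)
```

Each yielded dict could then be passed as keyword arguments to the model's generation-parameter setter, generating one sample per configuration for side-by-side listening.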

CyberTimon commented 5 months ago

Thanks @lonzi for the answer! Will try out these tips tomorrow.

RevolGMPHL commented 5 months ago

Thanks for the answer! I tried several parameter settings, but the quality still seems worse than that of the original model.

CyberTimon commented 5 months ago

@lonzi, could you kindly share your sampling parameters or add a note to the README with recommended parameters?

Thank you so much!

lonzi commented 5 months ago

For Music:
span_arrangement: 'stride1'
use_sampling: true
top-p: 0.9
temperature: 3.0
max_cfg_coef: 10.0
min_cfg_coef: 1.0
decoding_iterations (for 10 secs): [20, 10, 10, 10]
decoding_iterations (for 30 secs): [60, 10, 10, 10]

See our paper for the ablation studies.

For Sound [audio-magnet models]:
span_arrangement: 'stride1'
use_sampling: true
top-p: 0.8
temperature: 3.5
max_cfg_coef: 20.0
min_cfg_coef: 1.0
decoding_iterations: [20, 10, 10, 10]

*These are the parameters from the paper's ablation studies; they are not necessarily tuned for the open-source models.
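
The settings listed above could be applied roughly as follows. This is a minimal sketch, not an official recipe: it assumes that audiocraft exposes `MAGNeT.get_pretrained` and `MAGNeT.set_generation_params`, and that the thread's "top-p" and "decoding_iterations" correspond to the keyword arguments `top_p` and `decoding_steps`. Check the signature in your installed version before relying on it. The helper names `generation_params` and `generate` are hypothetical.

```python
from typing import Any, Dict, List

# Values quoted in this thread (from the paper's ablation studies,
# not necessarily tuned for the released checkpoints).
_COMMON: Dict[str, Any] = {
    "use_sampling": True,
    "span_arrangement": "stride1",
    "min_cfg_coef": 1.0,
}
_PRESETS: Dict[str, Dict[str, Any]] = {
    "music": {"top_p": 0.9, "temperature": 3.0, "max_cfg_coef": 10.0},
    "sound": {"top_p": 0.8, "temperature": 3.5, "max_cfg_coef": 20.0},
}


def generation_params(kind: str, duration_secs: int = 10) -> Dict[str, Any]:
    """Build the keyword dict for 'music' or 'sound' (hypothetical helper)."""
    if kind not in _PRESETS:
        raise ValueError(f"kind must be one of {sorted(_PRESETS)}, got {kind!r}")
    params = {**_COMMON, **_PRESETS[kind]}
    # Only the 30-second music setting uses a longer first decoding stage.
    first_stage = 60 if (kind == "music" and duration_secs == 30) else 20
    params["decoding_steps"] = [first_stage, 10, 10, 10]
    return params


def generate(model_name: str, descriptions: List[str], kind: str,
             duration_secs: int = 10):
    """Load a checkpoint and generate audio with the quoted settings."""
    # Imported lazily so the presets can be inspected without audiocraft installed.
    from audiocraft.models import MAGNeT

    model = MAGNeT.get_pretrained(model_name)
    # Keyword names are assumed to match set_generation_params; verify locally.
    model.set_generation_params(**generation_params(kind, duration_secs))
    return model.generate(descriptions)
```

Calling `generation_params("music", 30)` returns the 30-second music settings without loading any model, which makes it easy to print and compare the presets before running generation.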

facebookresearch / audiocraft, issue #395