If Dasheng is used in the diffusion model of the audio field, is there something magical happening?

RicherMans / Dasheng

Source for the Interspeech 2024 Paper "Scaling up masked audio encoder learning for general audio classification"

Apache License 2.0

44 stars, 3 forks

If Dasheng is used in the diffusion model of the audio field, is there something magical happening? #1

Closed yuyun2000 closed 5 months ago

RicherMans commented 5 months ago

Hey, sorry, I don't understand the title or the question.

yuyun2000 commented 5 months ago

I was wrong. I thought it was a VAE-like model that could be used in diffusion models, but it doesn't have the ability to reconstruct audio.

RicherMans commented 5 months ago

Hey, so technically it can, but I don't provide the decoder here. VAE and MAE are both autoencoders, so that's that. If you want, you could just attach a decoder on top of the model and train your own latents.
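To make the "attach a decoder and train" suggestion concrete, here is a minimal PyTorch sketch. It is not the Dasheng API and not the author's decoder: `StandInEncoder` is a hypothetical frozen placeholder for a pretrained encoder, `FrameDecoder` is a made-up MLP, and the shapes (16 kHz input, 640-sample hop, 768-dim features) as well as the plain waveform MSE loss are illustrative assumptions only.

```python
import torch
import torch.nn as nn


class StandInEncoder(nn.Module):
    """Hypothetical placeholder for a pretrained encoder (NOT Dasheng itself).

    Maps a waveform (batch, samples) to frame features (batch, frames, dim).
    """

    def __init__(self, hop: int = 640, dim: int = 768):
        super().__init__()
        self.hop = hop
        self.proj = nn.Linear(hop, dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        b, n = wav.shape
        # Drop trailing samples that do not fill a whole frame.
        frames = wav[:, : n - n % self.hop].reshape(b, -1, self.hop)
        return self.proj(frames)


class FrameDecoder(nn.Module):
    """Hypothetical decoder: predicts `hop` waveform samples per feature frame."""

    def __init__(self, dim: int = 768, hop: int = 640, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, hop)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b = feats.shape[0]
        # (batch, frames, hop) -> (batch, frames * hop)
        return self.net(feats).reshape(b, -1)


torch.manual_seed(0)

# Freeze the encoder; only the decoder is trained.
encoder = StandInEncoder().eval()
for p in encoder.parameters():
    p.requires_grad_(False)

decoder = FrameDecoder()
optim = torch.optim.Adam(decoder.parameters(), lr=1e-3)

wav = torch.randn(4, 16000)  # random stand-in for a batch of 1 s clips
losses = []
for step in range(20):
    with torch.no_grad():
        feats = encoder(wav)  # (4, 25, 768)
    recon = decoder(feats)    # (4, 16000)
    loss = nn.functional.mse_loss(recon, wav[:, : recon.shape[1]])
    optim.zero_grad()
    loss.backward()
    optim.step()
    losses.append(loss.item())
```

In practice you would swap the placeholder for the real pretrained encoder, train on real audio, and most likely use a stronger decoder and loss (for example a spectrogram target followed by a vocoder); the sketch only shows the wiring.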

RicherMans commented 5 months ago

Btw, maybe for your info: SemantiCodec is a codec based on MAE, which shows promising performance. I would assume that Dasheng greatly outperforms their AudioMAE. Furthermore, Frepainter has shown that MAE-based approaches can outperform diffusion for super-resolution.

In both cases you would just need to attach a decoder on top of your Dasheng features and train. I believe you can give it a try :) Kind regards, Heinrich

yuyun2000 commented 5 months ago

I am just a beginner, but if given the opportunity, I will definitely use Dasheng to showcase its capabilities. :)