thomasgauthier opened 1 year ago
I want to detect whether an audio sample contains laughing. Can this model help me do that?

For zero-shot use, the only way is to post-process the inferred output audio. For training a classifier, given the model structure, you could perhaps use the separationNet's features, but I did not find that keyword in the repo. And to be honest, based on my tries with this model, the zero-shot performance really depends on your data...
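Here is a minimal sketch of the post-processing route: run the output audio through an off-the-shelf AudioSet tagger and check the laughter-related scores. The checkpoint (`MIT/ast-finetuned-audioset-10-10-0.4593` via the Hugging Face `transformers` audio-classification pipeline), the score threshold, and the label matching are assumptions for illustration, not anything provided by this repo.

```python
# Sketch: detect laughter in an audio file by post-processing it with an
# AudioSet-trained Audio Spectrogram Transformer (assumed checkpoint).
# Requires `transformers`, `torch`, and ffmpeg available for audio decoding.
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",  # assumed AudioSet tagger
)

def contains_laughter(wav_path: str, threshold: float = 0.3) -> bool:
    # Take the top predictions and flag any laughter-related AudioSet label
    # (e.g. "Laughter", "Giggle", "Belly laugh") above the chosen threshold.
    predictions = classifier(wav_path, top_k=20)
    return any(
        "laugh" in p["label"].lower() and p["score"] >= threshold
        for p in predictions
    )

print(contains_laughter("output.wav"))  # path to the model's output audio
```

The threshold is something you would need to tune on your own data, which is also where the zero-shot quality issue mentioned above shows up.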