We can start thinking about how to evaluate our models once trained. The simplest option would be the contrastive loss over some held-out set with a fixed batch size. Another would be to evaluate how well CLAP score correlates with MOS (Mean Opinion Score), which is the gold-standard subjective eval in the audio NN literature. We could probably also try training a linear probe on the Google Speech Commands dataset.
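As a rough sketch of the first option, something like the following could compute a symmetric InfoNCE loss over held-out batches. This assumes the model exposes `embed_audio(waveforms)` and `embed_text(captions)` methods returning embedding tensors; both names (and the loader format) are placeholders, not a settled API.

```python
# Minimal held-out contrastive (InfoNCE) loss evaluation sketch.
# Assumes: model.embed_audio / model.embed_text are hypothetical methods
# returning [B, D] embeddings, and the loader yields (waveforms, captions)
# batches of a fixed size.
import torch
import torch.nn.functional as F


@torch.no_grad()
def heldout_contrastive_loss(model, loader, temperature=0.07, device="cpu"):
    """Average symmetric InfoNCE loss over a held-out loader."""
    losses = []
    for waveforms, captions in loader:
        a = F.normalize(model.embed_audio(waveforms.to(device)), dim=-1)
        t = F.normalize(model.embed_text(captions), dim=-1)
        logits = a @ t.T / temperature                    # [B, B] similarity matrix
        labels = torch.arange(logits.size(0), device=logits.device)
        loss = 0.5 * (F.cross_entropy(logits, labels)     # audio -> text
                      + F.cross_entropy(logits.T, labels))  # text -> audio
        losses.append(loss.item())
    return sum(losses) / len(losses)
```

Keeping the batch size fixed matters here because the InfoNCE loss scales with the number of in-batch negatives, so losses from different batch sizes aren't directly comparable.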