gabrielmittag / NISQA

NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment

Continuous metrics? #31

Closed by youssefabdelm 1 year ago

youssefabdelm commented 1 year ago

Hi Gabriel, thanks for making such a useful model!

I have 2 files: one denoised and one noisy. In some cases, when the quality drops below a certain threshold, I'd like to switch to whichever file has the higher quality, so to speak (excluding noisiness).

However, I'd like to do this in a smooth way. So my question is: is it possible to export continuous metrics?

Would I be right in assuming that y_hat_list is a list of metrics? If so, how could I map it back onto the number of audio samples? And would it even be accurate/advisable to do so in my case?

# Quoted from the NISQA code for reference; it relies on numpy, torch, and
# torch.utils.data.DataLoader being imported at module level:
import numpy as np
import torch
from torch.utils.data import DataLoader

def predict_dim(model, ds, bs, dev, num_workers=0):
    '''
    predict_dim: predicts MOS and dimensions of the given dataset with the
    given model. Used for the NISQA_DIM model.
    '''
    dl = DataLoader(ds,
                    batch_size=bs,
                    shuffle=False,
                    drop_last=False,
                    pin_memory=False,
                    num_workers=num_workers)
    model.to(dev)
    model.eval()
    with torch.no_grad():
        # one [predictions, targets] pair per batch; each model output has shape (batch_size, 5)
        y_hat_list = [[model(xb.to(dev), n_wins.to(dev)).cpu().numpy(), yb.cpu().numpy()]
                      for xb, yb, (idx, n_wins) in dl]
    yy = np.concatenate(y_hat_list, axis=1)

    y_hat = yy[0, :, :]  # predictions: one row per file, columns = MOS + 4 dimensions
    y = yy[1, :, :]      # targets from the dataset (if available)

    ds.df['mos_pred'] = y_hat[:, 0].reshape(-1, 1)
    ds.df['noi_pred'] = y_hat[:, 1].reshape(-1, 1)
    ds.df['dis_pred'] = y_hat[:, 2].reshape(-1, 1)
    ds.df['col_pred'] = y_hat[:, 3].reshape(-1, 1)
    ds.df['loud_pred'] = y_hat[:, 4].reshape(-1, 1)

    return y_hat, y
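
For context, this is roughly the switching logic I have in mind on top of the per-file columns that predict_dim writes into ds.df. The names df_denoised, df_noisy, and the threshold below are placeholders for this sketch only, not anything provided by NISQA:

def pick_file(df_denoised, df_noisy, mos_threshold=3.0):
    '''Hypothetical helper: choose between the denoised and noisy file based
    on the per-file NISQA predictions, ignoring the noisiness dimension.'''
    mos_den = df_denoised['mos_pred'].iloc[0]
    mos_noi = df_noisy['mos_pred'].iloc[0]
    # keep the denoised file while its predicted MOS is above the threshold
    if mos_den >= mos_threshold:
        return 'denoised'
    # otherwise fall back to whichever file is predicted to sound better overall
    return 'denoised' if mos_den >= mos_noi else 'noisy'
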
gabrielmittag commented 1 year ago

Hi, I am not sure if I understand correctly. The model outputs the predicted overall MOS and the four quality dimensions for each input audio sample. What do you mean exactly by continuous?

youssefabdelm commented 1 year ago


Oh sorry, I was not clear. By continuous I simply meant whether it can export the four quality dimensions for every audio sample, as opposed to the overall metrics for the entire (say, 30-second) file. So for a 16 kHz file, it would export the metrics for every single one of its samples (16000 per second). (Similar to: https://github.com/vvvm23/stoi-vqcpc)

I was curious which variable I should look at to extract that array of metrics for every audio sample (instead of the overall / average score).

gabrielmittag commented 1 year ago

Oh I see - that is not possible, at least not with the model architecture that the pretrained models are using. You could probably train a new model where you restrict the output size to 1 before pooling and use the pre-pooling output as a 'continuous' score.
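
To make that idea concrete, here is a minimal, purely illustrative sketch of such a frame-wise head (not the actual NISQA architecture; the feature size and layers are made up, and the pre-pooling output would be per time frame rather than per audio sample):

import torch
import torch.nn as nn

class FrameWiseQualityHead(nn.Module):
    '''Toy head that outputs one score per time frame and only then pools
    over time, so the pre-pooling output can be read as a per-frame
    ('continuous') quality trace.'''
    def __init__(self, feat_dim=64):
        super().__init__()
        self.frame_head = nn.Sequential(
            nn.Linear(feat_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        # x: (batch, n_frames, feat_dim) frame-level features
        frame_scores = self.frame_head(x).squeeze(-1)  # (batch, n_frames)
        file_score = frame_scores.mean(dim=1)          # pooled file-level score
        return file_score, frame_scores

# shape check with random features standing in for real frame-level features
feats = torch.randn(1, 120, 64)
file_score, frame_scores = FrameWiseQualityHead()(feats)
print(file_score.shape, frame_scores.shape)  # torch.Size([1]) torch.Size([1, 120])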

youssefabdelm commented 1 year ago

Ah I see, thank you!