adobe-research / MetaAF

Control adaptive filters with neural networks.
https://jmcasebeer.github.io/projects/metaaf

The replicated results don't match the demo. #21

Open quizt35 opened 2 months ago

quizt35 commented 2 months ago

Hello! Thanks for sharing the pre-trained models and demos. I would like to replicate the demo results using a pre-trained model. I used the data from the first row of the double-talk demo and converted the mp3 files to wav format (single channel, 16000 Hz, 16-bit) for convenience. Based on the speech titles downloaded from the demo page, I selected the matching pkl file to process the original speech. However, there is a significant difference between the spectrograms on the demo page and those generated with the pre-trained model. I've checked every step and can't find the reason. Could you help me understand why?
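A minimal sketch of that conversion, assuming librosa and soundfile are available (the file names are placeholders):

```python
import librosa
import soundfile as sf

# Load the demo mp3: downmix to mono and resample to 16 kHz.
x, fs = librosa.load("demo_doubletalk.mp3", sr=16000, mono=True)

# Write 16-bit PCM wav, matching the format used for inference.
sf.write("demo_doubletalk.wav", x, fs, subtype="PCM_16")
```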

[Images: spectrogram from the demo page vs. spectrogram produced with the pre-trained model]

Model tag: v1.0.1. The code I used is below:

```python
import os

import numpy as np
import librosa
import soundfile as sf

from aec_eval import get_system_ckpt

# Pre-trained AEC checkpoint from the v1.0.1 model release.
ckpt_dir = "v1.0.1_models/aec/"
name = "meta_aec_16_combo_rl_4_1024_512_r2"
date = "2022_10_19_23_43_22"
epoch = 110

ckpt_loc = os.path.join(ckpt_dir, name, date)

system, kwargs, outer_learnable = get_system_ckpt(
    ckpt_loc,
    epoch,
)
fit_infer = system.make_fit_infer(outer_learnable=outer_learnable)
fs = 16000

out_dir = "metaAF_output"
os.makedirs(out_dir, exist_ok=True)

# u: far-end reference, d: mic signal, s: near-end speech.
u, _ = librosa.load("u.wav", sr=fs)
d, _ = librosa.load("d.wav", sr=fs)
s, _ = librosa.load("s.wav", sr=fs)
e = d - s  # echo component of the mic signal

# Batch the signals as (batch, time, channels) arrays.
d_input = {
    "u": u[None, :, None],
    "d": d[None, :, None],
    "s": s[None, :, None],
    "e": e[None, :, None],
}
pred = system.infer({"signals": d_input, "metadata": {}}, fit_infer=fit_infer)[0]
pred = np.array(pred[0, :, 0])

sf.write(os.path.join(out_dir, "out.wav"), pred, fs)
```

Looking forward to hearing from you, thanks!

jmcasebeer commented 2 months ago

Hello and thanks for the question.

The demo files are all rescaled to [-1, 1] for playback (see the website footnote), which is not how the AEC data was set up for training. A previous GitHub issue noted this as well and rescaled with d = d / 10.
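Continuing the snippet above, a minimal sketch of applying that rescale before inference (the factor 10 follows the earlier issue; applying the same factor to u and s is an assumption here):

```python
# Undo the demo-page playback normalization before inference.
# Assumption: the same factor applies to u, d, and s; the earlier
# issue only rescaled d, and 10 is its empirical value, not an
# exact inverse of the [-1, 1] normalization.
scale = 10.0
u, d, s = u / scale, d / scale, s / scale
e = d - s  # recompute the echo at the corrected scale
```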

If you want to replicate my results fully, I would recommend downloading the data from the AEC challenge and using that.

quizt35 commented 2 months ago

Thanks for your reply. By applying a scale factor, I get a more reasonable result, but there are still some minor issues. As shown in the figure below, there are similar impulses in the first few seconds of the speech. I'm wondering if this is due to the analysis window or the format of the original speech. I will also follow your suggestion and test on the AEC Challenge datasets.

[Image: spectrogram of the rescaled output, showing impulses in the first few seconds]
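For a like-for-like comparison, both signals can be plotted with identical STFT parameters. A diagnostic sketch, reusing d and pred from the snippet above (the window settings are arbitrary choices, not necessarily those used for the demo figures):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def plot_spec(x, fs, title, ax):
    # Identical STFT settings for both signals so the plots are comparable.
    S = librosa.amplitude_to_db(
        np.abs(librosa.stft(x, n_fft=512, hop_length=256)), ref=np.max
    )
    librosa.display.specshow(S, sr=fs, hop_length=256, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
plot_spec(d, fs, "mic signal d", axes[0])
plot_spec(pred, fs, "MetaAF output", axes[1])
plt.tight_layout()
plt.show()
```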
quizt35 commented 2 months ago

Additionally, should the URL for JAX in the README "GPU Setup" section be https://storage.googleapis.com/jax-releases/jax_cuda_releases.html?