aleXiehta / PhoneFortifiedPerceptualLoss

Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement

Some General Doubts related to Implementation #1

Open raikar8 opened 3 years ago

raikar8 commented 3 years ago

Dear Tsun-An, I am reading your paper on the significance of phonetic information for speech enhancement. The paper is really well written, and the experiments are described clearly. However, I have a few questions about the paper and the code you provided:

1) Do you think it will have an impact on ASR performance? (I mean speech enhanced with the phonetic-loss model, compared with other methods.)

2) Is the pretrained model the same one you used in the paper for calculating the objective scores?

3) Can you please provide an example template for dataset.py?

4) The paper says "A report of the full comparisons and analyses can be accessed on our GitHub page". Are you planning to add more results to this repository in the future?

Thanks, Aditya Raikar

aleXiehta commented 3 years ago

Dear Aditya Raikar,

I am grateful that you are interested in our work.

  1. It is possible that using the PFP loss for SE helps downstream ASR tasks, but we have not tested this yet. We do care about whether our system can support ASR, so please let us know if you make any progress.

  2. The provided pre-trained model is optimized with PFP loss + MAE (reaching PESQ 3.11).

  3. To help anyone who would like to implement our system, we have uploaded a template for data loading. Thanks for reminding me. Please note that our implementation of the normalization function may differ from some other approaches (see the sketch after this list).

  4. We plan to illustrate the correlation between each metric and the PFP loss, but we are facing some version-conflict issues that are not yet solved.
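
For reference, one common choice for such a normalization is per-utterance standardization. The sketch below is only illustrative and is not necessarily what the repository's normalize function does:

    import torch

    def normalize(wav: torch.Tensor) -> torch.Tensor:
        # Illustrative sketch: zero-mean, unit-variance scaling per utterance.
        # The repository's actual normalize() may be implemented differently.
        return (wav - wav.mean()) / (wav.std() + 1e-8)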

Thanks again, Tsun-An

raikar8 commented 3 years ago

Thanks, Tsun-An, for your detailed reply.

1) Since you are using a kind of phonetic loss, I think it should help ASR too. I have not tried it yet, but it will be interesting to see its performance. I will let you know once I am done.

2) I want to test the model on noisy .wav files that I have. The current version of dataset.py only denoises a chunk of each file, as I understand from the following code:

def __getitem__(self, idx):
    clean = torchaudio.load(self.clean_path[idx])[0]
    noisy = torchaudio.load(self.noisy_path[idx])[0]

    noisy = self.normalize(noisy)
    length = clean.size(-1)
    clean.squeeze_(0)
    noisy.squeeze_(0)
    # randomly crop a fixed 16384-sample segment from the utterance
    start = torch.randint(0, length - 16384 - 1, (1, ))
    end = start + 16384
    clean = clean[start:end]
    noisy = noisy[start:end]

    return noisy, clean

I wanted to ask if I can replace the above code with this one:

def __getitem__(self, idx):
    clean = torchaudio.load(self.clean_path[idx])[0]
    noisy = torchaudio.load(self.noisy_path[idx])[0]

    noisy = self.normalize(noisy)
    length = clean.size(-1)
    clean.squeeze_(0)
    noisy.squeeze_(0)

    return noisy, clean

Basically, I am eliminating the start/end cropping. Should it be able to denoise the whole file now, or is there something else we need to do? With the current version of generate.py I am getting small chunks of denoised audio. Or is the original dataset.py written only for training?

Thanks, Aditya Raikar

aleXiehta commented 3 years ago

> Basically, I am eliminating the start/end cropping. Should it be able to denoise the whole file now, or is there something else we need to do? With the current version of generate.py I am getting small chunks of denoised audio. Or is the original dataset.py written only for training?

To generate the full enhanced utterance with the same length as its input, you need to use the padding function in dataset.py:

clean = torchaudio.load(self.clean_path[idx])[0]
clean = self.padding(clean).squeeze(0)

As you can see in generate.py, the output utterance is then truncated to match the original length:

torchaudio.save(os.path.join(save_root, filename), e[:, :l].cpu(), 16000)
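
Putting these pieces together, whole-file enhancement can be sketched roughly as below. Treat this as illustrative only: model, padding, noisy_path, save_root, and filename stand in for the corresponding objects in generate.py.

    import os
    import torch
    import torchaudio

    # Illustrative sketch of whole-file enhancement; the names below stand in
    # for the actual objects defined in generate.py.
    noisy = torchaudio.load(noisy_path)[0]   # waveform of shape (1, L)
    l = noisy.size(-1)                       # remember the original length
    padded = padding(noisy)                  # pad to a length the model accepts
    with torch.no_grad():
        e = model(padded.cuda())             # enhanced waveform, possibly padded
    torchaudio.save(os.path.join(save_root, filename), e[:, :l].cpu(), 16000)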
raikar8 commented 3 years ago

Thanks again. So we can train on any custom dataset using the settings you have given in the original dataset.py?

def __getitem__(self, idx):
    clean = torchaudio.load(self.clean_path[idx])[0]
    noisy = torchaudio.load(self.noisy_path[idx])[0]

    noisy = self.normalize(noisy)
    length = clean.size(-1)
    clean.squeeze_(0)
    noisy.squeeze_(0)
    start = torch.randint(0, length - 16384 - 1, (1, ))
    end = start + 16384
    clean = clean[start:end]
    noisy = noisy[start:end]

    return noisy, clean
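
If it helps, here is a hedged sketch of pointing this dataset at a custom corpus, assuming the class takes parallel lists of clean and noisy file paths (the directory names are illustrative, not from the repository):

    import os

    # Illustrative only: build parallel clean/noisy path lists for a custom
    # corpus; the actual Dataset constructor in dataset.py may differ.
    clean_root, noisy_root = 'data/clean', 'data/noisy'
    filenames = sorted(os.listdir(clean_root))
    clean_paths = [os.path.join(clean_root, f) for f in filenames]
    noisy_paths = [os.path.join(noisy_root, f) for f in filenames]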
raikar8 commented 3 years ago

Also, it throws "IndexError: tuple index out of range" when I run generate.py, pointing to "l = torch.LongTensor([b[2] for b in batch])" in utils.py.

Replacing that with "l = torch.LongTensor([len(b[0]) for b in batch])" solves it. Can you please confirm?
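
For context, the error suggests the collate function in utils.py expects a third element (a length) in each batch item that the inference dataset does not provide. A minimal sketch of a collate function with the suggested fix, assuming each item is a (noisy, clean) pair of 1-D tensors, might look like:

    import torch
    from torch.nn.utils.rnn import pad_sequence

    def collate_fn(batch):
        # Sketch of the suggested fix: derive lengths from the noisy waveforms
        # instead of reading a missing third tuple element (b[2]).
        noisy = pad_sequence([b[0] for b in batch], batch_first=True)
        clean = pad_sequence([b[1] for b in batch], batch_first=True)
        l = torch.LongTensor([len(b[0]) for b in batch])
        return noisy, clean, l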