asteroid-team / asteroid

The PyTorch-based audio source separation toolkit for researchers
https://asteroid-team.github.io/
MIT License

Problem with results and WHAM wavs #56

Closed dakenan1 closed 4 years ago

dakenan1 commented 4 years ago

Hi, nice work on Conv-TasNet and the WSJ0 experiment! But there are a few questions I am confused about, because I can't get the 12.7 dB on WHAM that the paper reports. So I would like to know:

  1. Is there any difference between wsj0 and WHAM! when separated by Conv-TasNet?
  2. I found that the same wav file in WSJ0 and WHAM has a different size: the one in WHAM is usually twice as big as the one in wsj0. Is there any reason for this? Thanks for your answer!
dakenan1 commented 4 years ago

Is the difference between the datasets something like 'the number of sources changes from 2 to 3 (one for noise)', or is there extra processing after feeding the samples into the TasNet model?

mpariente commented 4 years ago

Hi, I have some answers, and also some questions for you, because I didn't understand everything.

But there are a few questions I am confused about, because I can't get the 12.7 dB on WHAM that the paper reports.

Which paper exactly? I can't find the 12.7 dB you're referring to in the WHAM paper.

  1. Is there any difference between wsj0 and WHAM! when separated by Conv-TasNet?

The sep_clean task in WHAM is considered to be strictly equivalent to wsj0-mix. Does this answer your question?

  2. I found that the same wav file in WSJ0 and WHAM has a different size: the one in WHAM is usually twice as big as the one in wsj0. Is there any reason for this?

The examples differ between the datasets, I agree. I believe the scaling in WHAM is different to get closer to real recordings. But that's not my choice; I strictly follow the script they provided to generate the data.

Is the difference between the datasets something like 'the number of sources changes from 2 to 3 (one for noise)', or is there extra processing after feeding the samples into the TasNet model?

I'm really not sure what you're asking here. If you are referring to the whamr folder, it's the place for a new dataset: an extension of WHAM to reverberant conditions, hence WHAMR. The paper is here.

Most importantly, what results are you trying to replicate, with which architecture and on which task?

dakenan1 commented 4 years ago

Ah, thanks for your answers.

  1. Actually, I read your paper 'Filterbank Design for End-to-End Speech Separation' and tried to reproduce your speech separation results with Conv-TasNet on WHAM! (which is where the 12.7 dB comes from). I am now training the model and getting an SI-SDR of about 9 dB after 50 epochs, keeping the same parameters as the Conv-TasNet paper. So I want to find out why I can't get 12.7 dB: is there any modification to Conv-TasNet when separating speech from WHAM? (for example, setting the number of sources to 3: 2 speech + 1 noise)
  2. The size of each wav file is something I noticed; I guess it is because the noise recordings are two-channel, but I am not sure whether it influences the result. You can forget about it if you didn't notice the difference.
mpariente commented 4 years ago

OK, this is clearer, thanks.

Could you please copy-paste here the conf.yml file which is in your exp folder (not the one in local but the one in exp/train...)? I'll be able to help you better after that. Thx

dakenan1 commented 4 years ago

My project has been uploaded; you can see more details at: https://github.com/dakenan1/Conv-Tasnet-Arxiv/blob/master/scripts/run_tasnet.sh

    #!/bin/bash

    set -euo pipefail

    lr="1e-3"
    data_dir="data"  # mirana's dir
    dconv_norm_type='gLN'
    active_func="relu"
    date=$(date "+%Y%m%d")
    causal="false"
    save_name="tasnet_${date}_${active_func}_${dconv_norm_type}_${lr}"
    mkdir -p exp/${save_name}

    num_gpu=6  # mirana's gpu
    batch_size_single_gpu=4
    batch_size=$((num_gpu * batch_size_single_gpu))
    CUDA_VISIBLE_DEVICES='2,3,4,5,6,7' python -u steps/run_tasnet.py \
        --decode="false" \
        --batch-size=${batch_size} \
        --learning-rate=${lr} \
        --weight-decay=1e-5 \
        --epochs=50 \
        --data-dir=${data_dir} \
        --modelDir="exp/${save_name}" \
        --use-cuda="true" \
        --autoencoder-channels=256 \
        --autoencoder-kernel-size=20 \
        --bottleneck-channels=256 \
        --convolution-channels=512 \
        --convolution-kernel-size=3 \
        --number-blocks=8 \
        --number-repeat=4 \
        --number-speakers=2 \
        --normalization-type=${dconv_norm_type} \
        --active-func=${active_func} \
        --causal=${causal}

mpariente commented 4 years ago

Hmm, if I understand well, you don't use Asteroid to train your model, is that right? If so, I cannot really help. You want to replicate the 12.7 dB improvement on the 8 kHz min WHAM dataset, with the separate-noisy task, right? Do you train on the separate-noisy task?

I advise you to run the recipe in asteroid with --task sep_noisy. If you have issues with that, I'll be able to help.
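
For reference, here is a minimal sketch of what the recipe's data loading looks like for that task. It assumes asteroid's WhamDataset class and a metadata directory produced by the recipe's data preparation stage; the module path, arguments, and paths may differ between versions:

    from torch.utils.data import DataLoader
    from asteroid.data.wham_dataset import WhamDataset  # module path may vary across versions

    # "data/wav8k/min/tr" is a hypothetical metadata dir written by the recipe's data prep
    train_set = WhamDataset("data/wav8k/min/tr", task="sep_noisy",
                            sample_rate=8000, segment=4.0)
    train_loader = DataLoader(train_set, batch_size=24, shuffle=True, num_workers=4)

    mixture, sources = next(iter(train_loader))
    # mixture: (batch, time), the s1 + s2 + noise mix; sources: (batch, 2, time), clean s1 and s2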

dakenan1 commented 4 years ago

Thanks for your patience! As you said, I am trying to do the separate-noisy task using Conv-TasNet. I separate the mixtures from WHAM in the same way as WSJ0, but I only get an SDR of 9 dB versus 16 dB on WSJ0, far from your 12.7 dB. The ONLY thing I do differently for WHAM compared to WSJ0 is the data generation step. So I am trying to find the difference between your method and mine. I have read your code for the separate-noisy task in Asteroid, and your Asteroid code handles WHAM the same way as Yi Luo's model (training 200 epochs without pre-training with a light TCN). So it really confuses me that my experiment only reaches 9 dB. I also noticed that in your paper you describe your experiments as using the full TCN after some epochs of a light TCN, which suggests the separate-noisy task has two stages in some way. Do you use this method for the separate-noisy task, or just Conv-TasNet's way? It is important to me if you can give me some advice. ~^-^~

dakenan1 commented 4 years ago

By the way, have you tried to replicate FurcaPy or FurcaPa (https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1373.pdf), which come from Zhiqiang Shi's paper and are revised models based on Conv-TasNet? I cannot reach the reported results (18.2 dB and 18.3 dB) either. It really troubles me.

mpariente commented 4 years ago

If you want to reach the 12.7 dB on the noisy separation task, you have to train it on the noisy separation task. It is normal that a model trained only for clean separation doesn't generalize to noisy separation.
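
For context on the number itself: as said above, the 12.7 dB is an SI-SDR improvement, i.e. the SI-SDR of the estimates minus the SI-SDR of the mixture, so improvements should be compared rather than raw SI-SDR. A minimal sketch of that computation, assuming asteroid's PIT loss API (PITLossWrapper and pairwise_neg_sisdr):

    import torch
    from asteroid.losses import PITLossWrapper, pairwise_neg_sisdr

    # Permutation-invariant negative SI-SDR: -loss is the mean SI-SDR in dB
    loss_func = PITLossWrapper(pairwise_neg_sisdr, pit_from="pw_mtx")

    # Dummy tensors for illustration: batch of 4, two sources, 3 s at 8 kHz
    sources = torch.randn(4, 2, 24000)
    est_sources = sources + 0.1 * torch.randn(4, 2, 24000)
    mixture = sources.sum(dim=1)

    sisdr_est = -loss_func(est_sources, sources)
    sisdr_mix = -loss_func(mixture.unsqueeze(1).repeat(1, 2, 1), sources)
    sisdr_improvement = sisdr_est - sisdr_mix  # what the reported dB figures measure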

I also noticed that in your paper you describe your experiments as using the full TCN after some epochs of a light TCN, which suggests the separate-noisy task has two stages in some way. Do you use this method for the separate-noisy task, or just Conv-TasNet's way?

There are not two steps. The final experiments were done in one pass, i.e. training each model directly. The experiments with the LightTCN were just to gain insight into the importance of each parameter.

By the way, have you tried to replicate FurcaPy or FurcaPa (https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1373.pdf), which come from Zhiqiang Shi's paper and are revised models based on Conv-TasNet?

No, we haven't yet. We are very happy to receive contributions to support them. Would you like to contribute by integrating them into Asteroid?

dakenan1 commented 4 years ago

If you want to reach the 12.7 dB on the noisy separation task, you have to train it on the noisy separation task. It is normal that a model trained only for clean separation doesn't generalize to noisy separation.

No, we haven't yet. We are very happy to receive contributions to support them. Would you like to contribute by integrating them into Asteroid?

What does 'train it on the noisy separation task' mean? I trained my model with WHAM! (not WSJ0) as the training dataset, and evaluated on the noisy separation task. Is there any difference from what you did? Ah, that is a good idea. I will try to integrate my implementation into your project if I replicate the furcX results.

mpariente commented 4 years ago

WHAM! is a dataset that has 4 tasks (please see the paper). Noisy separation is the task of estimating the two clean sources given the mixture of these two sources and noise (find s1 and s2 from s1 + s2 + n). In the WHAM! folder structure, the mixture you want is mix_both, and you estimate s1 and s2 from it.
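
To make the four tasks concrete, here is a sketch of how they map onto the WHAM! folder structure (directory names as in the WHAM! generation scripts; treat the exact mapping as an assumption and check the paper and scripts):

    # Mixture folder -> target folders, per WHAM! task (assumed mapping, see the WHAM! paper)
    WHAM_TASKS = {
        "sep_clean":  {"mixture": "mix_clean",  "targets": ["s1", "s2"]},   # equivalent to wsj0-2mix
        "sep_noisy":  {"mixture": "mix_both",   "targets": ["s1", "s2"]},   # the task discussed here
        "enh_single": {"mixture": "mix_single", "targets": ["s1"]},         # denoise a single speaker
        "enh_both":   {"mixture": "mix_both",   "targets": ["mix_clean"]},  # denoise the 2-speaker mix
    }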

Could you try to replicate the results with Asteroid? At least I'll be sure of what's happening between training and evaluation.

You can open a PR when you want; once it's tested and the results are replicated, we can integrate it into core.

mpariente commented 4 years ago

Hi, did you manage to replicate the results with Asteroid?

mpariente commented 4 years ago

Closing this. Feel free to re-open it if you have problems with the Asteroid code.