Enny1991 / beamformers

Easy to use Beamformers for multi-channel speech separation/enhancement
MIT License
184 stars, 48 forks

how do you get mix.wav and nn.wav #1

Open zcy618 opened 4 years ago

zcy618 commented 4 years ago

hi dear friend: I read your code, and I have one question: how do you make mix.wav and nn.wav? For example:

```
:param mixture: nd_array (n_mics, time) of the mixture recordings
:param noise: nd_array (n_mics, time) of the noise recordings
:param target: nd_array (n_mics, time) of the target recordings
:param frame_len: int (self explanatory)
:param frame_step: int (self explanatory)
:return: the enhanced signal
```

Actually, if we knew the noise, we could surely do a lot, so how do you obtain it, please? Thanks.

Enny1991 commented 4 years ago

Hi zcy, I am not sure I understand the question. The wav files mix.wav and nn.wav are simple examples that you can find in the wavs folder. In general, the param mixture is the noisy signal recorded from your microphones, the param noise is a sample of the noise affecting the mixture, recorded from your microphones, and target is a sample of the target you want to extract from the mixture, also recorded from the microphones. You can get all of these recordings separately; the noise and target are simply used to estimate the beamformers, they are not the ground truth values.

Let me know if this answers your question!
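To make the shapes concrete, here is a minimal sketch with synthetic stand-ins for the three separate recordings. The `(n_mics, time)` layout follows the docstring quoted above; the sample rate, durations, and random data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics, sr = 4, 16000

# Three *separate* multichannel recordings: the noisy mixture you want
# to enhance, plus short calibration samples of the noise alone and of
# the target alone (random arrays stand in for real audio here).
mixture = rng.standard_normal((n_mics, 10 * sr))  # 10 s to enhance
noise = rng.standard_normal((n_mics, 5 * sr))     # 5 s noise-only sample
target = rng.standard_normal((n_mics, 5 * sr))    # 5 s target-only sample

# The noise/target samples need not be time-aligned with the mixture;
# only the microphone/source geometry must match.
for rec in (mixture, noise, target):
    assert rec.shape[0] == n_mics
```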

Cheers, +Enea

dyustc commented 1 year ago

@Enny1991 I guess I am having the same confusion here. In reality we won't have the exact noise file and target file; in practical use we will just have the mixed noisy speech file. So how is this beamformer going to work, since I don't have the noise and target?

I think this is the inference stage, not the training stage, so you won't have a target.

Enny1991 commented 1 year ago

@dyustc Yes, this is the inference phase, as you mention. The noise and target recordings that you provide to the function are not the ones that are mixed in mix.wav. Those are samples that you might have recorded previously; they are indeed used to extract the spatial features of the target and the noise (training), and then used to extract the target from the mixture.

You just need one short sample for noise and target (e.g. 4-5 seconds) and then you can apply the beamformer on as much data as you need.
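A sketch of why one short sample suffices: once the beamformer is estimated, it reduces to one weight vector per frequency bin, and applying it is just a filter-and-sum over microphones that works for any number of STFT frames. Shapes and names below are illustrative, not the repo's API:

```python
import numpy as np

rng = np.random.default_rng(1)
n_mics, n_freq = 4, 257

# One complex weight vector per frequency bin, learned once from the
# short noise/target samples (random stand-ins here).
w = rng.standard_normal((n_freq, n_mics)) + 1j * rng.standard_normal((n_freq, n_mics))

def apply_beamformer(w, X):
    """Filter-and-sum: X is an STFT of shape (n_mics, n_freq, n_frames)."""
    # Sum over microphones for every (frequency, frame) bin.
    return np.einsum('fm,mft->ft', w.conj(), X)

# The same fixed weights apply to recordings of any length.
short_mix = rng.standard_normal((n_mics, n_freq, 50))
long_mix = rng.standard_normal((n_mics, n_freq, 5000))
assert apply_beamformer(w, short_mix).shape == (n_freq, 50)
assert apply_beamformer(w, long_mix).shape == (n_freq, 5000)
```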

Does that clarify?

dyustc commented 1 year ago

@Enny1991 Much appreciated for the quick reply. Let me try to understand: is timeA the short sample you mentioned, so that with this timeA as a start, we can infer timeB (as much as I want) for real mixtures?

```
:param mixture: nd_array (n_mics, timeA + timeB) of the mixture recordings
:param noise: nd_array (n_mics, timeA) of the noise recordings
:param target: nd_array (n_mics, timeA) of the target recordings
```

For timeA, we have a corresponding noisy, noise, and target. Also, I wonder whether noise and target should be mono, rather than multi-mic like the input noisy signal.

Enny1991 commented 1 year ago

@dyustc Yes, timeA will work to learn the beamformer, but it does not have to be exactly timeA, it can be some timeA'! Nevertheless, noise and target HAVE to be multichannel recordings, with the microphones and the sources in the same positions as for the noisy recording. The beamformer works on spatial features, so a mono recording is not enough to learn the proper beamformer.
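To illustrate what those spatial features are: a far-field source reaches each microphone with a slightly different delay, and those inter-channel phase differences are exactly what the beamformer exploits; a mono recording carries none of them. A small synthetic example (the delays and geometry are hypothetical):

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr  # 1 s of audio

# A 440 Hz source arriving at 4 mics with small relative delays
# (hypothetical geometry).  These inter-channel delays are the
# spatial cue a beamformer needs -- absent in a mono recording.
delays = np.array([0.0, 1e-4, 2e-4, 3e-4])  # seconds
multi = np.stack([np.sin(2 * np.pi * 440 * (t - d)) for d in delays])

# Phase of each channel relative to mic 0 at the 440 Hz FFT bin:
spec = np.fft.rfft(multi, axis=1)
phases = np.angle(spec[:, 440] * np.conj(spec[0, 440]))

# The phase lag grows linearly with the delay: phase = -2*pi*f*delay.
assert np.allclose(phases, -2 * np.pi * 440 * delays, atol=1e-3)
```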

dyustc commented 1 year ago

@Enny1991 Yes, I am looking into the calculate_masks function. As I understand it, here we use the noise and target to compute the masks, then the beamforming weights (depending on whether it's MVDR or GEV), then filter-and-sum, and it's done.

And in the paper, it's a BLSTM or FF model (choose either) with the multi-channel noisy file as input, used to infer the mask for each T-F bin of each channel. So I guess the mask inferred this way could change over time.

But in this repo, since we're using a fixed noise and target, the mask is essentially fixed. I am sure it can still be put to use, but this is the difference, right?
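For concreteness, here is a generic ratio-mask sketch of the idea described above, built from the separately recorded target and noise spectrograms. This is an illustration with synthetic magnitudes and a Wiener-like formula, not necessarily the exact expression used in calculate_masks:

```python
import numpy as np

rng = np.random.default_rng(4)
n_freq, n_frames = 257, 100

# Magnitude spectrograms of the separately recorded target and noise
# calibration samples (synthetic stand-ins here).
T = np.abs(rng.standard_normal((n_freq, n_frames)))
N = np.abs(rng.standard_normal((n_freq, n_frames)))

# Wiener-like ratio mask: close to 1 where the target dominates,
# close to 0 where the noise dominates.
mask = T**2 / (T**2 + N**2 + 1e-12)
assert mask.shape == (n_freq, n_frames)
assert (mask >= 0).all() and (mask <= 1).all()
```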

Enny1991 commented 1 year ago

@dyustc Yes, you are correct: that paper uses a data-driven approach to learn the (possibly time-varying) separation mask. The solution provided by this repo is indeed fixed for a certain scenario.

In the paper you cite, the authors estimate a T-F mask and not a beamformer. Beamformers usually achieve a better output SDR, given that the estimation for each T-F bin of the mask is not unconstrained. If you are interested in hybrid approaches, I can suggest this paper, where we use various NNs to estimate beamformers in real time: thanks to the fact that GEV and MVDR have differentiable definitions, they can be part of the model training.
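For reference, a minimal numpy sketch of the classic MVDR construction for a single frequency bin, using synthetic covariances. Taking the steering vector as the principal eigenvector of the target covariance is one common choice; the repo's exact implementation may differ:

```python
import numpy as np

rng = np.random.default_rng(3)
n_mics = 4

# Synthetic per-frequency spatial covariances (one bin shown).
d_true = rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics)
Phi_xx = np.outer(d_true, d_true.conj())  # rank-1 target covariance
A = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
Phi_nn = A @ A.conj().T + 1e-3 * np.eye(n_mics)  # full-rank noise covariance

# Steering vector: principal eigenvector of the target covariance.
_, eigvecs = np.linalg.eigh(Phi_xx)
d = eigvecs[:, -1]

# MVDR weights: w = Phi_nn^{-1} d / (d^H Phi_nn^{-1} d)
num = np.linalg.solve(Phi_nn, d)
w = num / (d.conj() @ num)

# Distortionless constraint: the target direction passes with unit gain.
assert np.isclose(w.conj() @ d, 1.0)
```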

dyustc commented 1 year ago

@Enny1991 Many thanks for the suggestion. The model seems pretty small, and useful. Just a quick question: besides WER, STOI, and SDR, have you experimented with this beamformer for AEC? Perhaps it gives steadier and better echo-cancellation performance?
Also, is there a repo for this paper? I want to integrate it or refer to it in our AEC network if possible.

Enny1991 commented 1 year ago

@dyustc We have never tried it practically with echo cancellation, but I see no reason why it would not work nicely, as it is a similar problem. Yes, there is a repo for that here; it is not as clean as this one and uses some code that might be a little outdated, but the math is there if you want to check it out. In case of questions, you can always open an issue there.

dyustc commented 1 year ago

@Enny1991 Many thanks, this is a lot of help. I'll try to work on that one.

dyustc commented 1 year ago

Hi @Enny1991, I opened an issue there in your RealMuD repo, not sure if you could see it. So I pasted it here, just in case.