YangangCao / TRUNet

unofficial PyTorch implementation of 《REAL-TIME DENOISING AND DEREVERBERATION WITH TINY RECURRENT U-NET》

Input Feature to TRUNET #5

Open · yugeshav opened this issue 2 years ago

yugeshav commented 2 years ago

Hi

As per the paper, 4 features must be concatenated as the input to TRUNet:

  1. log spectrum
  2. PCEN
  3. real part of demodulated phase
  4. imaginary part of demodulated phase

so the input becomes (batch size, 4 features, number of STFT frames, number of STFT bins), i.e. a 4-dimensional tensor.

But in the sample code you show the input as 3-dimensional, (1, 4, 257), since the first layer is a Conv1d.

I'm confused: is the input to TRUNet 3-dimensional or 4-dimensional?

Regards, Yugesh
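For reference, here is a minimal sketch (not from this repo) of how the two views fit together: the four channels are stacked into a 4-D tensor, then time is folded into the batch axis so a Conv1d front end sees a 3-D input. The PCEN and phase-demodulation steps below are placeholders, not the paper's exact operations:

```python
import torch

# Sketch only: B batches, T STFT frames, F = n_fft // 2 + 1 = 257 bins.
B, T, F = 1, 100, 257
spec = torch.randn(B, T, F, dtype=torch.complex64)  # stand-in for a complex STFT

log_mag = torch.log(spec.abs() + 1e-8)     # 1) log spectrum
pcen = log_mag                             # 2) PCEN placeholder (real PCEN smooths over time)
phase = spec / (spec.abs() + 1e-8)         # unit-magnitude phase
cos, sin = phase.real, phase.imag          # 3) and 4) real / imaginary parts

x = torch.stack([log_mag, pcen, cos, sin], dim=1)  # (B, 4, T, F): the 4-D view

# Because the first layer is a Conv1d over frequency, each frame is processed
# independently: folding time into the batch axis yields the 3-D (N, 4, 257)
# shape from the sample code, with N = B * T.
x3d = x.permute(0, 2, 1, 3).reshape(B * T, 4, F)   # (B*T, 4, 257)
```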

atabakp commented 9 months ago

@atabakp did you extract the magnitude of the input spectrum with torch.abs() and not torch.real?

Yes, with abs().
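A quick illustration of the distinction, using only standard torch calls:

```python
import torch

x = torch.randn(16000)
spec = torch.stft(x, n_fft=512, window=torch.hann_window(512), return_complex=True)

mag = torch.abs(spec)         # magnitude: sqrt(real^2 + imag^2) -- what is meant here
real_part = torch.real(spec)  # only the real component; can be negative
```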

eagomez2 commented 9 months ago

softmax with temperature,

Nope, still doesn't work. The only thing that "worked" was skipping the PHM and multiplying one channel of the last output with the input, but I haven't waited for it to converge yet. I'll try these fixes, thanks.

One more question for @atabakp: I mentioned that my losses were wrong. By that I meant that in the paper the loss is the sum of the losses for the direct source, the noise, and the reverberant path (last equation of Section 3.3). How did you calculate them: did you do this sum of 3, or something else? Because I don't see a good way to calculate the target reverberant path, I just subtract the clean signal from the reverbed signal, and I use a tensor of 1e-6 as the noise target (since I only train for dereverberation). Also for @eagomez2: I gather that you used a different loss - did you calculate it with only the direct source?

I calculated it for both direct and residual, but as mentioned by @atabakp, I'm inclined to think that direct alone should be good enough, although I haven't tried this
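For concreteness, a minimal sketch of the three-term sum as described above; the function name, the L1 choice, and the signature are assumptions, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def three_term_loss(est_direct, est_reverb, est_noise, clean, reverbed):
    # Total loss = loss(direct) + loss(reverberant path) + loss(noise),
    # per the last equation of Section 3.3. Targets as described in the
    # thread: reverberant-path target = reverbed - clean, and a tiny
    # constant tensor as the noise target when training only for
    # dereverberation. L1 stands in for the paper's actual spectral loss.
    target_reverb = reverbed - clean
    target_noise = torch.full_like(est_noise, 1e-6)
    return (F.l1_loss(est_direct, clean)
            + F.l1_loss(est_reverb, target_reverb)
            + F.l1_loss(est_noise, target_noise))
```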

JBloodless commented 9 months ago

@atabakp @eagomez2 sorry for bothering again.

I trained a couple of experiments with different parameters. I can MOSTLY achieve dereverberation, but in all of my experiments there is a weird artifact in the lower frequencies. It looks like this strange line:

[Screenshot 2024-02-29 at 14:00:01]

Did you ever encounter this artifact?

eagomez2 commented 9 months ago

Hi @JBloodless ,

In my case at least I didn't observe such problems using the GAN setup previously described.

atabakp commented 9 months ago

Did you ever encounter this artifact?

In the process, I discard the DC bin, substituting it with zeros. This situation may also arise from the use of paddings in transpose convolutions, leading to a consistent output of a constant value for the low-frequency bins.
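As a toy illustration of the padding point (not the repo's code): positions near the edges of a transposed-convolution output are covered by fewer kernel taps, so the lowest bins can come out systematically different from the interior ones.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Transposed conv with stride and padding, bias disabled to isolate the effect.
tconv = nn.ConvTranspose1d(1, 1, kernel_size=5, stride=2, padding=2, bias=False)
x = torch.ones(1, 1, 8)
y = tconv(x).squeeze()
print(y)  # interior values follow a regular pattern; the edge values differ
```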

JBloodless commented 9 months ago

In the process, I discard the DC bin, substituting it with zeros.

Do you discard it in the input only or also in the target?

atabakp commented 9 months ago

Do you discard it in the input only or also in the target?

I discard it in the input and replace it with zero in the output; that means the model is not predicting a mask for the DC bin.

JBloodless commented 9 months ago

I discard it in the input and replace it with zero in the output; that means the model is not predicting a mask for the DC bin.

In my case the model stopped dereverberating the lower frequencies at all, and overall it sounds as if nothing was dereverberated (psychoacoustics being what they are). Am I understanding correctly that you zeroed out the lowest bin in the input and in the model output, but not in the target (for the loss calculation)? If so, which n_fft did you use?

For context: the first is the reverberant input, the second is the model output without zeroing the DC bin, and the last is the model output with the DC bin zeroed.

[Screenshot 2024-03-05 at 13:59:35]

atabakp commented 9 months ago

Am I understanding correctly that you zeroed out the lowest bin in the input and in the model output, but not in the target (for the loss calculation)? If so, which n_fft did you use?

I am not considering the lowest bin in any calculation, and when reconstructing the signal (iFFT) I append a 0 as the lowest bin.
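A minimal sketch of that DC-bin handling, assuming frequency is the last axis; the helper names are hypothetical:

```python
import torch

def strip_dc(spec):
    # Drop the DC (lowest) frequency bin before it enters the model.
    return spec[..., 1:]

def restore_dc(masked_spec):
    # Re-append a zero DC bin before the inverse transform, so the model
    # never has to predict a mask for the DC bin.
    dc = torch.zeros_like(masked_spec[..., :1])
    return torch.cat([dc, masked_spec], dim=-1)
```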

JBloodless commented 8 months ago

@atabakp @eagomez2 Did you try to train this model for 48kHz?

eagomez2 commented 8 months ago

@JBloodless yes, I've trained it at 48kHz

atabakp commented 8 months ago

@atabakp @eagomez2 Did you try to train this model for 48kHz?

I tried with 16 kHz.

JBloodless commented 8 months ago

@JBloodless yes, I've trained it at 48kHz

Can you share what modifications you made to the architecture? For now I just made the convolutions and the GRU bigger, but it seems that's not enough.

eagomez2 commented 8 months ago

Can you share what modifications you made to the architecture?

I didn't make any changes. I just disabled any resampling (my data was originally at 48 kHz) and trained it normally.

JBloodless commented 8 months ago

I didn't make any changes. I just disabled any resampling (my data was originally at 48 kHz) and trained it normally.

You mentioned the paper on GAN losses that you used. Did you use the same setup as in that paper (a discriminator on the waveform, plus adversarial and reconstruction losses)?

eagomez2 commented 8 months ago

@JBloodless yes, it's the same setup

caihaoran-00 commented 4 months ago

I also have a question about the TGRU along the same lines. According to the paper:

The decoder is composed of a Time-axis Gated Recurrent Unit (TGRU) block and 1D Transposed Convolutional Neural Network (1D-TrCNN) blocks. The output of the encoder is passed into a unidirectional GRU layer to aggregate the information along the time-axis

But then, the input to this layer is (1, 16, 64), and according to PyTorch's GRU documentation, when batch_first=True the 2nd dimension is the sequence length, which is the case here because batch_first defaults to True and is not changed when the TGRU layer is defined: https://github.com/YangangCao/TRUNet/blob/main/TRUNet.py#LL131C26-L131C26 To my understanding (please correct me if I'm wrong), the TGRU layer will not really aggregate information along the time axis, but will instead play a similar role to the FGRU, only with a unidirectional layer. At first I assumed that batch_first should be set to False in order to apply the nn.GRU along the first dimension, which is the original time dimension.

#4 (comment)

Hi @atabakp, you mean we need to modify the network, right?

The shape of TGRU's input (x9) is (Time, 16, 64). Since it should aggregate the information along the time axis and batch_first=True in your implementation, the input to the TGRU should have "Time" as its second dimension. Alternatively, set batch_first=False for the TGRU.

I just added x9 = x9.transpose(0, 1) to make Time the second dimension.

Then GRUBlock's input shape is (16, T, 64) and its output shape is (16, 64, T). Next, FirstTrCNN's output shape will be (16, 64, 2(T-1)+1). So how can x11, of shape (16, 64, 2(T-1)+1), be concatenated with x5, of shape (T, 128, 32)?
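For reference, a minimal sketch of the batch_first semantics behind the transpose mentioned above, with T = 10 as a stand-in number of frames:

```python
import torch
import torch.nn as nn

# With batch_first=True, nn.GRU treats dim 1 as the sequence axis, so the
# transpose is what actually makes the GRU recur over Time.
gru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)

T = 10
x9 = torch.randn(T, 16, 64)   # (Time, 16, 64) as in the thread
x9_t = x9.transpose(0, 1)     # (16, Time, 64): Time is now dim 1
out, _ = gru(x9_t)
print(out.shape)              # torch.Size([16, 10, 64])
```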

@atabakp @JBloodless @eagomez2 All responses are welcome

JBloodless commented 4 months ago

@caihaoran-00 I modified the whole model to work with a separate batch dimension, so my answer is probably not that relevant, but I didn't do anything out of the ordinary with the TrCNNs; they already have all the logic needed for such a concat (mostly padding)
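A minimal sketch of what such concat logic typically looks like, assuming channels on dim 1 and time on the last axis; match_and_concat is a hypothetical helper, not the repo's code:

```python
import torch

def match_and_concat(upsampled, skip):
    # Crop the transposed-convolution output to the skip connection's time
    # length before concatenating along channels -- the usual U-Net way to
    # reconcile 2(T-1)+1 vs. T length mismatches.
    upsampled = upsampled[..., : skip.shape[-1]]
    return torch.cat([upsampled, skip], dim=1)
```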

caihaoran-00 commented 4 months ago

I didn't do anything out of the ordinary with the TrCNNs; they already have all the logic needed for such a concat (mostly padding)

OK, thank you very much. I also have a few questions about synthesizing reverberant signals: 1. Is it necessary to normalize the RIRs? 2. Is it necessary to add the clean signal back in after the RIR is convolved with the clean signal?

I'm not sure exactly how to synthesize a reverberant signal with an amplitude similar to the clean signal's (maybe this is what's causing the strange problems in my network training). @JBloodless

JBloodless commented 4 months ago

1. Is it necessary to normalize the RIRs? 2. Is it necessary to add the clean signal back in after the RIR is convolved with the clean signal?

The convolved signal doesn't need to be normalized again; changes in amplitude are technically part of the reverberation. Just generate an IR and convolve it with the clean input.
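A minimal sketch of that recipe with synthetic placeholder signals (the decay constant, lengths, and sample rate are arbitrary assumptions):

```python
import numpy as np
from scipy.signal import fftconvolve

sr = 16000
clean = np.random.randn(sr).astype(np.float32)  # stand-in for a clean signal

# Stand-in RIR: exponentially decaying noise, ~250 ms long.
t = np.arange(int(0.25 * sr))
rir = (np.random.randn(t.size) * np.exp(-t / (0.05 * sr))).astype(np.float32)

# reverbed = clean (*) RIR, truncated to the clean length; no re-normalization.
reverbed = fftconvolve(clean, rir, mode="full")[: len(clean)]
```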

caihaoran-00 commented 4 months ago

OK. Do you use n or (n, 1) for your 2-D kernel?