yugeshav opened this issue 2 years ago
> @atabakp did you extract the magnitude of the input spectrum with `torch.abs()` and not `torch.real()`?

Yes, with `abs()`.
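(For clarity, a minimal sketch of the distinction, with placeholder STFT parameters rather than the paper's:)

```python
import torch

# torch.abs() on a complex STFT gives the magnitude sqrt(re^2 + im^2),
# while torch.real() would only keep the real part.
signal = torch.randn(16000)                     # 1 s of dummy audio at 16 kHz
spec = torch.stft(signal, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
magnitude = torch.abs(spec)                     # |X(f, t)|, shape (257, frames)
```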
> softmax with temperature
Nope, still doesn't work. The only thing that "worked" is skipping the PHM and multiplying one channel of the last output with the input, but I haven't waited for it to converge yet. I'll try these fixes, thanks.
One more question for @atabakp: I mentioned that my losses were wrong. By that I meant that in the paper the loss is the sum of the losses for the direct source, the noise, and the reverberant path (last equation of section 3.3). How did you calculate them: did you do this sum of 3, or something else? I don't see a good way to compute the target for the reverberant path, so I just subtract the clean signal from the reverbed signal, and use a tensor of 1e-6 as the noise target (since I only train for dereverberation).

Also for @eagomez2: I gather that you used a different loss. Did you calculate it with only the direct source?
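In code, my current setup looks roughly like this (a hedged sketch of my own approach, not the paper's reference code; `F.l1_loss` stands in for whatever spectral loss is actually used):

```python
import torch
import torch.nn.functional as F

def total_loss(est_direct, est_noise, est_reverb,
               clean, reverbed, eps=1e-6):
    # Targets as described above (my assumption, not the paper's code):
    # - direct-path target: the clean signal
    # - reverberant-path target: reverbed minus clean
    # - noise target: a near-zero constant, since only dereverberation is trained
    target_reverb = reverbed - clean
    target_noise = torch.full_like(est_noise, eps)

    # Last equation of section 3.3: sum of the three per-source losses.
    return (F.l1_loss(est_direct, clean)
            + F.l1_loss(est_noise, target_noise)
            + F.l1_loss(est_reverb, target_reverb))
```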
I calculated it for both direct and residual, but as mentioned by @atabakp, I'm inclined to think that direct alone should be good enough, although I haven't tried this.
@atabakp @eagomez2 sorry for bothering again.
I trained a couple of experiments with different parameters. I can mostly achieve dereverberation, but in all of my experiments there is a weird artifact in the lower frequencies. It looks like this strange line:
Did you ever encounter this artifact?
Hi @JBloodless,
In my case at least, I didn't observe such problems using the GAN setup previously described.
In the process, I discard the DC bin, substituting it with zeros. This situation may also arise from the use of paddings in transpose convolutions, leading to a consistent output of a constant value for the low-frequency bins.
Do you discard it in the input only or also in the target?
I discard it in the input and replace it with zero in the output; it means the model is not predicting the mask for the DC bin.
In my case the model stopped dereverberating the lower frequencies at all, and overall it sounds like it wasn't dereverberated at all (psychoacoustics and all). Am I getting it correctly that you zeroed out the lowest bin in the input and in the model output, but not in the target (for loss calculation)? And if so, which n_fft did you use?
For context: the first is the reverbed input, the second is the output of the model without zeros in the DC bin, and the last is the output of the model with zeros in the DC bin.
I am not considering the lowest bin in any calculation, and when reconstructing the signal (iFFT) I append a 0 as the lowest bin.
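In other words, something along these lines (a hedged sketch, not the actual code; the STFT parameters are placeholders):

```python
import torch

# The model never sees or predicts the DC bin; it is re-appended as zeros
# only when the waveform is reconstructed.
n_fft, hop = 512, 128
window = torch.hann_window(n_fft)

x = torch.randn(16000)
spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                  window=window, return_complex=True)   # (257, frames)

model_in = spec[1:]                # discard the DC bin at the input
# ... run the model on model_in; here we just pass it through:
model_out = model_in

# Append a zero DC bin before the inverse STFT
dc = torch.zeros(1, model_out.shape[-1], dtype=model_out.dtype)
full = torch.cat([dc, model_out], dim=0)                # back to (257, frames)
y = torch.istft(full, n_fft=n_fft, hop_length=hop, window=window)
```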
@atabakp @eagomez2 Did you try to train this model for 48kHz?
@JBloodless yes, I've trained it at 48kHz
I tried with 16K.
Can you share what modifications to the architecture you made? For now I just made the convolutions and the GRU bigger, but it seems that it's not enough.
I didn't make any changes. I just disabled any resampling algorithms (my data was originally at 48kHz) and trained it normally.
You mentioned a paper on GAN losses that you use. Did you use the same setup as in that paper (discriminator on the waveform, adversarial + reconstruction losses)?
@JBloodless yes, it's the same setup
I also have a question about the TGRU along the same lines. According to the paper:
> The decoder is composed of a Time-axis Gated Recurrent Unit (TGRU) block and 1D Transposed Convolutional Neural Network (1D-TrCNN) blocks. The output of the encoder is passed into a unidirectional GRU layer to aggregate the information along the time axis
But then, the input to this layer is `(1, 16, 64)`, and according to PyTorch's GRU documentation, when `batch_first=True` the 2nd dimension is the sequence length. That is the case here, because `batch_first` defaults to `True` and is not changed when the `TGRU` layer is defined: https://github.com/YangangCao/TRUNet/blob/main/TRUNet.py#LL131C26-L131C26

To my understanding (please correct me if I'm wrong), the `TGRU` layer will not really aggregate information along the time axis, but will instead play a similar role to the `FGRU`, only using a unidirectional layer. I first assumed that `batch_first` should be set to `False` in order to apply the `nn.GRU` along the first dimension, which is the original time dimension.
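A toy example of the concern (a hedged sketch with arbitrary sizes, not the repo's code):

```python
import torch
import torch.nn as nn

# With batch_first=True, nn.GRU recurs over dim 1 and treats dim 0 as the batch.
T = 100                        # number of STFT frames
x9 = torch.randn(T, 16, 64)    # the shape reaching TGRU: (Time, 16, 64)

gru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)

out, _ = gru(x9)               # recurrence runs over the 16-dim axis,
print(out.shape)               # (T, 16, 64): Time was treated as the batch

out_t, _ = gru(x9.transpose(0, 1))   # (16, T, 64): now recurs over Time
print(out_t.transpose(0, 1).shape)   # back to (T, 16, 64)
```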
Hi @atabakp, you mean we need to modify the network, right?
The shape of TGRU's input (x9) is (Time, 16, 64). Since it should aggregate the information along the time axis and `batch_first=True` in your implementation, the input of the TGRU should have "Time" as the second dimension. Or set `batch_first=False` for the TGRU.
I just add `x9 = x9.transpose(0, 1)` to make Time the second dimension.
Then GRUBlock's input shape is (16, T, 64) and its output shape is (16, 64, T). Next, FirstTrCNN's output shape will be (16, 64, 2(T-1)+1). So how do I concatenate x11 (16, 64, 2(T-1)+1) and x5 (T, 128, 32)?
@atabakp @JBloodless @eagomez2 All responses are welcome
@caihaoran-00 I modified the whole model to work with a separate batch dimension, so my answer is probably not so relevant, but I didn't do anything out of the ordinary with the TrCNNs; they already have all the necessary logic for such a concat (mostly paddings).
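For illustration, the usual length-matching logic looks something like this (a hedged sketch, assuming the batch/time dimensions have already been made consistent; not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def match_and_cat(upsampled, skip):
    """Pad or crop the transposed-conv output along the last axis so it
    matches the skip connection, then concatenate on channels."""
    # upsampled: (batch, C1, L1), skip: (batch, C2, L2)
    diff = skip.shape[-1] - upsampled.shape[-1]
    if diff > 0:
        upsampled = F.pad(upsampled, (0, diff))        # pad at the end
    elif diff < 0:
        upsampled = upsampled[..., :skip.shape[-1]]    # crop the excess
    return torch.cat([upsampled, skip], dim=1)

x11 = torch.randn(16, 64, 2 * (100 - 1) + 1)   # e.g. (16, 64, 199)
x5 = torch.randn(16, 128, 100)
print(match_and_cat(x11, x5).shape)            # (16, 192, 100)
```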
OK, thank you very much. I also have a few questions about synthesizing reverberant signals:
1. Is it necessary to normalize the RIRs?
2. Is it necessary to add the clean signal back after the RIR is convolved with the clean signal?

I'm not sure how exactly to synthesize a reverberated signal that has a similar amplitude to the clean signal (maybe this is causing strange problems with my network training). @JBloodless
The convolved signal doesn't need to be normalized again; changes in amplitude are technically part of the reverberation. Just generate an IR and convolve it with the clean input.
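Something like this minimal sketch (assuming numpy arrays at the same sample rate; `reverberate` is a hypothetical helper, not code from this repo):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve clean speech with a room impulse response.

    No re-normalization afterwards: the amplitude change is part of the
    reverberation itself.
    """
    # Full convolution, trimmed back to the clean signal's length so that
    # input and target stay time-aligned and equally long.
    return fftconvolve(clean, rir, mode="full")[: len(clean)]
```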
OK, do you use n or (n, 1) for your 2-D kernel?
Hi,
As per the paper, 4 features must be concatenated as the input to TRUNet, so the input becomes (batch size, 4 features, no. of STFT frames, no. of STFT bins), i.e. it is 4-dimensional.
But in the sample code you show the input as 3-dimensional, (1, 4, 257), since the first layer is a conv1d.
I'm confused: is the input to TRUNet 3-dimensional or 4-dimensional?
Regards, Yugesh
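For what it's worth, a hedged sketch of how the 4-D input from the paper could collapse to the 3-D shape the sample code expects, treating the STFT frames as the batch (an assumption, not confirmed by the authors):

```python
import torch

# The 4 features live on the channel axis, the 257 frequency bins are the
# "length" that Conv1d slides over, and the STFT frames act as the batch.
B, C, T, F_bins = 1, 4, 100, 257
x = torch.randn(B, C, T, F_bins)            # the 4-D tensor from the paper

# Fold time into the batch so each frame is processed independently:
x3d = x.permute(0, 2, 1, 3).reshape(B * T, C, F_bins)   # (B*T, 4, 257)

conv = torch.nn.Conv1d(in_channels=4, out_channels=64,
                       kernel_size=5, padding=2)
y = conv(x3d)                               # (B*T, 64, 257)
```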