Hi Owen,
Thanks for your contributions!
In your paper, you said you applied a series of strided 1D convolutions to the input waveform.
So the input waveform you referred to here (before fusion) is the original audio signal waveform without STFT, right?
Why and how do you process the 1D signal? Could you kindly explain this point for me?