aeromamba-super-resolution / aeromamba

Official implementation of "AEROMamba: An efficient architecture for audio super-resolution using generative adversarial networks and state space models", presented in LAMIR 2024 Workshop
Creative Commons Zero v1.0 Universal

Speech Handling? #3

Open chris-calo opened 1 week ago

chris-calo commented 1 week ago

Has this been tested with speech yet? I just trained it on the 4-16 kHz setting, per the original Aero recommendations (fixing some bugs along the way), against the VCTK dataset. I downsampled my test audio to 4 kHz for testing, and while I got it running, it only rendered static.

Any ideas? I can provide details where needed. I'm guessing the base model might be too tailored to music super-resolution at the moment.

On the positive, MUCH faster than Aero on my 4090s, making it more suitable for my real-time use-case.

abreuwallace commented 5 days ago

Hi, thanks for trying to extend it!

I never tried it on speech data during its development, but I always thought that it would be able to handle that kind of problem, since AERO does it, and there are also some Mamba applications for speech enhancement.

I ran a mini experiment with 2 speakers (1 train and 1 test) and it seems that the model is able to extend (although badly, obviously :)).

[Image: spec]

I see two possible reasons for that bug:

1) Your downsampling procedure is not being recognized by the model, so I suggest that you use a sox command such as sox "$input_file" -r freq_samp -c 1 "$output_file".
2) The prediction code must be changed in some way to handle this sampling frequency, since all of my settings are for 11.025 -> 44.1 kHz.

If the results in the output folder are being extended (in my case they are), then it is definitely one of those cases.
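
To help tell the two cases apart, here is a minimal sketch (not part of the AEROMamba code) that checks whether a downsampled test file really has the sample rate and channel count the model was trained on, and that builds the sox command quoted above. The function names `wav_info`, `check_lr_input`, and `sox_resample_cmd` are hypothetical, the 4000 Hz default is only an assumption taken from the 4-16 kHz setup described in this thread, and the check only works for plain PCM WAV files.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, num_channels, num_frames) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels(), wf.getnframes()


def check_lr_input(path, expected_sr=4000, expected_channels=1):
    """List mismatches between a low-resolution input file and the expected format.

    expected_sr=4000 is an assumption based on the 4-16 kHz setup in this
    thread; use 11025 for the 11.025 -> 44.1 kHz setting.
    """
    sr, channels, _ = wav_info(path)
    problems = []
    if sr != expected_sr:
        problems.append(f"sample rate is {sr} Hz, expected {expected_sr} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems


def sox_resample_cmd(input_file, output_file, freq_samp):
    """Build the sox command from the comment above as an argument list.

    Mirrors: sox "$input_file" -r freq_samp -c 1 "$output_file"
    The list can be passed to subprocess.run if sox is installed.
    """
    return ["sox", input_file, "-r", str(freq_samp), "-c", "1", output_file]
```

An empty list from `check_lr_input` means the file format matches what was expected, which would point toward the prediction code (case 2) rather than the downsampling step (case 1).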