facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size. #149

Closed · hjpr closed this 9 months ago

hjpr commented 9 months ago

Using the m4t_predict script results in this error, even when using the sample_input.mp3 files that are provided in /assets with the demo. If I load up the app.py version served through Gradio, it works correctly with the expected outputs.

This issue started to occur suddenly: I was able to use the m4t_predict script and retrieve audio outputs, but then it started throwing this tensor size error. Since I'm not familiar with what's going on under the hood, I tried recloning the repository and reinstalling the venv, which has not solved the issue.

The error is being generated from /lib/python3.11/site-packages/torch/nn/modules/conv.py at line 309.
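
For reference, the message itself is just PyTorch's convolution shape check: it fires when a Conv1d receives an input with zero length along the time dimension. A toy snippet (nothing to do with seamless internals, only to show what the message means) that raises the same error:

import torch
import torch.nn as nn

# Kernel size 1 is greater than an input size of 0, which trips the same
# shape check that conv.py raises in the traceback above.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=1)
empty_input = torch.zeros(1, 1, 0)  # (batch, channels, time=0)
conv(empty_input)  # RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). ...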

lin-xiaosheng commented 9 months ago

> Solved the problem. Under the hood, the demo converts arbitrary sample rates to 16 kHz and arbitrary data types to 16-bit int format. The m4t_predict script doesn't check or convert, so the tensor sizes end up incompatible if the sample rate or data type isn't converted before the translation request.
>
> Leaving this up for another newbie like me. I will put up a convert.py script that simply pulls Gradio's processing_utils.py into a CLI form.

Hello, I've also encountered the same issue. Could you please describe in detail how you resolved it?

This is my code:

from flask import Flask, request, jsonify
import torch
import torchaudio
from seamless_communication.models.inference import Translator

app = Flask(__name__)

# Initialize the Translator object
translator = Translator("seamlessM4T_large", vocoder_name_or_card="vocoder_36langs", device=torch.device("cuda:0"))

@app.route('/translate', methods=['POST'])
def translate_audio():
    if 'src_lang' not in request.form:
        return jsonify({'error': 'src_lang parameter is missing'}), 400
    if 'audio_file' not in request.files:
        return jsonify({'error': 'audio_file is missing'}), 400

    src_lang = "cmn"
    audio_file = request.files['audio_file']
    # Length limits
    MAX_INPUT_AUDIO_LENGTH = 60  # maximum input audio length, in seconds
    AUDIO_SAMPLE_RATE = 16000.0  # target sample rate, in Hz
    arr, org_sr = torchaudio.load(audio_file)
    new_arr = torchaudio.functional.resample(arr, orig_freq=org_sr, new_freq=AUDIO_SAMPLE_RATE)
    max_length = int(MAX_INPUT_AUDIO_LENGTH * AUDIO_SAMPLE_RATE)
    if new_arr.shape[1] > max_length:
        new_arr = new_arr[:, :max_length]
    torchaudio.save(audio_file, new_arr, sample_rate=int(AUDIO_SAMPLE_RATE), format="wav")
    try:
        # Run ASR
        transcribed_text, _, _ = translator.predict(new_arr, "asr", src_lang, sample_rate=AUDIO_SAMPLE_RATE)
        return jsonify({'transcribed_text': transcribed_text})
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=7860)

hjpr commented 9 months ago

I think you have two errors going on here. One is that you are attempting to use torchaudio.save before you perform inference with translator.predict; torchaudio.save should be the last thing you do, because you need to keep the audio in tensor form rather than pass a .wav to the predict function. Secondly, see below: when saving with torchaudio.save, you need to make sure the tensors are moved back to the CPU after resampling and so on.

Maybe a dev can shine a light on this, but I believe the problem is that we are misusing torchaudio.save. In the predict.py script in /scripts/m4t/predict you can see that at the very end, once the prediction is made, they save the output file using

torchaudio.save(args.output_path, wav[0].to(torch.float32).cpu(), sample_rate=sr)

Try removing your torchaudio.save line (new_arr will hold the tensor required for translator.predict). You actually don't need torchaudio.save because you are using ASR, and torchaudio.save is for converting a tensor back to a .wav.

lin-xiaosheng commented 9 months ago

Thank you for your detailed response! With your help, I have removed the unnecessary parts of the code; currently, my core code is as follows:

audio_file = request.files['audio_file']
AUDIO_SAMPLE_RATE = 16000.0
arr, org_sr = torchaudio.load(audio_file)
new_arr = torchaudio.functional.resample(arr, orig_freq=org_sr, new_freq=AUDIO_SAMPLE_RATE)
transcribed_text, _, _ = translator.predict(input=new_arr, task_str='ASR', tgt_lang='cmn', src_lang='cmn', ngram_filtering=True)

However, I am still encountering the same error as you:

{ "error": "Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size" }

Is there anything else I might not have done correctly? I look forward to your reply.

hjpr commented 9 months ago

Lin, you may want to try running the audio file through predict.py at /scripts/m4t/predict to see if it translates there, and then compare how predict.py handles the audio with how you are handling it. I was getting the error until I started using predict.py.

Today I'm going to go through predict.py's behavior and compare it to my own code to figure out what the difference is between the two approaches.

hjpr commented 9 months ago

Okay Lin, I figured it out!

I had assumed we needed to pass a tensor here...

transcribed_text, _, _ = translator.predict(new_arr, "asr", src_lang, sample_rate=AUDIO_SAMPLE_RATE)

In your case you are passing a tensor via new_arr, since that is what torchaudio.load and resample return.

But we actually do need to pass a .wav; I was mistaken yesterday. The issue with your original torchaudio.save was that you used it to save new_arr to a wav, but then kept referencing the original tensor, because new_arr was still what you passed to translator.predict.

So in short, make sure you first save your tensor as a .wav file using torchaudio.save (with an output path specified), and then pass THAT .wav file into translator.predict. I just ran a few successful tests after changing my code to reflect that. I have a temp folder I'm using to save and process files...

@app.get("/translate/v1/s2st/spa/", response_class=FileResponse)
async def process_audio() -> FileResponse:

    audio_file = 'temp/sample.wav'
    translator = Translator(MODEL_NAME, VOCODER_NAME, device, dtype)
    _, waveform, sample_rate = translator.predict(audio_file,
                                                  task_str="s2st",
                                                  tgt_lang="spa",
                                                  src_lang=None,
                                                  ngram_filtering=True)

    # Move the generated waveform back to the CPU before writing it out
    torchaudio.save("temp/converted.wav", waveform[0].to(torch.float32).cpu(), sample_rate=sample_rate)
    return "temp/converted.wav"

Adjust to your needs as necessary. Would probably look something like this...

if 'audio_file' not in request.files:
    return jsonify({'error': 'audio_file is missing'}), 400

src_lang = "cmn"
audio_file = request.files['audio_file']
# Length limits
MAX_INPUT_AUDIO_LENGTH = 60  # maximum input audio length, in seconds
AUDIO_SAMPLE_RATE = 16000.0  # target sample rate, in Hz
arr, org_sr = torchaudio.load(audio_file)
new_arr = torchaudio.functional.resample(arr, orig_freq=org_sr, new_freq=AUDIO_SAMPLE_RATE)
max_length = int(MAX_INPUT_AUDIO_LENGTH * AUDIO_SAMPLE_RATE)
if new_arr.shape[1] > max_length:
    new_arr = new_arr[:, :max_length]
# Write the resampled input to disk so predict gets a .wav path, not a tensor.
# torchaudio.save expects a 2-D (channels, time) tensor, so pass new_arr itself.
audio_path = "temp/translated.wav"
torchaudio.save(audio_path, new_arr.to(torch.float32).cpu(), sample_rate=int(AUDIO_SAMPLE_RATE))
try:
    # Run ASR
    transcribed_text, _, _ = translator.predict(audio_path, "asr", src_lang, sample_rate=AUDIO_SAMPLE_RATE)
    return jsonify({'transcribed_text': transcribed_text})
except Exception as e:
    return jsonify({'error': str(e)}), 500

I altered my code to help debug, so it isn't performing the upload portion via requests, and it's saving the file to disk as well as responding with a file download. I wanted to get it working first and then deal with the FastAPI-specific methods...

hjpr commented 9 months ago

Closing unless there are future issues resulting in the same error. In summary, translator.predict needs its input to be a .wav file; doing tensor operations and then passing the tensor, instead of saving it to a .wav and referencing that converted file, results in the error above.
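
For anyone landing here later, a minimal sketch of the pattern that ended up working, mirroring the snippets above (the model names, languages and file paths are just my setup, adjust as needed):

import torch
import torchaudio
from seamless_communication.models.inference import Translator

translator = Translator("seamlessM4T_large", vocoder_name_or_card="vocoder_36langs",
                        device=torch.device("cuda:0"))

# Load the source audio, resample to the 16 kHz the model expects, write it
# back out as a .wav, and hand the path (not the tensor) to predict.
arr, org_sr = torchaudio.load("temp/sample.wav")
arr = torchaudio.functional.resample(arr, orig_freq=org_sr, new_freq=16000)
torchaudio.save("temp/resampled.wav", arr.to(torch.float32).cpu(), sample_rate=16000)

transcribed_text, _, _ = translator.predict("temp/resampled.wav",
                                            task_str="asr",
                                            tgt_lang="cmn",
                                            src_lang="cmn",
                                            ngram_filtering=True)
print(transcribed_text)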

lin-xiaosheng commented 9 months ago

Yes, it does indeed work! Friend, as you said, both saving the file and referencing the converted file correctly are indispensable. Your solution was very comprehensive, and your patience, clear logic, and meticulous attitude are admirable. Your contributions not only resolved my immediate problem but also enriched my understanding of the underlying concepts. Thank you, hjpr, for your exceptional help; I am deeply grateful for your assistance!