facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

Sequential Execution of translator.predict() in Multithreaded Environment #444

Open lin-xiaosheng opened 1 month ago

lin-xiaosheng commented 1 month ago

Problem Description

I've wrapped the core inference logic of the seamless_communication library, specifically its translator.predict() method, in reusable HTTP interfaces. However, even with this wrapper, multiple requests are not processed concurrently: they execute sequentially, which significantly hurts system throughput and response time.

Relevant Code Snippet
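
For context, the route handler below depends on a few objects created once at application startup. The following is a simplified sketch of that setup: the model and vocoder names follow the project README, while process_audio_stream and LANGUAGE_NAME_TO_CODE are my own helpers, abbreviated here.

import logging
from io import BytesIO

import torch
import torchaudio
from flask import Flask, request, jsonify, send_file
from seamless_communication.inference import Translator

app = Flask(__name__)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if device.type == "cuda" else torch.float32

# A single Translator instance shared by every request-handler thread.
translator = Translator("seamlessM4T_v2_large", "vocoder_v2", device, dtype=dtype)

# Abbreviated; my real mapping covers all supported languages.
LANGUAGE_NAME_TO_CODE = {"English": "eng", "Mandarin Chinese": "cmn"}

def process_audio_stream(stream):
    # Decode the upload and resample to the 16 kHz the model expects.
    waveform, sample_rate = torchaudio.load(BytesIO(stream.read()))
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    return waveform, 16000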

@app.route('/translate/s2st', methods=['POST'])
def translate_s2st():
    data = request.form
    audio_data = request.files.get('audio')
    if audio_data is None:
        return jsonify({"error": "No audio data provided"}), 400

    source_language = data.get('source_language')
    target_language = data.get('target_language')

    if not all([source_language, target_language]):
        return jsonify({"error": "Missing required parameters"}), 400

    try:
        # Decode the uploaded audio and move it to the model's device/dtype.
        waveform, _ = process_audio_stream(audio_data.stream)
        audio_tensor = waveform.to(device=device, dtype=dtype)

        source_language_code = LANGUAGE_NAME_TO_CODE[source_language]
        target_language_code = LANGUAGE_NAME_TO_CODE[target_language]

        # This is the call where requests appear to serialize.
        out_texts, out_audios = translator.predict(
            input=audio_tensor,
            task_str="S2ST",
            src_lang=source_language_code,
            tgt_lang=target_language_code,
        )
        out_text = str(out_texts[0])

        # Encode the synthesized waveform to MP3 in memory.
        audio_buffer = BytesIO()
        torchaudio.save(
            audio_buffer,
            out_audios.audio_wavs[0][0].to(torch.float32).cpu(),
            out_audios.sample_rate,
            format="mp3",
        )
        audio_buffer.seek(0)

        response = {
            "translated_text": out_text,
        }  # NOTE: currently unused; only the audio file is returned below.

        return send_file(
            audio_buffer,
            mimetype="audio/mpeg",
            as_attachment=True,
            download_name="translated_audio.mp3",
        )

    except Exception as e:
        logging.error(f"Translation error: {str(e)}")
        return jsonify({"error": str(e)}), 500

Result: All tasks complete sequentially rather than concurrently.
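
For reference, this is roughly how I exercise the endpoint concurrently (test sketch; it assumes the server listens on localhost:5000 and that a local sample.wav exists):

import time
from concurrent.futures import ThreadPoolExecutor

import requests

def call_s2st(i):
    # Each call uploads the same sample and times the round trip.
    start = time.time()
    with open("sample.wav", "rb") as f:
        resp = requests.post(
            "http://localhost:5000/translate/s2st",
            files={"audio": f},
            data={"source_language": "English", "target_language": "Mandarin Chinese"},
        )
    print(f"request {i}: status={resp.status_code}, elapsed={time.time() - start:.1f}s")

# Four overlapping requests; the elapsed times grow roughly linearly,
# which is how the sequential behavior shows up.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(call_s2st, range(4)))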

Expected Behavior

I would expect to exploit the GPU's parallel-processing capability by running multiple translations concurrently.

Actual Results

Only one request is processed at a time, leading to increased overall execution time.
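
For completeness, the service is launched with Flask's threaded development server (names as in the setup sketch above), so the route itself is entered by multiple threads; the serialization appears to happen inside translator.predict():

if __name__ == "__main__":
    # threaded=True is Flask's default since 1.0; each request gets its
    # own handler thread, yet predict() calls still complete one by one.
    app.run(host="0.0.0.0", port=5000, threaded=True)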

Can you please suggest the modifications or configuration needed in my code so that translator.predict() can handle concurrent translations effectively?
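
For illustration, one direction I have considered is keeping a small pool of Translator replicas and checking one out per request, as sketched below (untested; NUM_REPLICAS and predict_with_pool are hypothetical names, and whether two replicas can actually overlap execution on a single GPU is exactly what I am unsure about):

import queue

NUM_REPLICAS = 2  # hypothetical; each replica costs a full copy of the model on the GPU

translator_pool = queue.Queue()
for _ in range(NUM_REPLICAS):
    translator_pool.put(
        Translator("seamlessM4T_v2_large", "vocoder_v2", device, dtype=dtype)
    )

def predict_with_pool(audio_tensor, src_lang, tgt_lang):
    replica = translator_pool.get()  # blocks until a replica is free
    try:
        return replica.predict(
            input=audio_tensor,
            task_str="S2ST",
            src_lang=src_lang,
            tgt_lang=tgt_lang,
        )
    finally:
        translator_pool.put(replica)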

If you need any further information or clarification, feel free to reply. Looking forward to your assistance in optimizing our application performance!

Thanks!