facebookresearch / seamless_communication

Foundational Models for State-of-the-Art Speech and Text Translation

maximum length for t2tt #78

Open maherr13 opened 10 months ago

maherr13 commented 10 months ago

Hi, thanks for the great effort. I tried t2tt translation on a long text; the result was translated fine but seemed like a summary of the text. What is the maximum sequence length for the input text to get an exact translation?

I would also like to ask if there is a function that supports batch inference.

Many Thanks

kauterry commented 10 months ago

Regarding batch inference:

https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/models/inference/translator.py#L130-L136 is where we call the generator.

And as you can see here: https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/models/unity/generator.py#L146-L147

Would you like us to provide a method that supports batching for a custom dataset, or for a list of input texts or audio paths?

Currently we have the Speech2SpeechFleursDatasetBuilder: https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/datasets/huggingface.py#L28, which gives you an iterator over the FLEURS HF dataset. Do you want batching support for this?
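For context, that builder wraps the FLEURS dataset hosted on the Hugging Face Hub. A plain datasets sketch of iterating it directly, not the repo's builder API, and with the config name "fr_fr" chosen only as an example:

from datasets import load_dataset

# Stream the French FLEURS test split without downloading it all up front.
fleurs = load_dataset("google/fleurs", "fr_fr", split="test", streaming=True)
for sample in fleurs:
    print(sample["transcription"])  # text field; the audio lives under sample["audio"]
    break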

maherr13 commented 10 months ago

Exactly. What I was looking for is a function whose input could be a list of texts, enabling the model to process them in parallel for better speed.

elbayadm commented 10 months ago

@maherr13 max_seq_len in the T2TT model is set to 1024 subword tokens (see the NLLB dense_1b config). That said, sentence-level MT training data is usually short (on average <50 tokens per sentence). If you want to translate long input, I'd recommend splitting it into sentences and then concatenating the translations (a sketch of this is given at the end of this comment).

import torch
from seamless_communication.models.inference import Translator
translator_large = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

input1 = "A train is a series of connected vehicles that run along a railway track and transport people or freight. Trains are typically pulled or pushed by locomotives (often known simply as engines), though some are self-propelled, such as multiple units. Passengers and cargo are carried in railroad cars, also known as wagons."

input2 = "Trains are designed to a certain gauge, or distance between rails. Most trains operate on steel tracks with steel wheels, the low friction of which makes them more efficient than other forms of transport."

input3 = "Trains have their roots in wagonways, which used railway tracks and were powered by horses or pulled by cables. Following the invention of the steam locomotive in the United Kingdom in 1804, trains rapidly spread around the world, allowing freight and passengers to move over land faster and cheaper than ever possible before. Rapid transit and trams were first built in the late 1800s to transport large numbers of people in and around cities."

If I translate the split inputs, the translations are accurate:

for text in [input1, input2, input3]:
    translated_text, _, _ = translator_large.predict(text, "t2tt", 'fra', src_lang='eng')
    print(translated_text)

input 1 into French: Un train est une série de véhicules connectés qui circulent le long d'une voie ferrée et transportent des personnes ou des marchandises. Les trains sont généralement tirés ou poussés par des locomotives (souvent appelées simplement moteurs), bien que certains soient autopropulsés, tels que des unités multiples. Les passagers et les marchandises sont transportés dans des wagons, également connus sous le nom de wagons.

input 2 into French: Les trains sont conçus pour un certain écartement, ou distance entre les rails. La plupart des trains circulent sur des voies en acier avec des roues en acier, dont la faible friction les rend plus efficaces que les autres formes de transport.

input 3 into French: Les trains ont leurs racines dans les wagons, qui utilisaient des voies ferrées et étaient alimentés par des chevaux ou tirés par des câbles. Après l'invention de la locomotive à vapeur au Royaume-Uni en 1804, les trains se sont rapidement répandus dans le monde entier, permettant aux marchandises et aux passagers de se déplacer sur terre plus rapidement et moins cher que jamais auparavant. Le transport rapide et les tramways ont été construits à la fin des années 1800 pour transporter un grand nombre de personnes dans et autour des villes.

Whereas translating everything as a single input yields this short translation:

Full input into French: Les trains ont leurs racines dans les wagons, qui utilisaient des voies ferrées et étaient alimentés par des chevaux ou tirés par des câbles. Après l'invention de la locomotive à vapeur au Royaume-Uni en 1804, les trains se sont rapidement répandus dans le monde entier, permettant aux passagers et aux marchandises de se déplacer sur la terre plus rapidement et moins cher que jamais. Les tramways et les tramways rapides ont été construits à la fin des années 1800 pour transporter des gens et des marchandises en grand nombre dans les villes.

This is actually not summarisation: the model is only translating the last part of the input. We could force the model during beam search to maximise coverage of the source sentences, but our current code does not support such a search. See https://aclanthology.org/D18-1342.pdf for some re-scoring methods, including coverage.
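As a concrete illustration of the split-then-concatenate workaround, here is a minimal sketch; the regex splitter and the helper name translate_long_text are mine, not part of the library:

import re

# Naive sentence splitter: break after ., !, or ?. Swap in a proper
# segmenter for production use.
def translate_long_text(translator, text, tgt_lang, src_lang):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    translations = []
    for sentence in sentences:
        translated, _, _ = translator.predict(sentence, "t2tt", tgt_lang, src_lang=src_lang)
        translations.append(str(translated))
    return " ".join(translations)

# e.g. translate_long_text(translator_large, " ".join([input1, input2, input3]), "fra", "eng")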

maherr13 commented 10 months ago

@elbayadm Thanks a lot. After a few tries with t2tt, I found that it actually yielded better results than the NLLB project, especially for low-resource languages. Can you elaborate on the difference between them, or on why Seamless outperforms NLLB in these cases?

maherr13 commented 10 months ago

@kauterry I modified the predict function so that it accepts a list as input, then replaced the line that builds the source batch with:

results = []
for text in input:
    # Tokenize and collate each text individually.
    encoded = self.collate(self.token_encoder(text))
    results.append(encoded)

src = {
    'is_ragged': False,
    'seqs': torch.cat([result['seqs'] for result in results], dim=0),
    'seq_lens': torch.cat([result['seq_lens'] for result in results], dim=0),
}

and return the full results and parsed it in my code.

MagicMuscleMan commented 9 months ago

I created another Python file (t2tt_lbl.py) which translates each line separately without reinitializing the Translator again and again (which speeds things up).

# Perform text-to-text translation with SeamlessM4T line-by-line read from stdin and written to stdout

import sys
import torch
import argparse
from seamless_communication.models.inference import Translator

def main():
    parser = argparse.ArgumentParser(
        description="Line-by-line M4T text-to-text inference using Translator."
    )
    parser.add_argument(
        "tgt_lang", type=str, help="Target language to translate into."
    )
    parser.add_argument(
        "--src_lang",
        type=str,
        help="Source language of the input text.",
        default=None,
    )
    parser.add_argument(
        "--model_name",
        type=str,
        help="Base model name (`seamlessM4T_medium`, `seamlessM4T_large`)",
        default="seamlessM4T_large",
    )
    parser.add_argument(
        "--vocoder_name", type=str, help="Vocoder name", default="vocoder_36langs"
    )

    args = parser.parse_args()

    # Prefer GPU with fp16 when available; otherwise fall back to CPU with fp32.
    if torch.cuda.is_available():
        device = torch.device("cuda:0")
        dtype = torch.float16
    else:
        device = torch.device("cpu")
        dtype = torch.float32

    translator = Translator(args.model_name, args.vocoder_name, device, dtype)

    # Translate each stdin line with the already-loaded Translator.
    for line in sys.stdin:
        translated_text, _, _ = translator.predict(line.rstrip(), "t2tt", args.tgt_lang, src_lang=args.src_lang)
        print(translated_text)

if __name__ == "__main__":
    main()

This reads from stdin and writes to stdout, and can be used like

echo -e "This is the first sentence. This is the second sentence" | rg '(\. )' -r '.\n' | python3 t2tt_lbl.py --src_lang $SOURCE_LANG $TARGET_LANG | tr '\n' ' '

i.e. split the content into lines so that no line hits the limit of 1024 subword tokens, translate each line, and join the lines back together. (A real pipeline should probably also preserve pre-existing paragraph breaks, but you can easily modify the example yourself.)

aliencaocao commented 4 months ago

> @kauterry I modified the predict function so that it accepts a list as input, then replaced the line that builds the source batch with: [...]

How did you pad the inputs? I'm getting a torch.cat error because my inputs have different lengths.
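For what it's worth, one way to fix that torch.cat error is to right-pad every token sequence to the longest one before stacking. A minimal sketch, assuming results is the list of collated dicts from the snippet above (each with a 2-D 'seqs' and a 1-D 'seq_lens') and that the tokenizer's pad index is available as pad_idx; this is an assumption, not the library's official batching API:

import torch
from torch.nn.utils.rnn import pad_sequence

# Drop the batch dimension of each (1, seq_len) tensor so pad_sequence
# can pad the ragged list to a common length.
seqs = [result['seqs'].squeeze(0) for result in results]

src = {
    'is_ragged': False,
    # Right-pad with the tokenizer's pad index (pad_idx is assumed here).
    'seqs': pad_sequence(seqs, batch_first=True, padding_value=pad_idx),
    'seq_lens': torch.cat([result['seq_lens'] for result in results], dim=0),
}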