facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

No documentation on how to use nllb #4547

Open Suhail opened 2 years ago

Suhail commented 2 years ago

📚 Documentation

Hey there,

I am trying to play with nllb but there isn't a basic code sample to try it.

I can download the checkpoint.pt but I am not sure what I would do afterwards.

I notice it's also not available on torch.hub

Can you provide a piece of example code to get a translation?

huihuifan commented 2 years ago

hi, could you take a look at the generation command example here? https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/modeling thanks!

Suhail commented 2 years ago

> hi, could you take a look at the generation command example here? https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/modeling thanks!

Hi, I took a look at the README and I see notes about training but I didn't see how one might do inference to get a translation.

In other situations I see you'll use torch.hub, load it, and call translate(). Is there something similar?

Maybe I missed something?

huihuifan commented 2 years ago

Can you control-f "Generation/Evaluation" in the readme I linked?

We'll look into torch.hub :)

Suhail commented 2 years ago

> Can you control-f "Generation/Evaluation" in the readme I linked?
>
> We'll look into torch.hub :)

It's probably not hard to find but this link is broken in that section: https://github.com/facebookresearch/flores/flores200

At least for me, I found that header a little difficult to grok. I think if it were called "Get a translation" or something, it'd be more plain spoken I suppose?

Anyway, I'll try to follow the instructions!

nicholas-entis commented 2 years ago

@Suhail Did you manage to get an example working?

Suhail commented 2 years ago

> @Suhail Did you manage to get an example working?

No - I decided to give up for now until someone makes something more accessible.

314esther commented 2 years ago

I agree. There needs to be an easy way to try out the translation feature given the checkpoint. I would like to see something of this nature:

m = load_model(checkpoint_path)
m.translate("Hello World", 'en', 'de')
Hallo Welt
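One way to picture the interface requested above is a thin wrapper around whatever backend actually runs the model. This is only a sketch: `Translator`, `SHORT_TO_FLORES`, and `echo_backend` are hypothetical names, not part of fairseq, and the stand-in backend does not translate anything. It only shows how a `translate(text, src, tgt)` call could map short codes to the longer language codes NLLB uses.

```python
from typing import Callable, Dict

# Hypothetical mapping from short codes to the FLORES-200 style codes NLLB uses.
SHORT_TO_FLORES: Dict[str, str] = {"en": "eng_Latn", "de": "deu_Latn"}


class Translator:
    """Thin wrapper exposing translate(text, src, tgt) over any backend."""

    def __init__(self, backend: Callable[[str, str, str], str]) -> None:
        # backend(text, src_code, tgt_code) -> translated text
        self._backend = backend

    def translate(self, text: str, src: str, tgt: str) -> str:
        # Accept either a short code ("en") or a full code ("eng_Latn").
        src_code = SHORT_TO_FLORES.get(src, src)
        tgt_code = SHORT_TO_FLORES.get(tgt, tgt)
        return self._backend(text, src_code, tgt_code)


def echo_backend(text: str, src_code: str, tgt_code: str) -> str:
    # Stand-in backend for demonstration only; it does not translate.
    return f"[{src_code}->{tgt_code}] {text}"


m = Translator(echo_backend)
print(m.translate("Hello World", "en", "de"))
```

A real backend would replace `echo_backend` with a function that runs the checkpoint; the wrapper itself would not need to change.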

314esther commented 2 years ago

In addition there are hard coded paths for data files that aren't documented. For instance: "/data/nllb/nllb/flores200.en_xx_en.v4.4.256k/data_bin/shard000/dict.ace_Arab.txt" I and other users likely don't have these files (or I couldn't find them).
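A small pre-flight check can at least make the missing files visible before a run fails on them. The sketch below uses only the standard library; `missing_files` is a hypothetical helper, not something fairseq provides, and the list contains only the one path quoted above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# The dictionary path quoted above; other entries would come from the config.
required = [
    "/data/nllb/nllb/flores200.en_xx_en.v4.4.256k/data_bin/shard000/dict.ace_Arab.txt",
]
for path in missing_files(required):
    print(f"missing: {path}")
```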

Please add some more accessible documentation or a simple example. There will also be a learning curve for some in using Hydra and setting up the config files - additional support or pointers will be useful to users who aren't experienced with this setup.

pluiez commented 2 years ago

@314esther @Suhail Hi, you can check here for a convenient script to run model inference from the command line without having to deal with the config files.

314esther commented 2 years ago

Thanks pluiez! This makes the inference capability much more usable for beginners. I know Hydra is a very capable tool, especially when training over many GPUs, but it's only familiar to a handful of people and the config files can really add a barrier to entry. There was one modification that I had to make to the fairseq code (nllb branch) in order to run the shell script: I needed to add a command line interface for spm_encode in setup.py. I'm leaving an issue on the NLLB-inference repo with more details (https://github.com/pluiez/NLLB-inference/issues/1).

pluiez commented 2 years ago

Hi, sorry, I didn't take this into consideration. I had assumed these tools were all pre-installed. I will list the required steps before running the script.

amrrs commented 2 years ago

Thanks @pluiez for your repo. I've made a video based on it giving code credits to you.

pluiez commented 2 years ago

@amrrs Thank you for sharing. Actually, I had hard-coded the language passed to normalize_punctuation.sh in translate.sh as zho_Hans. Although many languages share English (en) normalization under the hood, Tamil uses Hindi (hi). This has been fixed, so you might want to check out the latest version.
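The fix described above amounts to looking up the normalizer language per input language instead of hard-coding it. A minimal sketch of that lookup, not the actual code from translate.sh: `NORMALIZER_OVERRIDES` and `normalizer_lang` are hypothetical names, only the Tamil entry comes from this comment, and falling back to English for everything else is a simplifying assumption.

```python
# Only the Tamil -> Hindi entry comes from the comment above. Falling back to
# English for every other language is an assumption made for this sketch.
NORMALIZER_OVERRIDES = {"tam_Taml": "hi"}


def normalizer_lang(flores_code: str, default: str = "en") -> str:
    """Pick the language passed to the punctuation normalizer."""
    return NORMALIZER_OVERRIDES.get(flores_code, default)


for code in ("tam_Taml", "eng_Latn"):
    print(code, "->", normalizer_lang(code))
```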

amrrs commented 2 years ago

Oh, my bad, I didn't notice. Thank you for sharing it @pluiez, I'll check out the code.

geonm commented 2 years ago

Hi guys. I tested NLLB using huggingface transformers.

NOTE: To use the NLLB tokenizer, install the latest dev version with the command below.

$ pip install git+https://github.com/huggingface/transformers.git

then... test it!

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# available models: 'facebook/nllb-200-distilled-600M', 'facebook/nllb-200-1.3B', 'facebook/nllb-200-distilled-1.3B', 'facebook/nllb-200-3.3B'
model_name = 'facebook/nllb-200-distilled-600M'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

source = 'eng_Latn' # English
target = 'kor_Hang' # Korean
translator = pipeline('translation', model=model, tokenizer=tokenizer, src_lang=source, tgt_lang=target)

text = 'Hi, nice to meet you'

output = translator(text, max_length=400)

translated_text = output[0]['translation_text']

print(translated_text) # '안녕하세요, 반가워요'

The language codes are described in FLORES-200.
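The codes that appear in this thread (eng_Latn, kor_Hang, zho_Hans, ace_Arab) share one shape: a three-letter lowercase language tag, an underscore, and a four-letter script tag with an initial capital. The sketch below checks only that shape, which can catch a mistake like passing 'en' before the model is loaded. `looks_like_flores_code` is a hypothetical helper, and a code that passes is not guaranteed to be supported by a given model.

```python
import re

# Three lowercase letters, an underscore, then a capitalized four-letter script tag.
_CODE_SHAPE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_flores_code(code: str) -> bool:
    """Check only the shape of a code, not whether the model supports it."""
    return _CODE_SHAPE.fullmatch(code) is not None


for code in ("eng_Latn", "kor_Hang", "en", "ENG_latn"):
    print(code, looks_like_flores_code(code))
```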

Update: I made the huggingface space demo: https://huggingface.co/spaces/Geonmo/nllb-translation-demo

Python-37 commented 2 years ago

Hi, I am trying to train an NLLB model, but I haven't found any documentation on how to get the data_bin or data_conf, and I don't know how to format the training dataset. Could you please share your training steps? Could you also explain this section in more detail: https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/modeling#filtering-and-preparing-the-data

gordicaleksa commented 1 year ago

@Suhail in case it's still of any use I made a short tutorial on how to run this directly in fairseq: https://github.com/facebookresearch/fairseq/issues/5292

:)

qaixerabbas commented 7 months ago

hi @geonm thank you for the available model list.

# available models: 'facebook/nllb-200-distilled-600M', 'facebook/nllb-200-1.3B', 'facebook/nllb-200-distilled-1.3B', 'facebook/nllb-200-3.3B'

Can you please tell me where you got this information? I mean the names of these models. I wanted to try different models but could not find this information.

0wwafa commented 3 months ago

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tqdm import tqdm  # needed for the progress bar in the loop below
import torch

tokenizer = AutoTokenizer.from_pretrained(
        "facebook/nllb-200-distilled-600M",  src_lang="eng_Latn")

print("Loading model")
model = AutoModelForSeq2SeqLM.from_pretrained("ychenNLP/nllb-200-3.3b-easyproject")
model.cuda()

input_chunks = ["A translator always risks inadvertently introducing source-language words, grammar, or syntax into the target-language rendering."]
print("Start translation...")
output_result = []

batch_size = 1
for idx in tqdm(range(0, len(input_chunks), batch_size)):
    start_idx = idx
    end_idx = idx + batch_size
    inputs = tokenizer(input_chunks[start_idx: end_idx], padding=True, truncation=True, max_length=128, return_tensors="pt").to('cuda')

    with torch.no_grad():
        # convert_tokens_to_ids also works on transformers releases that dropped lang_code_to_id
        translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hans"),
                        max_length=128, num_beams=5, num_return_sequences=1, early_stopping=True)

    output = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
    output_result.extend(output)
print(output_result)
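The start/end index bookkeeping in the loop above can be pulled out into a small generator, which makes it easier to raise batch_size later. This is a sketch with a hypothetical helper name, `batches`; it uses placeholder strings so it runs without a model or GPU.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


sentences = ["one", "two", "three", "four", "five"]
for batch in batches(sentences, 2):
    print(batch)
```

Each yielded list could be passed to the tokenizer in place of `input_chunks[start_idx: end_idx]`.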