AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Getting <unk> token occasionally in output #48

Closed kurianbenoy closed 5 months ago

kurianbenoy commented 6 months ago

Hello team,

Thanks for building this wonderful open-source project. I sometimes notice that the output is returned with <unk> tokens.

I got this when translating a Malayalam article to English.

[screenshot: translated output containing <unk> tokens]

The results are also not always deterministic: sometimes the output contains <unk> tokens and sometimes it doesn't. How can I always get deterministic results?

jaygala24 commented 6 months ago

Hey @kurianbenoy

Can you share details about the hyperparameters used for decoding along with commands used for running the script?

kurianbenoy commented 6 months ago

Hey @jaygala24 ,

This is the code I used for inference. As you can see in model.generate below, I used hyperparameters such as min_length, max_length, num_beams, and num_return_sequences. Is there any parameter that ensures I always get deterministic output?

import torch

def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip, device):
    BATCH_SIZE = 16

    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            src=True,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(device)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text
        generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations

PranjalChitale commented 6 months ago

@kurianbenoy

This is straight from the example script and doesn't help us debug the issue.

Please provide the following basic details to proceed further.

  1. Input text (plain text format)
  2. Translation generated by the model (plain text format)
  3. Model that is being used and details about quantization.

The outputs are indeed deterministic if you don't do sampling. You can re-run the example any number of times and you will always get the same outputs.
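
For reference, a minimal sketch of what the deterministic decoding path looks like, assuming the same model and inputs as in the batch_translate snippet above (do_sample is the standard transformers flag and already defaults to False, so plain beam search gives identical outputs on every run):

with torch.no_grad():
    generated_tokens = model.generate(
        **inputs,
        use_cache=True,
        min_length=0,
        max_length=256,
        num_beams=5,
        num_return_sequences=1,
        do_sample=False,  # beam search only, no sampling, so repeated runs are identical
    )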

Also, looking at your image, it seems you are trying to translate a large chunk of text using batch_translate.

Note that IndicTrans2 is trained with a max_seqlen of 256 tokens, so it is not surprising that the outputs for such long passages would be suboptimal.

In such cases, the right way to translate is to use any sentence splitter of your choice, segment the paragraph into its constituent sentences, run batch_translate on that batch of sentences, and rejoin the translated sentences to reconstruct the paragraph.

You can find an example script that does this here.
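
For illustration, a minimal sketch of that split-translate-rejoin flow, reusing the batch_translate function above. The NLTK sent_tokenize call is only a stand-in splitter (it needs NLTK's punkt data and is not tuned for Indic scripts) and is not necessarily the splitter used by the official script:

from nltk.tokenize import sent_tokenize  # stand-in splitter; pick one suited to the source language

def translate_paragraph(paragraph, src_lang, tgt_lang, model, tokenizer, ip, device):
    # Segment the paragraph into individual sentences.
    sentences = sent_tokenize(paragraph)

    # Translate the sentences as a single batch with the batch_translate function above.
    translated = batch_translate(sentences, src_lang, tgt_lang, model, tokenizer, ip, device)

    # Rejoin the translated sentences to reconstruct the paragraph.
    return " ".join(translated)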

kurianbenoy commented 6 months ago

Hello @PranjalChitale, I will try using a sentence splitter of my choice and pass the results as chunks of sentences.

I am doubtful about the deterministic nature of the translation output you describe. Please check the outputs for the input text I give below to see why I say that.

kurianbenoy commented 6 months ago

Please provide the following basic details to proceed further.

  1. Input text (plain text format)

This article is from a leading news provider in Kerala. Please find the article here:

https://gist.github.com/kurianbenoy/9b12b76f7aebab3b2c9e684999f8bc05

  2. Translation generated by the model (plain text format)

I once got this output with <unk> tokens:

[screenshot: translated output containing <unk> tokens]

Yet now I don't get it anymore, which confuses me a lot because the model is not behaving deterministically.

[screenshot: the same input now translated without <unk> tokens]

  3. Model that is being used and details about quantization.

I am using the best model, ai4bharat/indictrans2-indic-en-1B, with no quantization config:

from transformers import AutoModelForSeq2SeqLM

indic_en_ckpt_dir = "ai4bharat/indictrans2-indic-en-1B"
qconfig = None
model = AutoModelForSeq2SeqLM.from_pretrained(
    indic_en_ckpt_dir,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    quantization_config=qconfig,
    revision="74a49d37c8bfd517d2cc04db8cd1baf8cef1fe7a",
)

jaygala24 commented 6 months ago

Hi @kurianbenoy

I was able to translate the article you shared in this issue correctly when it was fragmented at the sentence level. I would like to point out that the IndicTrans2 models are trained on sentence-level translation pairs, so the model may produce unexpected translations when the entire paragraph is passed as input. I have modified the example.py (link) script to also support paragraph-level translations in the latest commit.

Regarding the deterministic nature of the translations: I am able to reproduce the same translations across multiple runs with the same article as input. You can refer to the translation in output.txt attached to this reply. I suspect that you are using sampling-based decoding, which might lead to non-deterministic outputs.

output.txt
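
As a quick way to rule sampling in or out on your side, you can inspect the model's generation defaults and pass do_sample=False explicitly; this is a generic transformers-level check, not something specific to the IndicTrans2 scripts:

# Defaults used by generate() when no overrides are passed.
print(model.generation_config.do_sample)  # False means greedy/beam search, i.e. deterministic
print(model.generation_config.num_beams)

# Force deterministic decoding regardless of the config defaults.
generated_tokens = model.generate(**inputs, num_beams=5, max_length=256, do_sample=False)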

One difference I notice in your model loading is the use of the revision argument. Please try once after removing that argument.

kurianbenoy commented 6 months ago

Hi @jaygala24, thanks for updating example.py so it supports paragraph-level translations.

One difference I notice in your model loading is the use of the revision argument. Please try once after removing that argument.

I can try that as well. However, I pass it for stability, so that the model is pinned to the correct git commit in the Hugging Face repository. If you update the model repository on Hugging Face, a pinned revision won't pick up the change, whereas without the revision argument the latest commit of the model repository is used.
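
For clarity, the two loading modes being contrasted look like this (standard from_pretrained behaviour; the commit hash is the one from the earlier snippet):

from transformers import AutoModelForSeq2SeqLM

# Pinned: always resolves to this exact commit, even if the Hub repo is updated later.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-en-1B",
    trust_remote_code=True,
    revision="74a49d37c8bfd517d2cc04db8cd1baf8cef1fe7a",
)

# Unpinned: resolves to the latest commit on the default branch at download time.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-en-1B",
    trust_remote_code=True,
)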

I suspect that you are using sampling-based decoding, which might lead to non-deterministic outputs.

I was using the same example.py code both when I got the <unk> tokens and when the output was correct, as in what I shared. The output you shared was exactly the same as mine.

I didn't change even one line of my code. That's why I suspect something else is sporadically leading to non-deterministic output. Strangely, I encountered this only two weeks ago, and it was also noticed by another user. Yet in my testing last week I no longer got the output with <unk> tokens. So this bug is a bit strange to reproduce.

jaygala24 commented 5 months ago

Yes, I know that revision pins the model to a specific git commit in the Hugging Face repository. However, I would like to inform you that we won't be releasing any updates to the IndicTrans2 checkpoints on Hugging Face.

I suspect that one of the following reasons might be behind the occurrence of <unk> tokens in the translations:

It looks like the issue has been resolved based on your recent response, so I'll close this issue for now. Feel free to re-open the issue in case of any further queries. Thank you!