AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Losing Formatting post translation #72

Closed. samayra2029 closed this issue 5 months ago

samayra2029 commented 5 months ago

When I translate a paragraph from English to Hindi, as shown in example.py, the formatting is lost. Is there a way to preserve the source markdown formatting? The text I am translating is the output of an LLM.

en_text = "Here is a paragraph that is written in Hindi.\nHindi is a beautiful language and the national language of India.\nIt has evolved over different periods and is spoken particularly in the Indian subcontinent. The literature, culture and history of the Hindi language are also very proud."
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
hi_translated_text = translate_paragraph(
    en_text, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip
)

Output:
eng_Latn: Here is a paragraph that is written in Hindi. Hindi is a beautiful language and the national language of India. It has evolved over different periods and is spoken particularly in the Indian subcontinent. The literature, culture and history of the Hindi language are also very proud.
hin_Deva: यहाँ एक अनुच्छेद है जो हिंदी में लिखा गया है। हिंदी एक सुंदर भाषा और भारत की राष्ट्रीय भाषा है। यह विभिन्न अवधियों में विकसित हुआ है और विशेष रूप से भारतीय उपमहाद्वीप में बोली जाती है। हिंदी भाषा का साहित्य, संस्कृति और इतिहास भी बहुत गौरवशाली है

PranjalChitale commented 5 months ago

By default, translate_paragraph splits the text into individual sentences using standard sentence tokenizers, translates each sentence separately, and then joins them with spaces. This method is recommended for general use, where the primary goal is to segment the paragraph and obtain the best possible translation by combining sentence-level translations, without focusing much on preserving the structure.

Creating a generic function to ensure structure preservation is not feasible, as each use case may have specific delimiters or no delimiter at all, leading to long-context inputs that the model cannot support.

To maintain the structure in your case, you can split the text by lines using "\n" or any specific delimiter in your document, translate each resulting segment, and then rejoin the translations using "\n" or that delimiter.
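A minimal sketch of the split, translate, and rejoin approach described above. The helper name `translate_preserving_lines` and the `translate_fn` callable are hypothetical and not part of IndicTrans2; `fake_translate` is a stand-in so the sketch runs without loading a model.

```python
from typing import Callable


def translate_preserving_lines(
    text: str,
    translate_fn: Callable[[str], str],
    delimiter: str = "\n",
) -> str:
    """Split on `delimiter`, translate each non-blank piece, rejoin with `delimiter`."""
    translated = []
    for piece in text.split(delimiter):
        if piece.strip():
            translated.append(translate_fn(piece))
        else:
            # Blank pieces (e.g. empty lines between paragraphs) are kept as-is.
            translated.append(piece)
    return delimiter.join(translated)


# Stand-in translator (an assumption for illustration only); a real setup
# would call the translation model here instead.
def fake_translate(s: str) -> str:
    return s.upper()


result = translate_preserving_lines("first line\n\nsecond line", fake_translate)
print(repr(result))
```

With the setup from the question, `translate_fn` could be something like `lambda s: translate_paragraph(s, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)`. Markdown markers inside a line (such as a leading `#` or `-`) are still passed to the model, so this sketch only preserves line structure.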

Be aware that this approach might result in multiple sentences being combined into a single input or incomplete sentences being used as inputs. Translation quality may suffer with such long-context or incomplete inputs, since the model is designed for sentence-level translation with a maximum sequence length of 256 tokens and was trained on complete sentence-level bitext pairs.
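One way to spot risky inputs before translating is to flag lines that exceed the 256-token limit. The sketch below is illustrative only: `find_long_lines` and `count_tokens` are hypothetical names, and the whitespace word count is a crude stand-in for the model's own tokenizer, which would normally produce more tokens per line.

```python
from typing import Callable, List, Tuple

MAX_TOKENS = 256  # maximum sequence length stated in the thread


def find_long_lines(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> List[Tuple[int, int]]:
    """Return (line_index, token_count) for every line above max_tokens."""
    too_long = []
    for i, line in enumerate(text.split("\n")):
        n = count_tokens(line)
        if n > max_tokens:
            too_long.append((i, n))
    return too_long


# Crude stand-in (an assumption): counts whitespace-separated words.
# A real check would count subword tokens with the model's tokenizer.
def whitespace_count(line: str) -> int:
    return len(line.split())


sample = "short line\n" + "word " * 300
flagged = find_long_lines(sample, whitespace_count)
print(flagged)
```

Flagged lines could then be split further, for example into sentences, before being translated and rejoined.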

samayra2029 commented 5 months ago

Thanks @PranjalChitale.

One question: is there a placeholder that the translation will leave untouched? I am mainly interested in English to Indic languages. If so, I could insert that placeholder before sending the text for translation and then replace it with "\n" afterwards.

PranjalChitale commented 5 months ago

I suggested splitting the text using "\n" or another delimiter of your choice; I did not mention using placeholders. There is no guarantee that placeholders will be retained, even though empirically we do observe some placeholders being preserved. Moreover, using placeholders does not solve the length issue, as you are still limited by the maximum sequence length.

In my opinion, the simplest and most effective solution is to split the text on "\n", translate and then rejoin it with "\n". This method best preserves the structure with the current models.