AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License
217 stars, 59 forks

How to handle large documents? #34

Closed: kdcyberdude closed this issue 8 months ago

kdcyberdude commented 8 months ago

I'm looking to translate large documents into English, but I'm running into the maximum sequence length of 256 tokens. In some cases, even after splitting the document into sentences, individual sentences are still longer than 256 tokens. Splitting may also lose document-level context. Could you suggest a way to handle this effectively?

@jaygala24

prajdabre commented 8 months ago

Hi, IndicTrans2 was trained for sentence-level translation, so passing whole documents won't work. The best you can do is break documents into sentences based on punctuation, translate the segments, and reassemble them.
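
A minimal sketch of the split, translate, and reassemble approach described above. It is not IndicTrans2's own code: `translate_batch` is a hypothetical callable standing in for whatever sentence-level translation function you use, and `max_words` is an assumed, rough word-count proxy for the 256-token limit (the real limit depends on the tokenizer). The helper names `split_sentences`, `chunk_long`, and `translate_document` are illustrative.

```python
import re
from typing import Callable, List

# Split after sentence-ending punctuation: Latin ". ! ?" plus the
# Devanagari danda and double danda used by many Indic scripts.
_SENT_END = re.compile(r"(?<=[.!?।॥])\s+")


def split_sentences(text: str) -> List[str]:
    """Break text into sentences on punctuation followed by whitespace."""
    return [s.strip() for s in _SENT_END.split(text) if s.strip()]


def chunk_long(sentence: str, max_words: int) -> List[str]:
    """Cut an over-long sentence into pieces of at most max_words words.

    Word count is only a proxy for token count; this is an assumption,
    not how the model measures length.
    """
    words = sentence.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def translate_document(
    text: str,
    translate_batch: Callable[[List[str]], List[str]],  # hypothetical translator
    max_words: int = 100,  # assumed safety margin below the 256-token limit
) -> str:
    """Split a document, translate each segment, and join the results in order."""
    segments: List[str] = []
    for sentence in split_sentences(text):
        segments.extend(chunk_long(sentence, max_words))
    if not segments:
        return ""
    return " ".join(translate_batch(segments))
```

Two caveats: cutting a long sentence mid-clause will likely hurt translation quality for that sentence, and joining with single spaces discards the original paragraph breaks. Splitting per paragraph first and rejoining with newlines would preserve the layout.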

jaygala24 commented 8 months ago

Hi @kdcyberdude

IndicTrans2 currently supports sentence-level translation, as mentioned by my colleague @prajdabre. You can refer to the model.translate_paragraph usage in the inference section. We will add paragraph translation support to the Hugging Face example in the coming week. Thank you!