AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Context Window Limited to 512 tokens. #53

Closed h2210316651 closed 5 months ago

h2210316651 commented 5 months ago

Dear AI4Bharat Team,

I'm writing to report a limitation I've encountered while using the IndicTrans2 model for machine translation tasks. The current context window of 512 characters hinders its applicability in real-world scenarios that often involve longer text passages.

IndicTrans2's ability to translate between English and Indian languages, as well as between Indic languages themselves, is a valuable contribution. However, the limited context window restricts the model's ability to capture the full context of longer sequences, potentially leading to inaccurate or nonsensical translations.

I would like to request the consideration of releasing models with a higher context window size. Ideally, a window size of 32,000 characters would significantly improve the model's capabilities for real-world tasks.

I understand that increasing the context window size might come with computational costs. However, the ability to handle longer sequences would greatly enhance the usability and effectiveness of IndicTrans2.

Thank you for your time and consideration.

prajdabre commented 5 months ago

Hello,

Thanks for using IndicTrans2. The 512-character limit (roughly 256 subwords) you mention is clearly documented, along with other limitations, in the limitations section of the research paper, and we recommend reading that section for more information. If you look through the other issues raised about sequence length, you will find workarounds for this. A naive solution is to break the document into sentences, translate the sentences, and rejoin them (a rough sketch is given below). The issues on translating documents are particularly helpful.
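For illustration, here is a minimal sketch of that sentence-by-sentence workflow. It loosely follows the Hugging Face inference example in this repository's README; the checkpoint name, the IndicTransToolkit helpers, and the use of NLTK for English sentence splitting are assumptions you should adapt to your own setup (for Indic-language sources, use an Indic-aware splitter such as the one in indic_nlp_library), and the exact decode call should be checked against the README for your tokenizer version.

```python
# Minimal sketch (not an official recipe): split a long English document into
# sentences, translate each batch with an IndicTrans2 checkpoint, and rejoin.
# Checkpoint name, IndicTransToolkit helpers, and NLTK splitting are assumptions;
# adapt them to the setup described in the README.
import nltk
import torch
from nltk.tokenize import sent_tokenize              # assumed English sentence splitter
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor         # assumed helper, as in the README example

nltk.download("punkt", quiet=True)

MODEL = "ai4bharat/indictrans2-en-indic-dist-200M"   # assumed HF checkpoint name
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL, trust_remote_code=True).to(DEVICE).eval()
ip = IndicProcessor(inference=True)


def translate_document(document: str, src_lang: str = "eng_Latn",
                       tgt_lang: str = "hin_Deva", batch_size: int = 8) -> str:
    """Translate a document sentence by sentence so each input stays well
    within the model's maximum sequence length."""
    sentences = sent_tokenize(document)
    translations = []
    for i in range(0, len(sentences), batch_size):
        chunk = sentences[i:i + batch_size]
        # Add the "src_lang tgt_lang" tags and normalisation the model expects.
        batch = ip.preprocess_batch(chunk, src_lang=src_lang, tgt_lang=tgt_lang)
        inputs = tokenizer(batch, padding="longest", truncation=True,
                           max_length=256, return_tensors="pt").to(DEVICE)
        with torch.no_grad():
            out = model.generate(**inputs, num_beams=5, max_length=256)
        # Decode as in the repo's README (older tokenizer versions wrap this
        # in tokenizer.as_target_tokenizer()).
        decoded = tokenizer.batch_decode(out, skip_special_tokens=True)
        # Undo the tagging/normalisation applied during preprocessing.
        translations.extend(ip.postprocess_batch(decoded, lang=tgt_lang))
    # Rejoin the translated sentences into one document.
    return " ".join(translations)


if __name__ == "__main__":
    text = ("IndicTrans2 covers 22 scheduled Indian languages. "
            "Long documents can be handled sentence by sentence.")
    print(translate_document(text))
```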

Regarding long context for MT, we are aware of work showing that up to 3 or 4 preceding sentences (up to roughly 1,024 subwords) help improve document-level MT, but we are not aware of work demonstrating benefits from a 32,000-token context. We do know that such long contexts help LLMs that deal with long conversations and discourse, but for MT we have not seen comparable evidence. If you know of such work, we would appreciate pointers to it.

As for the 32,000-token context length, as you note, it is computationally expensive and would require substantial compute, which we sadly do not have. However, we welcome anyone to extend the model with such capabilities and open a pull request.

Regards.