AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

how to translate multi-line strings #42

Closed StephennFernandes closed 6 months ago

StephennFernandes commented 7 months ago

Hey, I've been using the CTranslate2 model for inference. I handle concurrent user requests by dynamically batching them together and running inference on them as a single batch.

However, I wanted to know if I could somehow handle newlines (\n) in my input string so that the translated response retains the newline in some form (e.g. \n or any special token).

I've tried alternative solutions like slicing the input string on \n, running inference on the pieces as a batch, and merging them back together, but then the concurrent throughput of my application suffers.

Please let me know if there's a possible solution, or even a hacky way I could achieve this.
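For context, the dynamic-batching setup described above can be sketched roughly as follows. This is a minimal asyncio sketch, not the author's actual code: `translate_batch` is a hypothetical stand-in for the real CTranslate2 batch inference call, and the batch-size/timeout values are illustrative.

```python
import asyncio

# Hypothetical stand-in for the real CTranslate2 batch inference call.
def translate_batch(texts):
    return [f"<translated:{t}>" for t in texts]

class DynamicBatcher:
    """Collects concurrent requests and translates them as one batch."""

    def __init__(self, max_batch_size=8, max_wait_s=0.01):
        self.queue = asyncio.Queue()
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s

    async def start(self):
        # Background worker that drains the queue into batches.
        asyncio.create_task(self._worker())

    async def translate(self, text):
        # Each caller enqueues its text and awaits its own future.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def _worker(self):
        while True:
            # Block until at least one request arrives.
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            # Gather more requests until the batch fills or the wait expires.
            while len(batch) < self.max_batch_size:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = translate_batch([text for text, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

async def main():
    batcher = DynamicBatcher()
    await batcher.start()
    # Two "concurrent users" end up in the same inference batch.
    return await asyncio.gather(batcher.translate("hello"),
                                batcher.translate("world"))
```

Even if the timeout splits requests across batches, each future still receives its own result, so correctness does not depend on the batching boundaries.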

StephennFernandes commented 7 months ago

@prajdabre @jaygala24 Hey guys, could you please help with this?

I already use the paragraph_batch_translate_multilingual function to batch-translate long paragraphs; however, each sample in the batch is a single user's request, and multiple concurrent requests are combined into a single batch (sort of my own way of dynamic batching).

jaygala24 commented 7 months ago

IndicTrans2 currently only supports translation at the sentence level. There is no guarantee that the model will preserve newline characters (\n), so I would recommend segmenting the data on \n, translating the segments with the paragraph_batch_translate_multilingual function, and then rejoining them with \n.
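The split-translate-rejoin workaround can also preserve the original concurrent throughput if all lines from all requests are flattened into one batch and then unflattened afterwards. A minimal sketch, assuming a hypothetical `batch_translate` placeholder for the real paragraph_batch_translate_multilingual call (the language codes are illustrative):

```python
def batch_translate(lines, src_lang, tgt_lang):
    # Placeholder for the actual IndicTrans2 batch translation call.
    return [f"<{tgt_lang}:{line}>" for line in lines]

def translate_multiline(texts, src_lang="eng_Latn", tgt_lang="hin_Deva"):
    """Split each input on \\n, translate all lines in one flat batch,
    then reassemble so every output keeps its original newlines."""
    flat, spans = [], []
    for text in texts:
        lines = text.split("\n")
        # Remember which slice of the flat batch belongs to this request.
        spans.append((len(flat), len(flat) + len(lines)))
        flat.extend(lines)
    translated = batch_translate(flat, src_lang, tgt_lang)
    return ["\n".join(translated[a:b]) for a, b in spans]
```

Because the model never sees a \n, nothing depends on it preserving one; the newlines are reinserted deterministically during reassembly.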

StephennFernandes commented 7 months ago

@jaygala24 thanks a lot Jay.

Additionally, could you please confirm whether paragraph_batch_translate_multilingual can translate extremely long contexts (i.e. beyond 256 tokens)? I've been getting really long results back when using it as a script, but strangely the function sometimes throws a warning stating that the sentence is long and is being truncated to 256 tokens.

What's the context length for paragraph_batch_translate_multilingual? How long a sentence can I feed in without causing issues?

PranjalChitale commented 7 months ago

@StephennFernandes

The IndicTrans2 model is trained with a max_seq_len of 256 tokens. The paragraph_batch_translate_multilingual function operates by implicitly splitting paragraphs into individual sentences, translating them, and then reassembling the results. However, if a paragraph lacks appropriate sentence delimiters, the sentence splitting won't occur, so a chunk larger than 256 tokens gets passed to the model, and this is what triggers the warning.

This scenario is a corner case, primarily arising from documents lacking proper sentence delimiters. Without these delimiters, it is not possible to break the paragraph into constituent sentences. In such cases, the only viable option is to employ a sliding window approach. If any sentence in the batch exceeds 256 tokens, use the sliding window approach to further break it into chunks of 256 tokens, translate, and rejoin. It's important to note that translation quality cannot be guaranteed in these instances, as sentences are improperly segmented, and the model is not trained to handle such cases.
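The chunking approach described above could be sketched as follows. This is an assumption-laden sketch, not repository code: `translate_fn` is a hypothetical single-chunk translation call, and whitespace tokens only approximate the model's subword tokens, so in practice a margin well below 256 is advisable.

```python
def sliding_window_translate(text, translate_fn, max_tokens=256):
    """Break an unsegmentable paragraph into chunks of at most
    max_tokens whitespace-delimited tokens, translate each chunk,
    and rejoin the translations."""
    tokens = text.split()
    # Non-overlapping windows of at most max_tokens tokens each.
    chunks = [" ".join(tokens[i:i + max_tokens])
              for i in range(0, len(tokens), max_tokens)]
    return " ".join(translate_fn(chunk) for chunk in chunks)
```

As noted above, translation quality is not guaranteed here, since the chunk boundaries fall at arbitrary points rather than at sentence boundaries.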

jaygala24 commented 6 months ago

Closing this issue due to inactivity. Feel free to re-open the issue in case of any queries. Thanks!