Closed — StephennFernandes closed this issue 6 months ago
@prajdabre @jaygala24 hey guys could you please help with this.
I already do use paragraph_batch_translate_multilingual
function to batch-translate long paragraphs; however, each sample in the batch comes from a single user's request, and multiple concurrent requests are combined into a single batch (sort of my own way of doing dynamic batching).
IndicTrans2 currently only supports translation at the sentence level. There's no guarantee that the model will preserve newline characters (`\n`), so I would recommend segmenting the data on `\n`, translating the segments with the `paragraph_batch_translate_multilingual` function, and then rejoining them with `\n`.
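That split-and-rejoin step can be sketched as below. `translate_batch` here is a hypothetical stand-in for however you invoke `paragraph_batch_translate_multilingual` in your setup (only the list-in/list-out shape is assumed):

```python
def translate_preserving_newlines(text, translate_batch):
    # Split on '\n', translate only the non-empty segments as one
    # batch, then reassemble so every newline ends up where it was.
    segments = text.split("\n")
    translated = iter(translate_batch([s for s in segments if s.strip()]))
    return "\n".join(next(translated) if s.strip() else s for s in segments)

# Demo with a dummy "translator" (uppercasing) in place of the real model:
result = translate_preserving_newlines("hello\n\nworld",
                                       lambda xs: [x.upper() for x in xs])
```

Empty segments (from consecutive newlines) are passed through verbatim, so the output has exactly the same line layout as the input.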
@jaygala24 thanks a lot Jay.
Additionally, could you please confirm whether `paragraph_batch_translate_multilingual` can translate extremely long context (i.e. beyond 256 tokens)? I've been getting really long-context results back when using it as a script, but strangely, the function randomly throws a warning stating that the sentence is long and is being truncated to 256 tokens.
What's the context length for `paragraph_batch_translate_multilingual`? How long a sentence can I feed in without causing any issues?
@StephennFernandes
The IndicTrans2 model is trained with a `max_seq_len` of 256 tokens. The `paragraph_batch_translate_multilingual` function operates by implicitly dividing paragraphs into individual sentences for translation, and then reassembling them. However, if a paragraph lacks appropriate sentence delimiters, the sentence splitting won't occur; a chunk larger than 256 tokens is then passed to the model, and this is what triggers the warning.
This scenario is a corner case, primarily arising from documents lacking proper sentence delimiters. Without these delimiters, it is not possible to break the paragraph into constituent sentences. In such cases, the only viable option is to employ a sliding window approach. If any sentence in the batch exceeds 256 tokens, use the sliding window approach to further break it into chunks of 256 tokens, translate, and rejoin. It's important to note that translation quality cannot be guaranteed in these instances, as sentences are improperly segmented, and the model is not trained to handle such cases.
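The sliding-window fallback described above can be sketched as follows. `tokens` is assumed to be the tokenizer output for one over-long sentence (the actual tokenizer depends on your setup):

```python
def sliding_window_chunks(tokens, max_len=256):
    # Non-overlapping windows of at most max_len tokens. Translating
    # each chunk and concatenating the outputs approximates the
    # sliding-window fallback; as noted above, quality is not
    # guaranteed, since chunk boundaries ignore sentence structure.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# A 600-token sequence becomes chunks of 256, 256, and 88 tokens:
chunks = sliding_window_chunks(list(range(600)))
```

Each chunk then fits within the model's 256-token limit and can be translated independently before rejoining.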
Closing this issue due to inactivity. Feel free to re-open the issue in case of any queries. Thanks!
Hey, I've been using the CTranslate2 model for inference. I handle concurrent user requests by dynamically batching them together and running inference on them as a batch.
However, I wanted to know if I could somehow handle newlines (`\n`) in my input string so that the translated response retains the newline token somehow (e.g. `\n` or any special token). I've tried alternative solutions like slicing the input string on `\n`, running the separate pieces through inference as a batch, and merging them back together, but concurrent throughput in my application suffers. Please let me know in case there's a possible solution, or a hacky way I could achieve this.