andrewyng / translation-agent

A question regarding the translation of long texts. #28

Closed · weizjajj closed this 4 months ago

weizjajj commented 5 months ago

In your code, to ensure that long texts do not exceed the model's maximum token limit, you segment the text into chunks. However, during the actual translation process, all of the segmented pieces are still included in the context. Isn't that a problem?

j-dominguez9 commented 4 months ago

Hi @weizjajj, yes, that is accurate. We found in our testing that including the entire text allows for better reflections and translations, since the model has context of the full text, which would be lost if it were processed piecemeal. Even when the text is under the 4k-token output limit of current LLM vendors, the reflections, and thus the translations, were better when the text was processed at a smaller scale (~1k tokens). If you have any further questions, let me know. If not, we'll close this issue, and thank you for trying out translation-agent!
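
For readers following along, here is a minimal sketch of the scheme described above: the text is cut into ~1k-token chunks, but each per-chunk prompt still carries the full source text so the model keeps global context. This is illustrative only, not translation-agent's actual code; `call_llm`, the prompt wording, and the chunking helper are all hypothetical stand-ins.

```python
# Minimal sketch (not translation-agent's actual code) of chunked
# translation that keeps the full text in context for every chunk.
import tiktoken

MAX_CHUNK_TOKENS = 1000  # the ~1k-token chunk size discussed above

def split_into_chunks(text: str, max_tokens: int = MAX_CHUNK_TOKENS) -> list[str]:
    """Greedy token-based split; production code would respect sentence boundaries."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def translate_with_full_context(full_text: str, call_llm,
                                target_lang: str = "Chinese") -> str:
    """Translate chunk by chunk, passing the whole text as context each time.

    `call_llm` is a hypothetical callable (prompt string -> completion string).
    """
    pieces = []
    for chunk in split_into_chunks(full_text):
        prompt = (
            f"Full source text, for context only:\n{full_text}\n\n"
            f"Translate ONLY the following part into {target_lang}:\n{chunk}"
        )
        pieces.append(call_llm(prompt))
    return "".join(pieces)
```

Note that because the full text rides along in every prompt, the per-call input grows with the whole document even though the output stays near 1k tokens, which matters for the discussion below.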

siddhantx0 commented 4 months ago

Wow

weizjajj commented 4 months ago

Thank you for your response. If I understand correctly, the reason for segmenting the text is to have the model translate only about 1,000 tokens at a time, and in such cases outputs of roughly 1,000 tokens or fewer are more accurate than longer ones, correct? However, doesn't this approach pose an issue for input texts that exceed the model's max_context_length? In my testing, when I attempted to translate the first chapter of a novel from English to Chinese, I got an error indicating the input exceeded the model's maximum token limit.
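
Since the full text is sent with every chunk, the input per call is roughly the whole document plus one chunk plus instructions, which is what overflows on a novel chapter. A rough pre-flight check along these lines can catch it before the API does (the 8,192 limit is just a placeholder for whatever model you use):

```python
# Rough pre-flight check for the error described above. The limit is a
# placeholder; substitute your model's actual context window.
import tiktoken

def fits_in_context(text: str, max_context_tokens: int = 8192) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text)) <= max_context_tokens
```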

j-dominguez9 commented 4 months ago

You're right, the case where the text exceeds the model's max_context_length is not accounted for. However, to the extent that the entire text contributes to improving the translation of a given section (although we didn't test this), I would surmise that the full-text context provides little additional benefit once the text is >8k tokens. So in that case, I wouldn't expect quality to suffer if you break the text into sections that do fit within the context length. That is, if your text is longer than max_context_length, splitting it into large sections should not degrade the translation the way dropping the full-text context for each section would.
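
A sketch of this two-level strategy, reusing the hypothetical `split_into_chunks` and `translate_with_full_context` helpers from the earlier sketch: first cut the text into top-level sections that each fit in the context window, then run the usual ~1k-token chunked pass within each section, with that section (rather than the whole document) serving as context. The budget split below is an untested guess, not a recommendation from the maintainers.

```python
# Two-level splitting (illustrative): top-level sections sized to fit
# the context window, then the usual chunked translation within each
# section. Reuses split_into_chunks and translate_with_full_context
# from the sketch above; the 50% budget is a guess, not a tested value.
def translate_long_text(full_text: str, call_llm,
                        max_context_tokens: int = 8192) -> str:
    # Reserve roughly half the window for the section-as-context, leaving
    # room for the chunk being translated, instructions, and the output.
    section_budget = max_context_tokens // 2
    sections = split_into_chunks(full_text, max_tokens=section_budget)
    return "".join(translate_with_full_context(s, call_llm) for s in sections)
```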

methanet commented 4 months ago

@weizjajj closing this for now; we welcome follow-up questions and PRs in case you see an opportunity for improvements.