Modularization for testing

skytin1004 commented 1 month ago

Hi @timothycdc,

I learned about this project through Lee and would like to contribute by enhancing the accuracy of the translation project through extensive testing and validation. To facilitate the creation of diverse test cases, I have modularized the project structure. You can refer to my draft pull request (PR #1).

I would like to discuss with the team whether this approach aligns with our direction. If the direction is not yet determined, I suggest proceeding with this modular structure as a baseline for writing and validating test cases.

Please share your thoughts and feedback. If the team agrees with this approach, I will finalize the modularization and proceed with writing the test cases.

timothycdc commented 1 month ago

Hi @skytin1004, thank you for your interest in contributing.

Some comments:

Thanks for the config and modularisation, this is much needed because I think our code can be improved a lot.
Font management is useful too, since some fonts only support certain languages.
We are giving a presentation on Aug 5th which involves a demo, so I will accept this PR after the presentation.
I like your idea of testing. However, LLMs are probabilistic and their outputs can vary: there can be more than one correct translation for a given text or image. So I am worried difflib might not be sufficient.
- I think the more important areas of testing are to make sure that the translation process still works even if we modify different parts of it (e.g. using a different LLM, text detection algorithm, or a different text drawing method other than PIL).
- Since this is a demo repo after all, I want an easy way for developers to run translations on example markdown files and images that can cause edge cases here. This would probably be more helpful.
- I think that developers just want to modify the demo slightly to match their workflows, test run on md files, and make sure things look right.
- If you are also interested in benchmarking model translations, it is worth looking into using LLMs as a judge, and then perhaps coming up with some scoring or metrics. This would be a nicer feature to have down the road.
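To make the difflib concern concrete, here is a minimal sketch. The three sentences are made-up examples, not taken from the repo. It only shows that a character-level ratio rewards surface overlap, so a valid paraphrase can score lower than a sentence with the wrong meaning.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Hypothetical reference and candidate outputs (illustrative only).
reference = "Please restart the computer to finish the installation."
paraphrase = "To complete the setup, reboot your machine."
wrong_meaning = "Please restart the computer to cancel the installation."

score_paraphrase = similarity(reference, paraphrase)
score_wrong = similarity(reference, wrong_meaning)

# The paraphrase is an acceptable translation but shares few characters with
# the reference; the wrong sentence differs by one word and scores higher.
print(f"valid paraphrase: {score_paraphrase:.2f}")
print(f"wrong meaning:    {score_wrong:.2f}")
```

A threshold on this ratio would therefore reject good translations and accept bad ones, which is why a judge-style metric may be a better fit later on.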

I want to highlight more information and other areas of priority to see if you are interested in contributing there.

Our team actually built a Django/GitHub app for our university demo (it is private but I am uploading a public version for you). The problem we have is that many users just want to demo the capabilities of LLMs without having to host/install their own GitHub app.

Currently in this repo, we only have one working notebook which translates local images.

So my goal for this repo is to have some Python scripts that can translate markdown files. The idea is that they look for .md and image files in a folder, run the necessary image/text translations, and produce new translated images/md files in an output directory. Then we can have another notebook for the same feature so devs can play around with it.
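A rough sketch of the discovery step, assuming a hypothetical layout where outputs are mirrored under `output_dir/<lang>/`. The function names and the list of image suffixes are placeholders for illustration, not something already in the repo.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the pipeline supports.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def find_inputs(input_dir: Path) -> tuple[list[Path], list[Path]]:
    """Return (markdown files, image files) found anywhere under input_dir."""
    md_files: list[Path] = []
    image_files: list[Path] = []
    for path in sorted(input_dir.rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix == ".md":
            md_files.append(path)
        elif suffix in IMAGE_SUFFIXES:
            image_files.append(path)
    return md_files, image_files


def output_path(src: Path, input_dir: Path, output_dir: Path, lang: str) -> Path:
    """Mirror src under output_dir/<lang>/, keeping its relative location."""
    return output_dir / lang / src.relative_to(input_dir)
```

If the output directory lives inside the input folder, a second run would pick up already translated files, so keeping the two directories separate is probably simpler.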

My team already wrote most of the important logic in the app but since it is badly structured, we don't want to focus on the app anymore, just a public repo for others to try out with their own local examples.

The translation process I am planning is like this: (similar to the app)

1. Find out the translation language from the user. We use two-letter ISO codes and look them up in a yml file like this to get the correct font for image translation.
2. Translate all image files following the same method as in the repo notebook. Store them in the output directory.
3. For each md file, split the text into chunks so that each request sent to Azure OpenAI stays within the token limit.
4. For each chunk, add a translation prompt on top of it and send it to the LLM to translate. We used asyncio for the app, but I think we should change to nest-asyncio for nested loops, which causes fewer errors.
5. Combine the translated chunks back together.
6. Run a regex over the markdown links and replace them with links to the translated images.
- In the app, we had a hashing function with markdown images to prevent name collisions in the GitHub repo. This is unnecessary for the demo because we will be storing all images in the same directory.
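The link-rewriting step could look roughly like this. It is a sketch under assumptions: the pattern only covers inline image links of the form `![alt](path "title")`, and the helper name and the `translated_images` directory are hypothetical. Since the demo keeps all images in one directory, only the file name is preserved.

```python
import re
from pathlib import PurePosixPath

# Inline markdown image: ![alt](target "optional title")
IMAGE_LINK = re.compile(
    r'!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)(?P<title>\s+"[^"]*")?\)'
)
# Anything that starts with a URL scheme such as https:// is treated as remote.
REMOTE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def rewrite_image_links(markdown: str, translated_dir: str) -> str:
    """Point local image links at the same file name inside translated_dir."""

    def replace(match: re.Match) -> str:
        target = match.group("target")
        if REMOTE.match(target):
            return match.group(0)  # leave remote images untouched
        name = PurePosixPath(target).name
        title = match.group("title") or ""
        return f"![{match.group('alt')}]({translated_dir}/{name}{title})"

    return IMAGE_LINK.sub(replace, markdown)


print(rewrite_image_links("![Logo](./img/logo.png)", "translated_images"))
```

Reference-style images and HTML `<img>` tags are not handled here and would need their own patterns.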

The translation logic for our app can be found here
- Interestingly, the LLM would hallucinate translations for very short sequences of text (that are 1-2 lines long). We use a conditional prompt for a simple workaround. More details here.
Chunking notebook. We use tiktoken to count tokens for GPT-4o. GPT-4o mini was recently released, and I am not sure whether it uses the same tokenizer.
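For reference, a minimal line-based chunker in the spirit of that notebook (not a copy of it). The token counter is passed in as a function so the tokenizer can be swapped. The default here is a crude whitespace count used only so the sketch runs without extra dependencies; it is an assumption, not what the app does.

```python
from typing import Callable, List


def count_tokens_approx(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def chunk_markdown(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_approx,
) -> List[str]:
    """Group whole lines into chunks whose summed token count stays <= max_tokens.

    A single line that is larger than max_tokens becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for line in text.splitlines(keepends=True):
        line_tokens = count_tokens(line)
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("".join(current))
    return chunks


# With tiktoken installed, a real counter could be passed instead, e.g.
# (assuming a tiktoken version that recognises the model name):
#   enc = tiktoken.encoding_for_model("gpt-4o")
#   chunk_markdown(text, 2000, lambda s: len(enc.encode(s)))
```

Because lines are kept intact with their line endings, joining the chunks gives back the original text, which makes the recombination step trivial.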

skytin1004 commented 1 month ago

Hi @timothycdc,

Thank you for your feedback. I appreciate your suggestions.

I agree that due to the probabilistic nature of LLMs, difflib may not be sufficient for accurate evaluation.

I will look into better methods to evaluate diverse translation outputs accurately. The idea of using LLMs as a judge for benchmarking translations also seems good.

I understand that you will accept the PR after the presentation on August 5th. I wish you all the best in your preparations.

In the meantime, I will check the repositories you shared.

I think it would be beneficial to create a notebook that imports the modularized code and allows users to perform the translation process step by step.
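As a sketch of what one such step might look like, assuming the translation backend is passed in as an async callable: `fake_translator` below is a hypothetical stand-in for the real LLM call, which also makes it possible to check the pipeline wiring without depending on model output.

```python
import asyncio
from typing import Awaitable, Callable, List

# Any async function (chunk, language code) -> translated chunk can be plugged in.
Translator = Callable[[str, str], Awaitable[str]]


async def fake_translator(chunk: str, lang: str) -> str:
    """Deterministic placeholder for the real LLM request."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{lang}] {chunk}"


async def translate_chunks(chunks: List[str], lang: str, translate: Translator) -> str:
    """Translate all chunks concurrently and join them in the original order."""
    translated = await asyncio.gather(*(translate(c, lang) for c in chunks))
    return "".join(translated)


result = asyncio.run(translate_chunks(["Hello\n", "World\n"], "ko", fake_translator))
print(result)
```

In a notebook an event loop is already running, so `await translate_chunks(...)` in a cell would replace the `asyncio.run` call.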

Once my PR works correctly, I will change its status from Draft PR to Open PR and add a comment to let you know.

Thank you again for your guidance.

skytin1004 commented 1 month ago

@timothycdc, could you please check this file, twitter.py? It looks like there might be an exposed key.

timothycdc commented 1 month ago

Hi, I’m not at my computer now. Thanks for the spot. I’ve asked my team member to disable it for now.

Imperial-EE-Microsoft / microsoft_translation_public

Modularization for testing #2