Named Concepts
aligner: the aligner referred to here is a pipeline that aligns Tibetan sentences with their equivalent English sentences.
Summary
We are modifying the existing aligner pipeline to address several issues.
The planned changes are:
Implementing Logging: Introduce a comprehensive logging system for better monitoring and error tracking of the whole process.
Overcoming GitHub API Limits: Migrate the code to the BDRC server and use subprocess to avoid GitHub's API rate limits when handling files.
Parallel Processing of Aligners: Move the main code to the BDRC server and use multiprocessing to send multiple aligner requests simultaneously. On the Hugging Face server, adapt the code to perform multiple alignments at once, improving overall throughput.
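The three changes above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the logger name, the function names, the shallow-clone flags, and the file names are all hypothetical, and `send_alignment_request` is a stand-in for whatever call the real pipeline makes to the alignment service.

```python
import logging
import subprocess
from multiprocessing import Pool
from pathlib import Path

logger = logging.getLogger("aligner_pipeline")  # hypothetical logger name


def setup_logging(log_file=None):
    """Log INFO and above with timestamps; to log_file if given, else to stderr."""
    logging.basicConfig(
        filename=log_file,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(processName)s %(message)s",
    )


def build_clone_command(repo_url, dest_dir):
    """Return a plain git command, so no GitHub REST API call is needed."""
    return ["git", "clone", "--depth", "1", repo_url, str(dest_dir)]


def clone_repo(repo_url, dest_dir):
    """Clone through subprocess; raises CalledProcessError if git fails."""
    subprocess.run(build_clone_command(repo_url, dest_dir), check=True)
    logger.info("cloned %s into %s", repo_url, dest_dir)


def send_alignment_request(pair):
    """Hypothetical stand-in for the real request to the alignment service."""
    bo_path, en_path = pair
    return f"aligned {Path(bo_path).name} with {Path(en_path).name}"


def align_pair(pair):
    """Align one pair; log the failure and return None so the batch keeps going."""
    try:
        result = send_alignment_request(pair)
        logger.info("done: %s", result)
        return result
    except Exception:
        logger.exception("alignment failed for %s", pair)
        return None


def align_all(pairs, workers=4):
    """Run align_pair over all pairs in a pool of worker processes, in input order."""
    with Pool(processes=workers) as pool:
        return pool.map(align_pair, pairs)


if __name__ == "__main__":
    setup_logging()
    # Hypothetical file names, used only to exercise the sketch.
    pairs = [("BO0001.txt", "EN0001.txt"), ("BO0002.txt", "EN0002.txt")]
    print(align_all(pairs, workers=2))
```

Catching the exception inside the worker means one failed pair is recorded in the log instead of aborting the whole batch, which is the kind of failure diagnosis the logging change is meant to support.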
Dependencies
Infrastructure
BDRC server access
Hugging Face model access
Design Illustrations
Justification
The current aligner pipeline faces key issues: it lacks a robust logging system, making it difficult to diagnose failures, and its reliance on the GitHub API leads to frequent code breakdowns due to rate limits. Additionally, the input format needs refinement for better readability, and the inability to run multiple aligners simultaneously slows down the alignment process. Addressing these issues is essential for improving the pipeline's efficiency and reliability.
Testing
Test the enhanced aligner pipeline with 10 pairs of Tibetan (BO) and English (EN) files to verify that alignments run in parallel and that the logs record each step and any failure.
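A test run like this could be set up with a small harness such as the one below. It is a sketch only: it assumes that BO and EN files live in separate directories and share a file name, which the RFC does not specify, and the function names and placeholder file contents are hypothetical.

```python
import tempfile
from pathlib import Path


def make_fixture(root, n_pairs=10):
    """Create n_pairs placeholder BO/EN files under root and return both directories."""
    bo_dir, en_dir = Path(root) / "bo", Path(root) / "en"
    bo_dir.mkdir(parents=True)
    en_dir.mkdir(parents=True)
    for i in range(1, n_pairs + 1):
        name = f"{i:04d}.txt"
        # Placeholder text only, not a real aligned pair.
        (bo_dir / name).write_text("བཀྲ་ཤིས་བདེ་ལེགས།\n", encoding="utf-8")
        (en_dir / name).write_text("Greetings.\n", encoding="utf-8")
    return bo_dir, en_dir


def collect_pairs(bo_dir, en_dir):
    """Pair files by shared file name; raise if any file has no partner."""
    bo_files = {p.name: p for p in Path(bo_dir).glob("*.txt")}
    en_files = {p.name: p for p in Path(en_dir).glob("*.txt")}
    unmatched = sorted(set(bo_files) ^ set(en_files))
    if unmatched:
        raise ValueError(f"files without a partner: {unmatched}")
    return [(bo_files[name], en_files[name]) for name in sorted(bo_files)]


def summarize(results):
    """Count successes and failures, treating None as a failed alignment."""
    failed = sum(1 for r in results if r is None)
    return {"total": len(results), "succeeded": len(results) - failed, "failed": failed}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        bo_dir, en_dir = make_fixture(root)
        pairs = collect_pairs(bo_dir, en_dir)
        # Stand-in for the real pipeline call; replace with the actual aligner.
        results = [f"{bo.name}:{en.name}" for bo, en in pairs]
        print(summarize(results))
```

Checking the summary against the log file after a run gives a quick way to confirm that every one of the 10 pairs was either aligned or reported as failed.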
Implementation Steps
List all the steps involved during implementation.
[x] OpenPecha/mt-aligner-prep-tool#1
Estimated time: 2 hours
Actual time:
[x] OpenPecha/mt-aligner-prep-tool#2
Estimated time: 1 hour
Actual time:
[x] OpenPecha/mt-aligner-prep-tool#3
Estimated time: 2 hours
Actual time:
[x] Aligner API: Refactor the current Gradio app to expose an efficient API.
Estimated time:
Actual time:
[x] Server: Implement multiprocessing techniques to enhance the speed of alignment tasks.
Estimated time:
Actual time:
RFC0147: BO-EN Aligner refactor
Reviewed By
@TenzinGayche