RFC0147: BO-EN Aligner refactor

Named Concepts

aligner: the aligner referred to here is a pipeline that aligns Tibetan sentences with their equivalent English sentences.

Summary

We are modifying the existing aligner pipeline to address a few issues. The changes include:

Implementing Logging: Introduce a comprehensive logging system for better monitoring and error tracking of the whole process.
Overcoming GitHub API Limits: Migrate the code to the BDRC server and use subprocess to avoid GitHub's API rate limits when handling files.
Parallel Processing of Aligners: Move the main code to the BDRC server and use multiprocessing so that multiple aligner requests can be sent simultaneously. On the Hugging Face server, adapt the code to handle and perform multiple alignments at the same time, improving overall throughput.
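
As an illustration of the logging and subprocess points above, here is a minimal sketch rather than the project's actual code. It assumes files are fetched with the git command-line client through Python's subprocess module instead of the GitHub REST API. The logger name, the log file name, the function names, and the repository URL layout are all hypothetical.

```python
import logging
import subprocess

logger = logging.getLogger("aligner")  # hypothetical logger name


def configure_logging(log_path="aligner.log"):
    """Send log records to both a file and the console."""
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
        handler.setFormatter(formatter)
        logger.addHandler(handler)


def build_clone_command(org, repo, dest):
    """Build a shallow `git clone` command (assumed GitHub URL layout)."""
    url = f"https://github.com/{org}/{repo}.git"
    return ["git", "clone", "--depth", "1", url, str(dest)]


def run_logged(command):
    """Run a command, log the outcome, and return True on success."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as error:
        # Raised, for example, when the executable is not installed.
        logger.error("could not start %s: %s", command[0], error)
        return False
    if result.returncode != 0:
        logger.error("command failed (%d): %s", result.returncode, result.stderr.strip())
        return False
    logger.info("command succeeded: %s", " ".join(command))
    return True
```

A caller would run `configure_logging()` once at startup and then something like `run_logged(build_clone_command("OpenPecha", "<repo>", "data/<repo>"))` for each repository, so that every failed fetch leaves a record in the log instead of silently breaking the run.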

Dependencies

Infrastructure

BDRC server access
Hugging Face model access

Design Illustrations

Diagram Server Aligner logs

Justification

The current aligner pipeline faces several key issues. It lacks a robust logging system, which makes failures difficult to diagnose, and its reliance on the GitHub API causes frequent failures when rate limits are hit. In addition, the input format needs refinement for better readability, and the inability to run multiple aligners simultaneously slows down the alignment process. Addressing these issues is essential for improving the pipeline's efficiency and reliability.

Testing

Test the enhanced aligner pipeline with 10 pairs of Tibetan (BO) and English (EN) files to verify that the pairs are processed in parallel and that each run is logged.
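
One possible way to assemble the test input is sketched below. It assumes a hypothetical naming scheme in which a Tibetan file `BO0001.txt` corresponds to an English file `EN0001.txt`. The function name and the naming scheme are illustrative only and are not taken from the existing pipeline.

```python
from pathlib import Path


def pair_bo_en_files(bo_dir, en_dir):
    """Pair BO and EN files that share the same ID after the two-letter prefix.

    Assumes hypothetical names such as BO0001.txt and EN0001.txt.
    Returns the matched pairs plus the IDs found on only one side.
    """
    bo_files = {p.stem[2:]: p for p in Path(bo_dir).glob("BO*.txt")}
    en_files = {p.stem[2:]: p for p in Path(en_dir).glob("EN*.txt")}
    shared = sorted(bo_files.keys() & en_files.keys())
    unmatched = sorted(bo_files.keys() ^ en_files.keys())
    pairs = [(bo_files[i], en_files[i]) for i in shared]
    return pairs, unmatched
```

Checking that the function returns exactly 10 pairs and an empty `unmatched` list before starting the run would catch a missing or misnamed file early, before any alignment request is sent.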

Implementation Steps


[x] OpenPecha/mt-aligner-prep-tool#1 Estimated time: 2 hours
Actual time:
[x] OpenPecha/mt-aligner-prep-tool#2 Estimated time: 1 hour
Actual time:
[x] OpenPecha/mt-aligner-prep-tool#3 Estimated time: 2 hours
Actual time:
[x] Aligner API: Refactor the current Gradio app to create an efficient API. Estimated time:
Actual time:
[x] Server: Implement multiprocessing techniques to enhance the speed of alignment tasks. Estimated time:
Actual time:
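
The multiprocessing step on the server could look roughly like the sketch below, which uses `multiprocessing.Pool` from the Python standard library. The worker `send_alignment_request` is a hypothetical placeholder: it only echoes its input so that the example runs offline, whereas a real worker would call the aligner API. The function names, the result format, and the default pool size are assumptions.

```python
import logging
from multiprocessing import Pool

logger = logging.getLogger("aligner.server")  # hypothetical logger name


def send_alignment_request(pair):
    """Placeholder for one call to the aligner API.

    A real worker would send the BO/EN file pair to the aligner
    endpoint; here it only echoes the pair so the sketch runs offline.
    """
    bo_file, en_file = pair
    return {"bo": bo_file, "en": en_file, "status": "submitted"}


def align_in_parallel(pairs, processes=4):
    """Dispatch every pair to a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map() blocks until all workers finish and preserves input order.
        results = pool.map(send_alignment_request, pairs)
    logger.info("submitted %d alignment requests", len(results))
    return results
```

When this is run as a script, the call to `align_in_parallel` should sit under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module on platforms that do not fork. The pool size would likely need tuning against how many simultaneous alignments the Hugging Face side can actually serve.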

Reviewed By

@TenzinGayche
