RFW0147: BO-EN Aligner refactor

Summary

We have an aligner pipeline that has several issues. We want to resolve those issues by refactoring the pipeline.

Key Concepts

aligner: the aligner referred to here is a pipeline that aligns Tibetan sentences with their equivalent English sentences.

Context

We receive articles and books translated from Tibetan to English and from English to Tibetan. These materials can't be used directly to train our machine translation model. To make them ready for training, we need to extract sentence or segment pairs from the translated books or articles. We therefore developed an aligner pipeline that takes a pair of book or article repos, generates aligned pairs of segments, and publishes those aligned pairs in another repo whose name is prefixed with TM.

The current pipeline has the following issues:

It lacks a proper logging system, so we cannot tell where the pipeline is failing.
It uses the GitHub API, which is rate-limited. Hitting the limit breaks the run.
The aligner input should have one pair ID per line for better readability.
Multiple aligners should run simultaneously so that alignment finishes faster.
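
To illustrate the logging issue above, here is a minimal sketch of how each step of the pipeline could be wrapped so that a failure is recorded together with the pair ID and the step name. It uses only Python's standard logging module. The names setup_logger and run_stage, the logger name "aligner", and the log message layout are assumptions made for this example and are not taken from the existing package.

```python
import logging


def setup_logger(log_path=None):
    """Return a logger that writes timestamped records to stderr
    and, if log_path is given, to a file as well."""
    logger = logging.getLogger("aligner")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated setup
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = [logging.StreamHandler()]
    if log_path:
        handlers.append(logging.FileHandler(log_path, encoding="utf-8"))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def run_stage(logger, pair_id, stage, func):
    """Run one pipeline step and log its start, success, or failure.

    Returns True on success and False on failure. On failure the full
    traceback is written to the log, so the failing step is visible.
    """
    logger.info("pair=%s stage=%s started", pair_id, stage)
    try:
        func()
    except Exception:
        logger.exception("pair=%s stage=%s failed", pair_id, stage)
        return False
    logger.info("pair=%s stage=%s finished", pair_id, stage)
    return True
```

With this shape, a call such as run_stage(logger, pair_id, "align", some_callable) leaves one log line per step, and a failed step shows up with its traceback next to the pair ID that triggered it.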

Outputs

Standard log of the process
No more GitHub API rate-limit errors
Multiple aligners running at the same time
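
One possible way to run several alignments at the same time is sketched below with the standard library's concurrent.futures. This is an illustration under assumptions, not the planned design: align_many is a hypothetical helper, align_one stands in for whatever function aligns a single pair, and the default of 4 workers is arbitrary. A thread pool is used here for simplicity. If the alignment step turns out to be CPU-bound, a process pool may be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def align_many(pair_ids, align_one, max_workers=4):
    """Call align_one(pair_id) for every pair ID concurrently.

    Returns two dicts keyed by pair ID: one with the results of the
    successful runs and one with the exceptions of the failed runs.
    A failure in one pair does not stop the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(align_one, pid): pid for pid in pair_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:
                errors[pid] = exc
    return results, errors
```

Collecting errors per pair ID, instead of letting the first exception abort the batch, keeps one bad pair from blocking the rest and gives the log a clear list of what needs to be retried.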

Inputs

Previous pipeline package
Pair IDs that need to be aligned
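
As a small sketch of the one-pair-ID-per-line input, the helper below turns such text into a list of IDs. The function name read_pair_ids is hypothetical, pair IDs are treated as opaque strings because their format is not specified here, and skipping blank lines and duplicates is an assumption about the desired behaviour.

```python
def read_pair_ids(text):
    """Return the pair IDs found in text that holds one ID per line.

    Surrounding whitespace is stripped, blank lines are skipped, and
    repeated IDs are kept only once, in order of first appearance.
    """
    seen = set()
    pair_ids = []
    for line in text.splitlines():
        pair_id = line.strip()
        if not pair_id or pair_id in seen:
            continue
        seen.add(pair_id)
        pair_ids.append(pair_id)
    return pair_ids
```

The input file could then be loaded with something like read_pair_ids(Path(input_path).read_text(encoding="utf-8")), where input_path is whatever location the refactored pipeline settles on.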

Timeline

Specify the expected delivery date for the project.

References

Include any relevant links or resources for additional context or information.

OpenPecha / Requests