We have an aligner pipeline that has a few issues. We want to resolve those issues by refactoring the pipeline.
Key Concepts
aligner: the aligner referred to here is a pipeline that aligns Tibetan sentences with their equivalent English sentences
Context
We receive articles and books translated from Tibetan to English and from English to Tibetan. These materials cannot be used directly to train our machine translation model. To make them ready for training, we need to extract sentence or segment pairs from the translated books and articles. We therefore developed an aligner pipeline that takes pairs of book or article repos, generates aligned pairs of segments, and publishes those aligned pairs in a separate repo whose name starts with TM.
The current pipeline has the following issues:
It lacks a proper logging system, so we cannot check where the pipeline is failing.
It uses the GitHub API, which has a rate limit. Hitting that limit breaks the pipeline.
The input of the aligner should have one pair ID per line for better readability.
Multiple aligners should run simultaneously so that alignment finishes faster.
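The logging issue above could be addressed along the following lines. This is a minimal sketch, not the pipeline's actual code: the helper names build_logger and run_stage, the logger name "aligner", and the idea of splitting the pipeline into named stages are all assumptions made for illustration. It only relies on Python's standard logging module.

```python
import logging


def build_logger(name="aligner", log_file=None):
    """Return a logger that writes timestamped records to stderr and, optionally, to a file.

    The name "aligner" and the optional log_file argument are illustrative assumptions.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        handlers = [logging.StreamHandler()]
        if log_file:
            handlers.append(logging.FileHandler(log_file, encoding="utf-8"))
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


def run_stage(logger, pair_id, stage_name, stage_fn):
    """Run one (hypothetical) pipeline stage and log its start, success, or failure."""
    logger.info("pair=%s stage=%s started", pair_id, stage_name)
    try:
        result = stage_fn(pair_id)
    except Exception:
        # logger.exception records the message together with the full traceback
        logger.exception("pair=%s stage=%s failed", pair_id, stage_name)
        raise
    logger.info("pair=%s stage=%s finished", pair_id, stage_name)
    return result
```

Wrapping each step of the pipeline in run_stage would make the log show which pair ID and which stage failed, which is the information the current pipeline does not provide.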
Outputs
Standard log of the process
No more GitHub API limit errors
Multiple aligners running at the same time
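One possible way to run several aligners at once is sketched below with the standard concurrent.futures module. The function name align_many, the align_fn callback, and the default of four workers are assumptions for illustration, not part of the existing package. Threads are used here on the assumption that much of the work is I/O (fetching and publishing repos). If the alignment itself is CPU-bound, a ProcessPoolExecutor may be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def align_many(pair_ids, align_fn, max_workers=4):
    """Run align_fn on every pair ID concurrently.

    Returns two dicts keyed by pair ID: successful results and raised exceptions.
    align_fn stands in for whatever function aligns a single pair (hypothetical).
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(align_fn, pair_id): pair_id for pair_id in pair_ids}
        for future in as_completed(futures):
            pair_id = futures[future]
            try:
                results[pair_id] = future.result()
            except Exception as exc:
                # One failing pair should not stop the remaining pairs
                errors[pair_id] = exc
    return results, errors
```

Collecting failures per pair ID instead of stopping at the first exception lets the remaining pairs finish, and the failed pairs can be retried afterwards.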
Inputs
Previous pipeline package
Pair IDs that need to be aligned
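A reader for an input file with one pair ID per line might look like the sketch below. The function name read_pair_ids is hypothetical, and skipping blank lines, lines starting with "#", and duplicate IDs is an assumed convenience rather than a stated requirement. The real pair ID format is not specified here, so the function treats each ID as an opaque string.

```python
def read_pair_ids(text):
    """Parse pair IDs from text that holds one ID per line.

    Blank lines, lines starting with '#', and repeated IDs are skipped
    (assumed behaviour, not taken from the existing pipeline).
    """
    seen = set()
    pair_ids = []
    for line in text.splitlines():
        pair_id = line.strip()
        if not pair_id or pair_id.startswith("#"):
            continue
        if pair_id not in seen:
            seen.add(pair_id)
            pair_ids.append(pair_id)
    return pair_ids
```

For a file on disk, the text can be obtained with pathlib.Path(path).read_text(encoding="utf-8") before calling read_pair_ids.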
Timeline
Specify the expected delivery date for the project.
References
Include any relevant links or resources for additional context or information.
RFW0147: BO-EN Aligner refactor