Housekeeping
*Please add the reference to the `RFC` title in the specified format, e.g. `[RFC9999]` if the corresponding RFW is `[RFW9999]`.*
*Please add `[RFC_id]`, e.g. `[RFC_9999]`, to the titles of this `RFC` and the related `PR's`.*
ALL BELOW FIELDS ARE REQUIRED
Named Concepts
- **pre-alignment**: using text aligner software (like [vecalign](https://github.com/thompsonb/vecalign)) to align a text pair before the alignments are corrected manually.
- **MT**: Machine Translation
Summary
Automate text distribution for the pre-alignment and manual correction tasks in the Monlam AI MT dataset generation workflow.
Useful References
Text Pair Collection: https://github.com/OpenPecha-Data/C1A81F448/
*Describe useful parallels and learnings from other requests, or work in previous projects.*
- What similar work have we already successfully completed?
- Is this something that has already been built by others?
- What other related learnings do we have?
- Is there useful academic literature or are there other articles related to this topic? (provide links)
- Have we built a relevant prototype previously?
- Do we have a rough mock for the UI/UX?
- Do we have a schematic for the system?
Unresolved Questions
- What is there that is unresolved (and will be resolved as part of fulfilling this request)?
- Are there other requests with the same or similar problems to solve?
Parts of the System Affected
- Which parts of the current system are affected by this request? NA
- What other open requests are closely related with this request? NA
- Does this request depend on fulfillment of any other request? NA
- Does any other request depend on the fulfillment of this request? NA
Future possibilities
A web app that can:
- detect when a new text pair is added and start the automation
- track progress
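The two capabilities listed above could be prototyped along the following lines. This is only a minimal sketch, not a design from this RFC: the `ProgressTracker` class, the stage names in `STAGES`, and the text pair IDs are all hypothetical and chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Hypothetical stage names; the real workflow may use different ones.
STAGES = ("queued", "opf_created", "views_generated", "pre_aligned", "exported")


@dataclass
class ProgressTracker:
    """Remembers which text pairs have been seen and the stage each one reached."""

    status: Dict[str, str] = field(default_factory=dict)

    def detect_new(self, pair_ids: Iterable[str]) -> List[str]:
        """Return the IDs not seen before and register them as 'queued'."""
        new_ids = []
        for pair_id in pair_ids:
            if pair_id not in self.status:
                self.status[pair_id] = "queued"
                new_ids.append(pair_id)
        return new_ids

    def advance(self, pair_id: str, stage: str) -> None:
        """Record that a known text pair has reached the given stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if pair_id not in self.status:
            raise KeyError(f"unknown text pair: {pair_id}")
        self.status[pair_id] = stage

    def summary(self) -> Dict[str, int]:
        """Count how many text pairs are currently in each stage."""
        return dict(Counter(self.status.values()))
```

In a web app, `detect_new` would be called with the IDs currently listed in the text pair collection, and the returned IDs would be handed to the automation.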
Infrastructure
- All the scripts run on GitHub Actions
- All scripts are written in Python
- All the data is stored on GitHub
Testing
- [ ] test if OPFs are created for the text pair
- [ ] test if OPFs are added to the collection
- [ ] test if views are generated for the text pair
- [ ] test if views are copied to /raw-alignments/input
- [ ] test if the contents of /raw-alignments/output are exported to the MT repo
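As an illustration of how one of these checks might be automated, the sketch below covers the "views are copied to /raw-alignments/input" item using only the Python standard library. The function `copy_views_to_input`, the `.txt` view files, and the directory layout under a temporary folder are assumptions made for this example; they are not taken from the actual scripts.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def copy_views_to_input(views_dir: Path, repo_root: Path) -> List[Path]:
    """Copy every .txt view into <repo_root>/raw-alignments/input (hypothetical helper)."""
    input_dir = repo_root / "raw-alignments" / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for view in sorted(views_dir.glob("*.txt")):
        target = input_dir / view.name
        shutil.copy2(view, target)
        copied.append(target)
    return copied


def check_views_copied() -> bool:
    """Build a throwaway text pair, run the copy step, and verify the result."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        views = root / "views"
        views.mkdir()
        # Two made-up view files standing in for the source and target texts.
        (views / "bo.txt").write_text("source line\n", encoding="utf-8")
        (views / "en.txt").write_text("target line\n", encoding="utf-8")

        copied = copy_views_to_input(views, root / "repo")

        names_match = [p.name for p in copied] == ["bo.txt", "en.txt"]
        content_match = all(
            p.read_text(encoding="utf-8") == (views / p.name).read_text(encoding="utf-8")
            for p in copied
        )
        return names_match and content_match
```

The other checklist items could follow the same pattern: set up a small fixture in a temporary directory, run one step, and assert on the files it is expected to produce.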
Documentation
*Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.*
Version History
*History of changes to this RFC. Following semantic versioning pattern and v0.1.2 for style.*
Recordings
*Links to audio recordings of related discussion.*
Work Phases
Planning
Keep original naming and structure, and keep as first section in Work phases section
[x] RFC completed on:
Estimated time: 2 weeks
Actual time: 3 weeks
[x] RFC reviewed and approved by:
Estimated time:
Actual time:
Implementation
A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.
[x] OpenPecha/mt-training-data-prep-tools#7
Estimated time: 5 hrs
Actual time: 1 day
[x] OpenPecha/Data-Pipeline-Manager#85
Estimated time: 5 hrs
Actual time: 1 day
[x] OpenPecha/mt-training-data-prep-tools#2
Estimated time: 2 hrs
Actual time: 1 day
[x] OpenPecha/mt-training-data-prep-tools#3
Estimated time: 3 hrs
Actual time: 3 days
[x] OpenPecha/mt-training-data-prep-tools#4
Estimated time: 4 hrs
Actual time: 7 days
[x] OpenPecha/mt-training-data-prep-tools#8
Estimated time: 2 hrs
Actual time: 9 days
[x] OpenPecha/tibetan-aligner-hf-space#1
Estimated time: 2 days
Actual time: 9 days
[x] OpenPecha/mt-training-data-prep-tools#5
Estimated time: 4 hrs
Actual time: 20 days
Completion
[x] Tested and approved by: @username, @username
Estimated time:
Actual time:
[x] Documentation approved @evanyerburgh
Estimated time:
Actual time:
Reference-Level Explanation
![Monlam AI - MT 2023-03-20 15 19 49 excalidraw](https://user-images.githubusercontent.com/16164304/226313808-256fddce-9ec9-4406-bf57-0d1eb678b774.png)
**Auto Align Workflow**
![auto_align_workflow](https://user-images.githubusercontent.com/16164304/230068831-7f9021cd-43aa-4f74-b683-999fe73528d4.png)
Alternatives
Manually distributing the texts.
Rationale
To avoid time-consuming, low-ROI tasks.
Drawbacks
Needs a developer to maintain the automation.