OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

[RFC0031] Automate text distribution for pre-alignment and manual correction task #104

Closed 10zinten closed 1 year ago

10zinten commented 1 year ago

Work Planning


Table of Contents

- [Housekeeping](#housekeeping)
- [Named Concepts](#named-concepts)
- [Summary](#summary)
- [Reference-Level Explanation](#reference-level-explanation)
- [Alternatives](#alternatives)
  * [Rationale](#rationale)
- [Drawbacks](#drawbacks)
- [Useful References](#useful-references)
- [Unresolved questions](#unresolved-questions)
- [Parts of the system affected](#parts-of-the-system-affected)
- [Future possibilities](#future-possibilities)
- [Infrastructure](#infrastructure)
- [Testing](#testing)
- [Documentation](#documentation)
- [Version History](#version-history)
- [Recordings](#recordings)
- [Work Phases](#work-phases)

Housekeeping

*Please add the ref in the specified format to the `RFC` title, e.g. `[RFC9999]` if the corresponding RFW is `[RFW9999]`.*

*Please add `[RFC_id]`, e.g. `[RFC_9999]`, to the titles of this `RFC` and its related `PRs`.*

ALL FIELDS BELOW ARE REQUIRED

Named Concepts

- **pre-alignment**: using text aligner software (such as [vecalign](https://github.com/thompsonb/vecalign)) to align a text pair before the alignments are corrected manually.
- **MT**: Machine Translation

Summary

Automate the distribution of texts for the pre-alignment and manual correction tasks in the Monlam AI MT dataset generation workflow.

Reference-Level Explanation

![Monlam AI - MT 2023-03-20 15 19 49 excalidraw](https://user-images.githubusercontent.com/16164304/226313808-256fddce-9ec9-4406-bf57-0d1eb678b774.png)

**Auto Align Workflow**

![auto_align_workflow](https://user-images.githubusercontent.com/16164304/230068831-7f9021cd-43aa-4f74-b683-999fe73528d4.png)

Alternatives

Distribute the texts manually.

Rationale

To avoid spending time on time-consuming, low-ROI manual tasks.

Drawbacks

A developer is needed to maintain the automation.

Useful References

Text Pair Collection: https://github.com/OpenPecha-Data/C1A81F448/

*Describe useful parallels and learnings from other requests, or work in previous projects.*

- What similar work have we already successfully completed?
- Is this something that has already been built by others?
- What other related learnings do we have?
- Is there useful academic literature, or are there other articles, related to this topic? (provide links)
- Have we built a relevant prototype previously?
- Do we have a rough mock for the UI/UX?
- Do we have a schematic for the system?

Unresolved Questions

- What is there that is unresolved (and will be resolved as part of fulfilling this request)?
- Are there other requests with the same or similar problems to solve?

Parts of the System Affected

- Which parts of the current system are affected by this request? NA
- What other open requests are closely related to this request? NA
- Does this request depend on the fulfillment of any other request? NA
- Does any other request depend on the fulfillment of this request? NA

Future possibilities

A web app that can:

- detect when a new text pair is added and start the automation
- track progress
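The two capabilities listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the RFC does not define how text pairs are stored or tracked, so the one-directory-per-text-pair layout, the function names `find_new_text_pairs` and `progress`, and the stage names in `STAGES` (loosely modelled on the checks in the Testing section) are all hypothetical.

```python
from pathlib import Path
from typing import List, Set

# Hypothetical stage names, loosely based on the Testing checklist of this RFC.
STAGES = (
    "opfs_created",
    "added_to_collection",
    "views_generated",
    "copied_to_input",
    "exported_to_mt_repo",
)


def find_new_text_pairs(collection_dir: Path, processed: Set[str]) -> List[str]:
    """Return the names of text-pair directories that are not yet processed.

    Assumes (hypothetically) one sub-directory per text pair.
    """
    return sorted(
        entry.name
        for entry in collection_dir.iterdir()
        if entry.is_dir() and entry.name not in processed
    )


def progress(completed: Set[str]) -> float:
    """Return the fraction of known stages that a text pair has completed."""
    return sum(stage in completed for stage in STAGES) / len(STAGES)
```

A scheduled job could call `find_new_text_pairs` on a checkout of the collection and start the automation for each returned name, while `progress` gives a per-pair completion ratio that a web app could display.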

Infrastructure

- All scripts run on GitHub Actions
- All scripts are written in Python
- All data is stored on GitHub

Testing

- [ ] test that OPFs are created for the text pair
- [ ] test that OPFs are added to the collection
- [ ] test that views are generated for the text pair
- [ ] test that views are copied to `/raw-alignments/input`
- [ ] test that `/raw-alignments/output` is exported to the MT repo
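As an illustration of how one of these checks might look, here is a minimal sketch for the "views are copied to `/raw-alignments/input`" item. The actual scripts are not shown in this RFC, so `copy_views_to_input` and `check_views_copied` are hypothetical helpers, and the flat directory of view files is an assumption.

```python
import shutil
from pathlib import Path
from typing import List


def copy_views_to_input(views_dir: Path, input_dir: Path) -> List[Path]:
    """Copy every file in views_dir into input_dir and return the new paths."""
    input_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for view in sorted(views_dir.iterdir()):
        if view.is_file():
            copied.append(Path(shutil.copy2(view, input_dir / view.name)))
    return copied


def check_views_copied(views_dir: Path, input_dir: Path) -> bool:
    """Return True when every view file has a byte-identical copy in input_dir."""
    return all(
        (input_dir / view.name).is_file()
        and (input_dir / view.name).read_bytes() == view.read_bytes()
        for view in views_dir.iterdir()
        if view.is_file()
    )
```

In a test, both functions could be run against temporary directories (for example pytest's `tmp_path` fixture) so that the check never touches the real repositories. The other items in the list could follow the same pattern: perform the step on a fixture, then assert on the resulting files.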

Documentation

*Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.*

Version History

*History of changes to this RFC. Following semantic versioning pattern and v0.1.2 for style.*

Recordings

*Links to audio recordings of related discussion.*

Work Phases

Planning

Keep original naming and structure, and keep as first section in Work phases section

Implementation

A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.

Completion

kaldan007 commented 1 year ago

Looks great to me