OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories
[RFC00150] Creation of Largest Tibetan Monolingual Corpus (Open source) #440

Open tenzin3 opened 5 months ago

tenzin3 commented 5 months ago

RFC00150: Creation of Largest Tibetan Monolingual Corpus (Open source)

Named Concepts

Monolingual Corpus: a corpus that contains texts in one language only.

Summary

This RFC proposes building the largest open-source Tibetan monolingual corpus. The work has two stages. First, download the Tibetan text files from the Amazon S3 buckets. Second, run filter scripts on vast.ai cloud compute to remove duplicate entries, run quality checks, and drop any non-Tibetan text.
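The filter stage described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual filter script: the function names `tibetan_ratio` and `filter_corpus`, the thresholds `min_ratio` and `min_chars`, and the use of exact-match hashing for deduplication are all assumptions made for the example. The length check is only a placeholder for the real quality checker listed under Dependencies, and the S3 download step is left out so the sketch needs only the standard library.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator

# Unicode Tibetan block (U+0F00 to U+0FFF).
TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF


def tibetan_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if TIBETAN_START <= ord(c) <= TIBETAN_END)
    return tibetan / len(chars)


def filter_corpus(
    lines: Iterable[str],
    min_ratio: float = 0.8,  # hypothetical threshold for "mostly Tibetan"
    min_chars: int = 10,     # hypothetical stand-in for a real quality check
) -> Iterator[str]:
    """Yield lines that are long enough, mostly Tibetan, and not seen before."""
    seen = set()
    for line in lines:
        text = line.strip()
        if len(text) < min_chars:
            continue  # placeholder quality check: drop very short lines
        if tibetan_ratio(text) < min_ratio:
            continue  # drop lines that are mostly non-Tibetan
        # Hash a normalized form so canonically equivalent lines count as duplicates.
        key = unicodedata.normalize("NFC", text)
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier line
        seen.add(digest)
        yield text


if __name__ == "__main__":
    sample = [
        "བཀྲ་ཤིས་བདེ་ལེགས།",
        "བཀྲ་ཤིས་བདེ་ལེགས།  ",
        "This is an English sentence.",
    ]
    for kept in filter_corpus(sample):
        print(kept)
```

Storing one hash per kept line keeps memory bounded by the number of unique lines rather than their total size. Near-duplicate detection (for example, lines that differ only in punctuation) would need a different technique and is not covered by this sketch.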

Dependencies

quality checker

Infrastructures

S3 bucket, vast.ai.

Design Illustrations

Filter

Testing

Describe the kind of testing procedures that are needed as part of fulfilling this request.

Implementation Steps

List all the steps involved during implementation.

Reviewed By

Who has reviewed the RFC?