RFC00150: Creation of the Largest Tibetan Monolingual Corpus (Open Source)

Named Concepts

Monolingual Corpus: a corpus that includes texts in one language only.

Summary

The goal is to create the largest open-source Tibetan monolingual corpus. First, download the Tibetan text files from the Amazon S3 buckets. Then run the filter scripts on vast.ai cloud compute. The filter scripts do three things: remove duplicate entries, run quality checks, and filter out non-Tibetan text.
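
As an illustration of the non-Tibetan filtering step, here is a minimal sketch that scores a piece of text by the share of its characters that fall in the Tibetan Unicode block (U+0F00 to U+0FFF). The function names and the 0.8 threshold are assumptions made for this example; they are not taken from the TibCleaner scripts, which may use a different approach.

```python
# Illustrative sketch only: names and threshold are hypothetical.
TIBETAN_START = 0x0F00  # first code point of the Tibetan Unicode block
TIBETAN_END = 0x0FFF    # last code point of the Tibetan Unicode block


def tibetan_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are Tibetan."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if TIBETAN_START <= ord(c) <= TIBETAN_END)
    return tibetan / len(chars)


def is_mostly_tibetan(text: str, threshold: float = 0.8) -> bool:
    """Keep text whose Tibetan share meets the (assumed) threshold."""
    return tibetan_ratio(text) >= threshold
```

A ratio rather than a strict all-or-nothing test leaves room for the occasional Latin digit or punctuation mark inside otherwise Tibetan text; the right threshold would need to be tuned on real data.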

Dependencies

quality checker

Infrastructures

S3 bucket, vast.ai.
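
The download step could look roughly like the sketch below, which uses the boto3 client to list every object under a prefix and copy it to local disk. The bucket name, prefix, and destination directory are left as parameters because the RFC does not name them, and the helper names `local_path_for` and `download_prefix` are hypothetical. It assumes boto3 is installed and AWS credentials are already configured on the machine (for example, on the vast.ai instance).

```python
from pathlib import Path


def local_path_for(key: str, dest_dir: str) -> Path:
    """Map an S3 object key to a path under dest_dir, keeping the key's folders."""
    return Path(dest_dir) / key.lstrip("/")


def download_prefix(bucket: str, prefix: str, dest_dir: str) -> int:
    """Download every object under `prefix` in `bucket`; return the file count."""
    import boto3  # third-party dependency, imported lazily

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    count = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip folder placeholder objects
            target = local_path_for(key, dest_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, key, str(target))
            count += 1
    return count
```

Using the paginator matters here because a single `list_objects_v2` call returns at most 1,000 keys, and a corpus of this size will likely exceed that.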

Design Illustrations

Filter
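
One possible shape for the filter stage is a streaming pipeline: normalize each line, drop exact duplicates by hash, then apply a list of check functions. This is a sketch under assumptions, not the TibCleaner design; `normalize`, `filter_lines`, and the check signature are hypothetical, and only exact duplicates (after whitespace normalization) are caught.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def normalize(line: str) -> str:
    """Collapse runs of whitespace and strip the ends."""
    return " ".join(line.split())


def filter_lines(
    lines: Iterable[str],
    checks: Iterable[Callable[[str], bool]],
) -> Iterator[str]:
    """Yield each distinct, non-empty line that passes every check."""
    checks = list(checks)
    seen = set()
    for line in lines:
        text = normalize(line)
        if not text:
            continue  # drop blank lines
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier line
        seen.add(digest)
        if all(check(text) for check in checks):
            yield text


if __name__ == "__main__":
    sample = ["བོད་ཡིག", "བོད་ཡིག", "", "x"]
    # Stand-in quality check: keep lines longer than one character.
    for kept in filter_lines(sample, [lambda t: len(t) > 1]):
        print(kept)
```

Storing 32-byte digests instead of the lines themselves keeps memory use bounded per entry, which helps when deduplicating a very large corpus on a single machine. Near-duplicate detection would need a different technique and is out of scope for this sketch.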

Testing

Describe the kind of testing procedures that are needed as part of fulfilling this request.

Implementation Steps

List all the steps involved during implementation.

[ ] OpenPecha/TibCleaner#2 Estimated time: 1 Actual time:
[ ] OpenPecha/TibCleaner#3 Estimated time: .5 Actual time:
[ ] OpenPecha/TibCleaner#4 Estimated time: .5 Actual time:

Reviewed By

Who has reviewed the RFC?

OpenPecha / Requests