RFC00150: Creation of the Largest Tibetan Monolingual Corpus (Open Source)
Named Concepts
Monolingual Corpus: a corpus that includes texts in one language only.
Summary
To create the largest open-source Tibetan monolingual corpus, first download the Tibetan text files from Amazon S3 buckets. Then run the filter scripts on vast.ai cloud instances. The filter scripts remove duplicate entries, run quality checks, and filter out non-Tibetan text.
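The filtering step could be sketched roughly as follows. This is a minimal illustration, not the project's actual filter scripts: the function names, the 0.8 Tibetan-character ratio, and the 10-character minimum are assumptions chosen for the example, and the length check only stands in for the real quality checker. Tibetan text is detected by testing whether characters fall in the Unicode Tibetan block (U+0F00 to U+0FFF), and duplicates are dropped by exact match after normalization.

```python
import hashlib
import unicodedata

# Unicode Tibetan block: U+0F00 through U+0FFF.
TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF


def tibetan_ratio(text):
    """Return the fraction of non-whitespace characters in the Tibetan block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if TIBETAN_START <= ord(c) <= TIBETAN_END)
    return tibetan / len(chars)


def filter_lines(lines, min_ratio=0.8, min_chars=10):
    """Keep lines that are long enough, mostly Tibetan, and not seen before.

    min_ratio and min_chars are illustrative thresholds (assumptions),
    not values taken from the RFC.
    """
    seen = set()
    kept = []
    for line in lines:
        text = unicodedata.normalize("NFC", line).strip()
        if len(text) < min_chars:
            continue  # placeholder quality check: drop very short lines
        if tibetan_ratio(text) < min_ratio:
            continue  # drop lines that are mostly non-Tibetan
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        kept.append(text)
    return kept
```

Hashing each line keeps memory use bounded by the number of unique lines rather than their total length. Near-duplicate detection would need a different technique and is not covered by this sketch.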
Dependencies
quality checker
Infrastructure
S3 bucket, vast.ai.
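As a rough sketch of the download step, the helper below pulls every object under a key prefix from an S3 bucket into a local directory. It assumes a boto3 S3 client is passed in. The function name `download_prefix` and the way keys are flattened into local file names are illustrative choices, not part of the RFC.

```python
import os


def download_prefix(s3_client, bucket, prefix, dest_dir):
    """Download every object under `prefix` into `dest_dir`; return local paths.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    os.makedirs(dest_dir, exist_ok=True)
    paginator = s3_client.get_paginator("list_objects_v2")
    local_paths = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip folder placeholder objects
            # Flatten the key into a single file name (illustrative choice).
            local_path = os.path.join(dest_dir, key.replace("/", "_"))
            s3_client.download_file(bucket, key, local_path)
            local_paths.append(local_path)
    return local_paths
```

With boto3 installed and credentials configured, a call might look like `download_prefix(boto3.client("s3"), "example-bucket", "tibetan/", "data")`, where the bucket name and prefix are hypothetical. The same script can run on a vast.ai instance so that the data lands next to the filter scripts.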
Design Illustrations
Testing
Describe the kind of testing procedures that are needed as part of fulfilling this request.
Implementation Steps
List all the steps involved during implementation.
Reviewed By
Who has reviewed the RFC?