OCR0018: Train a wylie decoder model

ta4tsering commented 4 months ago

Description: To test TrOCR with two different text input formats, Unicode and Wylie, we need a decoder model for each. For Tibetan Unicode we have the SangeyDhongdrup RoBERTa model, but for Wylie we don't have a decoder model, so we need to create one in order to compare TrOCR performance with both decoder models.

Implementation: Phase 1: new Wylie tokenizer workflow

Phase 2: training a new language model for Wylie

Subtasks: Phase 1

[x] download the data from sp's spsither/tibetan_monolingual_A_filtered_deduped to vast.ai
[x] write script to train new tokenizer for wylie
[x] train the tokenizer
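
For readers unfamiliar with what "training a tokenizer" involves, here is a minimal pure-Python sketch of byte-pair-encoding (BPE) merge learning on whitespace-split Wylie text. It is an illustration only, not the script used in this issue: the thread does not say which tokenizer algorithm or library was used, and the function names (`train_bpe`, `merge_pair`, `encode`) and the tiny sample corpus are hypothetical. A real run would more likely use an established tokenizer library on the full dataset.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(lines, num_merges):
    """Learn up to `num_merges` BPE merge rules from an iterable of text lines."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    word_freqs = Counter()
    for line in lines:
        for word in line.split():
            word_freqs[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        # Pick the most frequent pair; ties go to the lexicographically largest
        # pair so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)

        # Apply the new merge to every word in the working vocabulary.
        updated = Counter()
        for symbols, freq in word_freqs.items():
            updated[merge_pair(symbols, best)] += freq
        word_freqs = updated
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical two-line Wylie corpus, for illustration only.
sample_corpus = ["bkra shis bde legs", "bkra shis"]
sample_merges = train_bpe(sample_corpus, num_merges=5)
print(sample_merges)
print(encode("bkra", sample_merges))
```

Because Wylie is plain ASCII, character-level initialization like this works without any byte-level handling; whether to split on spaces, on the shad (`/`), or on something else is a design choice the sketch does not settle.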

Phase 2

Completion Criteria: A Wylie Tibetan language model that can be used as the decoder model.

ta4tsering commented 4 months ago

@spsither can you review this for the Wylie tokenizer pipeline?

ta4tsering commented 4 months ago

Done with the scripts, but every time I run the notebook in SageMaker, I am not able to open it the next morning, so I have had to rerun it twice. I am going to talk with SP to iron out the issue.

ta4tsering commented 4 months ago

Figuring out Hugging Face: I was able to test uploading data to Hugging Face in Parquet format from a notebook (ta4tsering/test).

OpenPecha / wylie-tokenizer

OCR0018: Train a wylie decoder model #1