OpenPecha / wylie-tokenizer

MIT License

OCR0018: Train a wylie decoder model #1

Open ta4tsering opened 4 months ago

ta4tsering commented 4 months ago

Description: Test TrOCR with two different text input formats, Unicode and Wylie. TrOCR needs a decoder model. For Tibetan Unicode we have the SangeyDhongdrup RoBERTa model, but for Wylie we don't have a decoder model, so we need to create one in order to compare TrOCR performance with both decoder models.

Implementation: Phase - 1 new Wylie tokenizer workflow

[Image: Wylie tokenizer workflow]
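
For readers without access to the image, here is a rough sketch of what a tokenizer-training step could look like. It is not the workflow from the image: the issue does not say which algorithm is used, so plain byte-pair encoding (BPE) is an assumption, and the names `train_bpe`, `merge_pair` and `tokenize` are hypothetical. It relies only on the fact that Wylie syllables are separated by spaces.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of Wylie strings."""
    # Split each line into syllables, and each syllable into single characters.
    words = Counter()
    for line in corpus:
        for syllable in line.split():
            words[tuple(syllable)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by syllable frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def tokenize(syllable, merges):
    """Apply the learned merges, in training order, to one Wylie syllable."""
    symbols = tuple(syllable)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Calling `train_bpe(["bkra shis bde legs", "bkra shis"], 3)` and passing the result to `tokenize` splits a syllable into learned subword units. A real pipeline would more likely use an existing tokenizer library and a much larger corpus.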

Phase - 2 training a new language model for Wylie

subtasks: Phase - 1

Completion Criteria: A Wylie Tibetan language model that can be used as the decoder model.

ta4tsering commented 4 months ago

@spsither can you review this for the Wylie tokenizer pipeline?

ta4tsering commented 4 months ago

Done with the scripts and all, but every time I run the notebook in SageMaker, I am not able to open it the next morning, so I had to rerun it two times. I am going to talk with SP to iron out the issue.

ta4tsering commented 4 months ago

Figuring out Hugging Face. I was able to test uploading data to Hugging Face in Parquet format through a notebook: ta4tsering/test