Open ta4tsering opened 4 months ago
@spsither can you review this for the Wylie tokenizer pipeline?
Done with the scripts, but every time I run the notebook in SageMaker, I am not able to open it the next morning, so I have had to rerun it twice. I am going to talk with SP to iron out the issue.
Figuring out Hugging Face: I was able to test uploading data to Hugging Face in Parquet format through a notebook (ta4tsering/test).
Description: Test TrOCR with two different text input formats, Unicode and Wylie. TrOCR needs a decoder model. For Tibetan Unicode we have the SangeyDhongdrup RoBERTa model, but for Wylie we don't have a decoder model, so we need to create one in order to compare TrOCR performance with both decoder models.
Implementation: Phase 1: new Wylie tokenizer workflow
Phase 2: training a new language model for Wylie
Subtasks: Phase 1
spsither/tibetan_monolingual_A_filtered_deduped to vast.ai
[x] train the tokenizer
Phase 2
Completion Criteria: A Wylie Tibetan language model that can be used as the decoder model.
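
For reference, here is a rough, pure-Python sketch of what the Phase 1 "train the tokenizer" step does conceptually, assuming a byte-pair-encoding (BPE) style tokenizer over space-separated Wylie syllables. This is only an illustration, not the actual pipeline: the issue does not say which algorithm or library is used, and the function names (`train_bpe`, `merge_pair`, `tokenize`) and the tiny sample corpus are hypothetical. A real run would more likely use a tokenizer library on the full dataset.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from an iterable of Wylie lines."""
    # Each space-separated Wylie syllable starts as a tuple of single characters.
    words = Counter()
    for line in corpus:
        for syllable in line.split():
            words[tuple(syllable)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the syllable occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every syllable is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize(text, merges):
    """Split Wylie text into subword tokens by replaying the learned merges in order."""
    tokens = []
    for syllable in text.split():
        symbols = tuple(syllable)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


if __name__ == "__main__":
    # Hypothetical sample lines; the real corpus would be the Wylie-converted dataset.
    sample_corpus = ["bkra shis bde legs", "bkra shis", "bde ba"]
    merges = train_bpe(sample_corpus, num_merges=50)
    print(tokenize("bkra shis", merges))
```

The number of merges plays the role of the vocabulary-size setting: fewer merges leave syllables split into smaller pieces, more merges produce whole-syllable tokens for frequent syllables.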