spsither opened this issue 6 months ago
Trained two tokenizers: one on the 15GB Gold data and one on the 45GB A data.
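For reference, a minimal sketch of how such a tokenizer can be trained with the Hugging Face `tokenizers` library; the vocabulary size, file path, and pre-tokenizer choice here are placeholders, not the settings actually used.

```python
# Hypothetical sketch: training a BPE tokenizer on a Tibetan corpus with the
# `tokenizers` library. Vocab size, file path, and the whitespace pre-tokenizer
# are illustrative placeholders (Tibetan segmentation may call for a different
# pre-tokenization strategy).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.train(files=["gold_15gb.txt"], trainer=trainer)
tokenizer.save("tokenizer_gold.json")
```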
Filtering the A data with this function and doing some regex cleaning:

```python
def max_char_btw_tsak(example):
    """Return the longest syllable length, i.e. the maximum number of characters
    between tsaks ('་', the Tibetan syllable delimiter)."""
    segments = example.split('་')
    return max((len(segment) for segment in segments if segment), default=0)
```
Lopa says བསྒྲིགས། is the longest valid syllable between tsaks, so 8 characters is used as the filtering threshold.
Filtered the 45GB A data using the following conditions, where `max_char_btw_tsak` is the maximum syllable length in a sentence and `char_len` is the character length of the sentence:

- `max_char_btw_tsak` > 1
- `max_char_btw_tsak` < 9
- `char_len` > 15
- `char_len` < 1000

A with the `max_char_btw_tsak` and `char_len` metadata is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_meta
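A minimal sketch of applying these thresholds with the `datasets` library, assuming the metadata dataset above exposes columns named exactly `max_char_btw_tsak` and `char_len`:

```python
# Minimal sketch: applying the four thresholds to the metadata dataset with the
# `datasets` library. Assumes the columns are named max_char_btw_tsak and char_len.
from datasets import load_dataset

ds = load_dataset("spsither/tibetan_monolingual_A_meta")

filtered = ds.filter(
    lambda x: 1 < x["max_char_btw_tsak"] < 9 and 15 < x["char_len"] < 1000,
    num_proc=8,
)
```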
Filtered A is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_filtered
Filtered and sentence-deduplicated A is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_filtered_deduped
| Action | Train size | Test size | % drop (of total rows) |
|---|---|---|---|
| Original | 335,968,205 | 18,500,104 | 0 |
| Filtered | 293,280,837 | 16,162,978 | 12.7 |
| Dedupe + Filtered | 93,214,402 | 11,561,844 | 70.4 |
45GB -> 18GB after filtering and dedupe.
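A minimal sketch of exact sentence-level deduplication; the `sentence` column name is an assumption, and this is only one possible approach (near-duplicate detection, e.g. MinHash, would need more machinery).

```python
# Hypothetical sketch of exact sentence-level deduplication with the `datasets`
# library. The "sentence" column name is an assumption.
from datasets import load_dataset

ds = load_dataset("spsither/tibetan_monolingual_A_filtered", split="train")

seen = set()

def first_occurrence(example):
    key = example["sentence"]
    if key in seen:
        return False
    seen.add(key)
    return True

# Single-process so all examples share the same `seen` set.
deduped = ds.filter(first_occurrence)
```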
The data is still not fully clean; there are still examples with bad tokens.
Started training RoBERTa Large on the 18GB A_filtered_deduped data on Apr 26th at 4:50 PM, using an ml.g5.4xlarge instance.
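For context, a minimal sketch of what a RoBERTa-large masked-language-model pretraining setup looks like with `transformers`; the tokenizer file, the `sentence` column name, and all hyperparameters are placeholders, not the actual training configuration.

```python
# Hypothetical sketch of RoBERTa-large MLM pretraining with `transformers`.
# Tokenizer file, "sentence" column name, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling, PreTrainedTokenizerFast, RobertaConfig,
    RobertaForMaskedLM, Trainer, TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer_gold.json",
    bos_token="<s>", eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>", mask_token="<mask>",
)

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=1024, num_hidden_layers=24, num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=514,  # 512 tokens + 2 for RoBERTa's padding offset
)
model = RobertaForMaskedLM(config)

ds = load_dataset("spsither/tibetan_monolingual_A_filtered_deduped")
tokenized = ds.map(
    lambda x: tokenizer(x["sentence"], truncation=True, max_length=512),
    batched=True, remove_columns=ds["train"].column_names,
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="roberta-large-bo",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    fp16=True,
)
Trainer(
    model=model, args=args, data_collator=collator,
    train_dataset=tokenized["train"],
).train()
```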
Most of the text data we have is about Buddhism; it might be good to use these texts for training.
I used Botok to filter out sentences with any bad tokens. We now have 16GB of text data. The dataset is here. This dataset consists of the following
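A rough sketch of the Botok-based check; the exact criterion for a "bad token" (here, Botok tagging a token as a non-word) and the token attribute used are assumptions and may differ from the check actually used.

```python
# Hypothetical sketch of dropping sentences that contain tokens Botok cannot
# recognise as words. The bad-token criterion and the token attribute used
# here are assumptions.
from botok import WordTokenizer

wt = WordTokenizer()

def has_bad_token(sentence):
    tokens = wt.tokenize(sentence)
    # Assumption: a token Botok tags as a non-word counts as "bad".
    return any(getattr(t, "pos", None) == "NON_WORD" for t in tokens)

sentences = ["བཀྲ་ཤིས་བདེ་ལེགས།"]
clean = [s for s in sentences if not has_bad_token(s)]
```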
The default Trainer API uses DP instead of DDP. I am reading up on Accelerate so we can use DDP or more advanced techniques.
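A minimal sketch of the Accelerate training-loop pattern, which runs under DDP when started with `accelerate launch`; the model, data, and optimizer here are toy placeholders.

```python
# Hypothetical sketch of a training loop using Hugging Face Accelerate; launched
# with `accelerate launch train.py`, the same code runs under DDP across GPUs.
# Model, data, and optimizer are toy placeholders.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

# prepare() wraps everything for the current distributed setup (DDP, etc.).
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```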
Description
Align segment embeddings and transcript embeddings. Use a similarity threshold for quality checks.
Completion Criteria
Push the pair of models to HF.
Implementation Plan
Pre-train a BERT-style encoder model for Tibetan. Fine-tune wav2vec2 and the BERT-style model to align the embedding vectors.
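A minimal sketch of the similarity-threshold quality check, assuming both encoders are pooled to fixed-size embeddings in a shared space after fine-tuning; the 768-dim size and the 0.8 threshold are placeholders.

```python
# Hypothetical sketch of the similarity-based quality check: compare the pooled
# wav2vec2 embedding of an audio segment with the pooled text-encoder embedding
# of its transcript. The 768-dim size and 0.8 threshold are placeholders.
import torch
import torch.nn.functional as F

def passes_quality_check(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                         threshold: float = 0.8) -> bool:
    """audio_emb / text_emb: (hidden_dim,) pooled embeddings from the two encoders."""
    sim = F.cosine_similarity(audio_emb.unsqueeze(0), text_emb.unsqueeze(0)).item()
    return sim >= threshold

# Toy usage with random vectors standing in for real pooled model outputs.
print(passes_quality_check(torch.randn(768), torch.randn(768)))
```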
Subtasks