OpenPecha / RoBERTa

MIT License

STT0021: Quality check with Vecalign for Audio-Transcript #1

Open · spsither opened this issue 6 months ago

spsither commented 6 months ago

Description

Align audio segment embeddings and transcript embeddings. Use a similarity threshold for quality checks.
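
As a concrete sketch of that check (the cosine metric follows the description above, while the function name and the 0.8 threshold are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Hypothetical helper: cosine similarity between one audio segment embedding
# and its transcript embedding, thresholded for pass/fail. 0.8 is assumed.
def passes_quality_check(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                         threshold: float = 0.8) -> bool:
    sim = F.cosine_similarity(audio_emb, text_emb, dim=-1).item()
    return sim >= threshold
```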

Completion Criteria

Push the pair of models to HF.


Implementation Plan

Pre-train a BERT-style encoder model for Tibetan. Fine-tune wav2vec2 and BERT-style model to align the embedding vectors.
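
A common way to realize that alignment step is a CLIP-style symmetric contrastive loss over paired (audio, text) embeddings; the sketch below illustrates the idea rather than the confirmed recipe, and the 0.07 temperature is an assumption:

```python
import torch
import torch.nn.functional as F

# Rows of audio_emb and text_emb (batch, dim) are matched wav2vec2/BERT pairs.
def alignment_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # pairwise similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Each audio row should best match its own transcript row, and vice versa.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```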

Subtasks

spsither commented 6 months ago

Reference blog and code.

spsither commented 6 months ago

Trained two tokenizers: one on the 15 GB Gold data and one on the 45 GB A data.
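
For reference, a minimal sketch of training two byte-level BPE tokenizers with the Hugging Face tokenizers library; the file paths, output directories, and vocab size are assumptions, not the actual configuration:

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Hypothetical corpus files and output dirs for the Gold and A tokenizers.
for out_dir, files in [("tokenizer_gold", ["gold.txt"]),
                       ("tokenizer_A", ["a_data.txt"])]:
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=files, vocab_size=52_000, min_frequency=2,
                    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
    os.makedirs(out_dir, exist_ok=True)
    tokenizer.save_model(out_dir)  # writes vocab.json and merges.txt
```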

spsither commented 6 months ago

Filtering the A data with the function below and doing some regex cleaning:

def max_char_btw_tsak(example):
    # Split the sentence on the tsak (syllable separator) '་'.
    segments = example.split('་')
    # Longest syllable, in code points; default=0 guards against empty input.
    max_length = max((len(segment) for segment in segments if segment), default=0)
    return max_length
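
For example (illustrative sentence; prints the longest syllable length between tsaks, in code points):

```python
print(max_char_btw_tsak("བཀྲ་ཤིས་བདེ་ལེགས"))
```
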
spsither commented 6 months ago

Lopa says བསྒྲིགས། has the maximum valid number of characters between tsaks.

Taking 8 as the filtering threshold.

spsither commented 6 months ago

Filtered the 45 GB A data using the following conditions: `max_char_btw_tsak` is the max syllable length in a sentence, and `char_len` is the character length of the sentence.
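
A sketch of that pass with the datasets library (the metadata columns match the names above; the split and the exact `char_len` bound are assumptions):

```python
from datasets import load_dataset

ds = load_dataset("spsither/tibetan_monolingual_A_meta", split="train")

def keep(example):
    # Drop sentences with an over-long syllable or an empty body.
    return example["max_char_btw_tsak"] <= 8 and example["char_len"] > 0

ds_filtered = ds.filter(keep)
```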

spsither commented 6 months ago

This dataset has A with `max_char_btw_tsak` and `char_len` metadata: https://huggingface.co/datasets/spsither/tibetan_monolingual_A_meta

Filtered A is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_filtered

Filtered and sentence-level deduped A is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_filtered_deduped
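
Sentence-level dedup can be sketched as a seen-set filter over the filtered data; the `sentence` column name is an assumption, and it must run single-process so all rows share the same set:

```python
seen = set()

def is_new(example):
    # Keep only the first occurrence of each sentence string.
    if example["sentence"] in seen:
        return False
    seen.add(example["sentence"])
    return True

# num_proc must stay 1 so every row sees the same `seen` set.
ds_deduped = ds_filtered.filter(is_new)
```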

spsither commented 6 months ago

| Action | Train size | Test size | % drop |
| --- | --- | --- | --- |
| Original | 335968205 | 18500104 | 0 |
| Filtered | 293280837 | 16162978 | 12.7 |
| Dedupe + Filtered | 93214402 | 11561844 | 70.4 |

45 GB -> 18 GB

spsither commented 6 months ago

The data is still not clean. Bad token examples:

spsither commented 6 months ago

Started training RoBERTa Large on the 18 GB A_filtered_deduped data on Apr 26th at 4:50 PM, using ml.g5.4xlarge.
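
The run presumably follows the standard masked-LM pretraining pattern; below is a rough sketch where the tokenizer path, hyperparameters, and a pre-tokenized `tokenized_train` dataset are all assumptions rather than the actual configuration:

```python
from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                          RobertaForMaskedLM, RobertaTokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer_A")  # assumed path
config = RobertaConfig(vocab_size=tokenizer.vocab_size, hidden_size=1024,
                       num_hidden_layers=24, num_attention_heads=16,
                       intermediate_size=4096,
                       max_position_embeddings=514)  # roberta-large dimensions
model = RobertaForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="tibetan_roberta_large",
                         per_device_train_batch_size=16,
                         num_train_epochs=6, save_strategy="epoch")
trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=tokenized_train)  # assumed pre-tokenized dataset
trainer.train()
```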


spsither commented 6 months ago

Most of the text data we have is about Buddhism; it might be good to use these corpora for training:

https://huggingface.co/datasets/oscar-corpus/OSCAR-2201

https://huggingface.co/datasets/oscar-corpus/OSCAR-2301

kaldan007 commented 6 months ago

https://zenodo.org/records/3951503

spsither commented 6 months ago

Word segmentation data from Marieke is here.

spsither commented 6 months ago

I used Botok to filter out sentences with any bad tokens. We now have 16 GB of text data. The dataset is here. It consists of the following:
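
A sketch of that Botok pass; I am assuming botok marks out-of-dictionary tokens with pos == 'NON_WORD', which should be checked against the installed version:

```python
from botok import WordTokenizer

wt = WordTokenizer()

def has_bad_token(sentence: str) -> bool:
    # Assumption: botok tags out-of-dictionary tokens with pos 'NON_WORD'.
    return any(t.pos == "NON_WORD" for t in wt.tokenize(sentence))

sentences = ["བཀྲ་ཤིས་བདེ་ལེགས།"]  # stand-in for the real corpus
clean = [s for s in sentences if not has_bad_token(s)]
```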

spsither commented 6 months ago

https://huggingface.co/spsither/tibetan_RoBERTa_S_e2

spsither commented 6 months ago

Refer to this paper.

spsither commented 5 months ago

Model at epoch 6 is here.

spsither commented 5 months ago

The default Trainer API uses DP instead of DDP. I am reading up on accelerate so we can use DDP or more advanced techniques.
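
For reference, the core accelerate pattern looks like the toy loop below; run it with `accelerate launch train.py` to get one process per GPU, i.e. DDP. The model and data here are stand-ins for the actual RoBERTa run:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 2)                     # toy stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 10),
                                  torch.randint(0, 2, (64,))), batch_size=8)

# prepare() wraps the model for DDP and shards the dataloader per process.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for x, y in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward() so gradients sync
    optimizer.step()
```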

spsither commented 5 months ago

Referencing the t5, bert, and roberta examples for using accelerate.