spsither opened this issue 6 months ago
Trained two tokenizers: one on the 15GB Gold data and one on the 45GB A data.
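For reference, a minimal sketch of how such a tokenizer can be trained with the Hugging Face `tokenizers` library; the vocabulary size, file path, and pre-tokenizer choice here are placeholders, not the settings actually used.

```python
# Hypothetical sketch: training a BPE tokenizer on a Tibetan corpus with the
# `tokenizers` library. Vocab size, file path, and the whitespace pre-tokenizer
# are illustrative placeholders (Tibetan segmentation may call for a different
# pre-tokenization strategy).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.train(files=["gold_15gb.txt"], trainer=trainer)
tokenizer.save("tokenizer_gold.json")
```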
Filtering the A data with this function and doing some regex cleaning:

```python
def max_char_btw_tsak(example):
    """Return the longest syllable length, i.e. the maximum number of characters
    between tsaks ('་', the Tibetan syllable delimiter)."""
    segments = example.split('་')
    return max((len(segment) for segment in segments if segment), default=0)
```
Lopa says བསྒྲིགས། is the longest valid syllable between tsaks, so 8 characters is used as the filtering threshold.
Filtered the 45GB A data using the following conditions, where `max_char_btw_tsak` is the maximum syllable length in a sentence and `char_len` is the character length of the sentence:

- `max_char_btw_tsak` > 1
- `max_char_btw_tsak` < 9
- `char_len` > 15
- `char_len` < 1000

A with the `max_char_btw_tsak` and `char_len` metadata is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_meta
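A minimal sketch of applying these thresholds with the `datasets` library, assuming the metadata dataset above exposes columns named exactly `max_char_btw_tsak` and `char_len`:

```python
# Minimal sketch: applying the four thresholds to the metadata dataset with the
# `datasets` library. Assumes the columns are named max_char_btw_tsak and char_len.
from datasets import load_dataset

ds = load_dataset("spsither/tibetan_monolingual_A_meta")

filtered = ds.filter(
    lambda x: 1 < x["max_char_btw_tsak"] < 9 and 15 < x["char_len"] < 1000,
    num_proc=8,
)
```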
Filtered A is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_filtered
Filtered and sentence-deduplicated A is at https://huggingface.co/datasets/spsither/tibetan_monolingual_A_filtered_deduped
| Action | Train size | Test size | % drop (of total rows) |
|---|---|---|---|
| Original | 335,968,205 | 18,500,104 | 0 |
| Filtered | 293,280,837 | 16,162,978 | 12.7 |
| Dedupe + Filtered | 93,214,402 | 11,561,844 | 70.4 |
45GB -> 18GB after filtering and dedupe.
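A minimal sketch of exact sentence-level deduplication; the `sentence` column name is an assumption, and this is only one possible approach (near-duplicate detection, e.g. MinHash, would need more machinery).

```python
# Hypothetical sketch of exact sentence-level deduplication with the `datasets`
# library. The "sentence" column name is an assumption.
from datasets import load_dataset

ds = load_dataset("spsither/tibetan_monolingual_A_filtered", split="train")

seen = set()

def first_occurrence(example):
    key = example["sentence"]
    if key in seen:
        return False
    seen.add(key)
    return True

# Single-process so all examples share the same `seen` set.
deduped = ds.filter(first_occurrence)
```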
The data is still not fully clean; there are still examples with bad tokens.
Started training RoBERTa Large on the 18GB A_filtered_deduped data on Apr 26th at 4:50 PM, using an ml.g5.4xlarge instance.
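For context, a minimal sketch of what a RoBERTa-large masked-language-model pretraining setup looks like with `transformers`; the tokenizer file, the `sentence` column name, and all hyperparameters are placeholders, not the actual training configuration.

```python
# Hypothetical sketch of RoBERTa-large MLM pretraining with `transformers`.
# Tokenizer file, "sentence" column name, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling, PreTrainedTokenizerFast, RobertaConfig,
    RobertaForMaskedLM, Trainer, TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer_gold.json",
    bos_token="<s>", eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>", mask_token="<mask>",
)

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=1024, num_hidden_layers=24, num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=514,  # 512 tokens + 2 for RoBERTa's padding offset
)
model = RobertaForMaskedLM(config)

ds = load_dataset("spsither/tibetan_monolingual_A_filtered_deduped")
tokenized = ds.map(
    lambda x: tokenizer(x["sentence"], truncation=True, max_length=512),
    batched=True, remove_columns=ds["train"].column_names,
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="roberta-large-bo",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    fp16=True,
)
Trainer(
    model=model, args=args, data_collator=collator,
    train_dataset=tokenized["train"],
).train()
```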
Most of the text data we have is about Buddhism; it might be good to use these texts for training.
I used Botok to filter out sentences with any bad tokens. We now have 16GB of text data. The dataset is here. This dataset consists of the following
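A rough sketch of the Botok-based check; the exact criterion for a "bad token" (here, Botok tagging a token as a non-word) and the token attribute used are assumptions and may differ from the check actually used.

```python
# Hypothetical sketch of dropping sentences that contain tokens Botok cannot
# recognise as words. The bad-token criterion and the token attribute used
# here are assumptions.
from botok import WordTokenizer

wt = WordTokenizer()

def has_bad_token(sentence):
    tokens = wt.tokenize(sentence)
    # Assumption: a token Botok tags as a non-word counts as "bad".
    return any(getattr(t, "pos", None) == "NON_WORD" for t in tokens)

sentences = ["བཀྲ་ཤིས་བདེ་ལེགས།"]
clean = [s for s in sentences if not has_bad_token(s)]
```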
The default Trainer API uses DP instead of DDP. I am reading up on Accelerate so we can use DDP or more advanced techniques.
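A minimal sketch of the Accelerate training-loop pattern, which runs under DDP when started with `accelerate launch`; the model, data, and optimizer here are toy placeholders.

```python
# Hypothetical sketch of a training loop using Hugging Face Accelerate; launched
# with `accelerate launch train.py`, the same code runs under DDP across GPUs.
# Model, data, and optimizer are toy placeholders.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

# prepare() wraps everything for the current distributed setup (DDP, etc.).
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```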
Description
Align segment embeddings and transcript embeddings. Use a similarity threshold for quality checks.
Completion Criteria
Push the pair of models to HF.
Implementation Plan
Pre-train a BERT-style encoder model for Tibetan. Fine-tune wav2vec2 and the BERT-style model to align the embedding vectors.
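A minimal sketch of the similarity-threshold quality check, assuming both encoders are pooled to fixed-size embeddings in a shared space after fine-tuning; the 768-dim size and the 0.8 threshold are placeholders.

```python
# Hypothetical sketch of the similarity-based quality check: compare the pooled
# wav2vec2 embedding of an audio segment with the pooled text-encoder embedding
# of its transcript. The 768-dim size and 0.8 threshold are placeholders.
import torch
import torch.nn.functional as F

def passes_quality_check(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                         threshold: float = 0.8) -> bool:
    """audio_emb / text_emb: (hidden_dim,) pooled embeddings from the two encoders."""
    sim = F.cosine_similarity(audio_emb.unsqueeze(0), text_emb.unsqueeze(0)).item()
    return sim >= threshold

# Toy usage with random vectors standing in for real pooled model outputs.
print(passes_quality_check(torch.randn(768), torch.randn(768)))
```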
Subtasks