OCR0024: Create stacks list and frequency script

ta4tsering commented 6 months ago

Description: We need a new way to produce the list of stacks and their frequencies.

Reference:
https://github.com/OpenPecha/Botok/blob/master/botok/utils/unicode_normalization.py#L101
https://github.com/OpenPecha/Botok/blob/master/botok/utils/lenient_normalization.py#L253

The best workflow would be to take the corpus, apply these two functions, and then tokenize it into stacks with https://github.com/OpenPecha/Botok/blob/master/botok/tokenizers/stacktokenizer.py, so that every stack frequency list is produced by the same workflow.
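The normalize, tokenize, count workflow described above could look roughly like the sketch below. It deliberately does not call Botok: `default_normalize` (plain NFC) and `split_into_stacks` (attach every combining mark to the preceding character) are hypothetical stand-ins for the two normalization functions and the stack tokenizer linked above, and would be swapped for the real ones in the actual script. The function names and the choice to skip punctuation are assumptions, not what the repo does.

```python
import unicodedata
from collections import Counter
from typing import Callable, Iterable, List


def default_normalize(text: str) -> str:
    # Placeholder only: the issue asks for Botok's two normalization
    # functions here; NFC is a stand-in so the sketch runs on its own.
    return unicodedata.normalize("NFC", text)


def split_into_stacks(text: str) -> List[str]:
    # Simplified stand-in for a stack tokenizer: every combining mark
    # (Unicode category M*, which covers Tibetan subjoined letters and
    # vowel signs) is attached to the character before it.
    stacks: List[str] = []
    for ch in text:
        if stacks and unicodedata.category(ch).startswith("M"):
            stacks[-1] += ch
        else:
            stacks.append(ch)
    return stacks


def count_stacks(
    lines: Iterable[str],
    normalize: Callable[[str], str] = default_normalize,
    tokenize: Callable[[str], List[str]] = split_into_stacks,
) -> Counter:
    counts: Counter = Counter()
    for line in lines:
        for stack in tokenize(normalize(line)):
            first = stack[0]
            # Assumption: only stacks that start with a Tibetan-block
            # letter or mark are counted; tsheg, shad, spaces, digits
            # and non-Tibetan characters are skipped.
            in_tibetan_block = "\u0f00" <= first <= "\u0fff"
            if in_tibetan_block and unicodedata.category(first)[0] in "LM":
                counts[stack] += 1
    return counts


if __name__ == "__main__":
    sample = ["\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66\u0f0b"]
    for stack, freq in count_stacks(sample).most_common():
        print(stack, freq)
```

Passing the normalizer and tokenizer in as arguments keeps the counting step identical for every corpus, which is the point of having one shared workflow.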

some constraints to implement are:

vowels can only follow consonants, vowels or subscripts (not space, shad, symbols, etc.)
subscripts can only follow consonants
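The two constraints above can be expressed as one regex that searches for a violation. This is a minimal sketch, not the filter used in the repo. The character ranges are assumptions: consonants U+0F40 to U+0F6C, subjoined letters U+0F90 to U+0FBC, vowel signs U+0F71 to U+0F7D plus U+0F80 and U+0F81. The issue does not say whether a subscript may follow another subscript, as in multi-level stacks, so the hypothetical `allow_stacked_subscripts` flag makes that choice explicit; the default follows the rule exactly as written.

```python
import re

# Assumed character ranges; adjust to the script's actual definitions.
CONSONANT = "\u0f40-\u0f6c"
SUBSCRIPT = "\u0f90-\u0fbc"
VOWEL = "\u0f71-\u0f7d\u0f80\u0f81"


def build_invalid_pattern(allow_stacked_subscripts: bool) -> "re.Pattern[str]":
    # Characters that a subscript is allowed to follow.
    before_subscript = CONSONANT + (SUBSCRIPT if allow_stacked_subscripts else "")
    return re.compile(
        # A vowel that does NOT follow a consonant, vowel or subscript
        # (this also catches a vowel at the very start of the string).
        f"(?<![{CONSONANT}{VOWEL}{SUBSCRIPT}])[{VOWEL}]"
        # A subscript that does NOT follow an allowed character.
        f"|(?<![{before_subscript}])[{SUBSCRIPT}]"
    )


_STRICT = build_invalid_pattern(allow_stacked_subscripts=False)
_LENIENT = build_invalid_pattern(allow_stacked_subscripts=True)


def is_valid_stack(stack: str, allow_stacked_subscripts: bool = False) -> bool:
    # A stack is valid when no violation of either rule is found in it.
    pattern = _LENIENT if allow_stacked_subscripts else _STRICT
    return pattern.search(stack) is None
```

With the strict default, a stack with two subjoined letters in a row is rejected, so the lenient setting is probably what a real corpus needs; that choice should be checked against the data.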

subtasks:

[x] create the script
[x] test the script
[x] run it on the aws server
[x] apply regex rules to filter out invalid or otherwise odd stacks

Completion Criteria: A list of stacks and their frequencies for the Norbuketaka and Google Books data.

ta4tsering commented 6 months ago

These are the stacks and their frequencies for the Norbuketaka and Google Books transcripts after normalizing and then tokenizing into stacks. As the CSVs show, even though Norbuketaka has more books and about 3M lines, it contains only about 564 unique stacks, whereas Google Books, with only 100 books and 700K lines, contains about 1588 unique stacks, some of which appear only once.

google_books.csv norbunorbuketaka.csv

ta4tsering commented 5 months ago

Updated the script to filter out invalid stacks and uploaded the repo, including the frequency files.

ta4tsering commented 5 months ago

Writing tests for the modules in the repo.

OpenPecha / Tibetan_stacks_and_frequency

OCR0024: Create stacks list and frequency script #2