ta4tsering opened 6 months ago
These are the stacks and their frequencies for the Norbuketaka and Google Books transcripts, after normalizing and then tokenizing into stacks. As the CSV shows, even though Norbuketaka has far more books (about 3M lines), it contains only about 564 unique stacks, while Google Books, with only 100 books and about 700K lines, has about 1,588 unique stacks, some of which appear only once.
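For context, here is a minimal sketch of how such a frequency CSV could be produced from already-tokenized stacks (the tokenization step itself is sketched further below); the function name, the `stacks` input, and the output path are illustrative assumptions, not the actual script:

```python
import csv
from collections import Counter

def write_stack_frequencies(stacks, out_path):
    """Count how often each stack occurs and write a stack,frequency CSV."""
    counts = Counter(stacks)
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["stack", "frequency"])
        # most_common() yields stacks sorted from most to least frequent
        for stack, freq in counts.most_common():
            writer.writerow([stack, freq])

# e.g. write_stack_frequencies(["ཀ", "ཀྱ", "ཀ"], "norbuketaka_stack_freq.csv")
```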
Updated the script to filter out invalid stacks and pushed the frequency files to the repo.
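A hedged sketch of what "filter out invalid stacks" could look like; the actual validity rule used in the script isn't shown here, so this version only assumes that a valid stack is non-empty and consists entirely of characters from the Tibetan Unicode block (U+0F00 to U+0FFF):

```python
# Assumed validity rule: every character must be in the Tibetan Unicode block.
TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF

def is_valid_stack(stack):
    """Return True if the stack is non-empty and purely Tibetan-block characters."""
    return bool(stack) and all(TIBETAN_START <= ord(c) <= TIBETAN_END for c in stack)

def filter_stacks(stacks):
    """Keep only the stacks that pass the validity check."""
    return [s for s in stacks if is_valid_stack(s)]
```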
Writing tests for the modules of the repo.
Description: There should be a new way to generate the list of stacks and their frequencies.
Reference:
- https://github.com/OpenPecha/Botok/blob/master/botok/utils/unicode_normalization.py#L101
- https://github.com/OpenPecha/Botok/blob/master/botok/utils/lenient_normalization.py#L253

The best workflow would be to take the corpus, apply these two functions, and then tokenize into stacks with https://github.com/OpenPecha/Botok/blob/master/botok/tokenizers/stacktokenizer.py, so that we have a single shared workflow for producing stack frequency lists (a minimal sketch follows below).
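A minimal sketch of that workflow under stated assumptions: the entry-point names `normalize_unicode`, `normalize_graphical`, and `tokenize_in_stacks` are guesses based on the linked files and should be checked against the actual Botok modules before use.

```python
from collections import Counter

# Assumed entry points; verify the real names in the linked Botok files above.
from botok.utils.unicode_normalization import normalize_unicode        # name is an assumption
from botok.utils.lenient_normalization import normalize_graphical      # name is an assumption
from botok.tokenizers.stacktokenizer import tokenize_in_stacks         # name is an assumption

def stack_frequencies(corpus_text):
    """Apply both Botok normalizers, tokenize into stacks, and count them."""
    text = normalize_unicode(corpus_text)   # unicode-level normalization
    text = normalize_graphical(text)        # lenient (graphical) normalization
    stacks = tokenize_in_stacks(text)       # split the text into Tibetan stacks
    return Counter(stacks)
```

Running the same function over both corpora would make the Norbuketaka and Google Books frequency lists directly comparable.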
Some constraints to implement are:
Subtasks:
Completion Criteria: A list of stacks and their frequencies for the Norbuketaka and Google Books data.