Open gangagyatso4364 opened 1 year ago
@gangagyatso4364 I am proposing the following definitions of non_word and non_bo_word. A token is a non_word if its pos tag is NON_WORD and its skrt attribute is false. A token is a non_bo_word if its chunk_type attribute is LATIN, CJK, or OTHER; in the OTHER case, its skrt attribute must also be false. Also, for an RFC like this I would love to see a flowchart, to understand things faster and better.
@ta4tsering and I have gone through the RFC and updated a test case and its expected output. Please go through them as well.
Yes, I will go through it.
RFC0068: Using Botok for Analyzing OCR Text Quality in Openpecha-data
Named Concepts
Summary
This RFC proposes the use of the Botok library to analyze the quality of OCR text within the openpecha-data repository. Specifically, it aims to identify illegible or unusable text portions within the e-texts that contain non-words.
Dependencies
Infrastructures
Design Illustrations
Explanation: The pipeline first sets up the necessary environment and takes a list of Pecha IDs as input. Next, it extracts and processes the text from the corresponding GitHub repositories, using the Botok library to tokenize the text and flag non-words and non-Tibetan words. Finally, it compiles the results and writes them to a JSON file, with error handling integrated throughout the process.
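The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the RFC's actual implementation: `analyze_pechas`, `load_text`, and `count_tokens` are hypothetical names, the repository access and the Botok-based counting are passed in as callables, and the shape of the JSON output is a guess.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def analyze_pechas(
    pecha_ids: Iterable[str],
    load_text: Callable[[str], str],                 # hypothetical: fetches the text of one pecha
    count_tokens: Callable[[str], Dict[str, int]],   # hypothetical: Botok-based counting
    output_path: Path,
) -> Dict[str, dict]:
    """Analyze each pecha, record failures per ID, and write everything to JSON."""
    results: Dict[str, dict] = {}
    for pecha_id in pecha_ids:
        try:
            text = load_text(pecha_id)
            results[pecha_id] = count_tokens(text)
        except Exception as exc:
            # One failing pecha (missing repo, bad encoding, ...) should not
            # stop the whole run; store the error next to its ID instead.
            results[pecha_id] = {"error": str(exc)}

    # ensure_ascii=False keeps any Tibetan text readable in the output file.
    output_path.write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return results
```

Passing the loader and the counter in as arguments keeps the error handling and the output step testable without network access or a Botok installation.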
Justification
The proposed design was selected over alternatives because Botok is a well-established NLP library that can be used to analyze and process text effectively. It offers robust tools for identifying non-words and assessing text quality. Utilizing Botok aligns with industry best practices for NLP tasks.
Alternative approaches, such as building custom OCR quality assessment tools, would likely require significant development effort and might not be as accurate or efficient as a specialized library like Botok.
Testing
desired_output in dict:
test string = "abdul kalamའཁྱེད་ལ་སུན་པོ་ཀཟོས་པར་དགོངས་དག་ཞུ་ཨུམ་ཨུམ་་་། ()$%322 ༣༢༢་�� 你好吗 कैसे ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ"
expected_non_word = 4
expected_non_bo = 4
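A minimal sketch of how these counts could be computed, following the non_word and non_bo_word definitions proposed in the comments on this issue. It assumes Botok tokens expose `pos`, `chunk_type`, and `skrt` attributes with the values named there. `StubToken` and the helper names are hypothetical stand-ins so the logic can be exercised without Botok installed, and the stub tokens in the usage note are hand-made, not real tokenizer output.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Chunk types treated as non-Tibetan, per the definition proposed in this issue.
NON_BO_CHUNK_TYPES = {"LATIN", "CJK", "OTHER"}


@dataclass
class StubToken:
    """Hypothetical stand-in carrying only the attributes the checks read."""
    text: str
    pos: str = ""
    chunk_type: str = "TEXT"
    skrt: bool = False


def is_non_word(token) -> bool:
    # pos tag is NON_WORD and the token is not flagged as Sanskrit.
    return token.pos == "NON_WORD" and not token.skrt


def is_non_bo_word(token) -> bool:
    if token.chunk_type not in NON_BO_CHUNK_TYPES:
        return False
    # OTHER chunks count only when they are not Sanskrit.
    if token.chunk_type == "OTHER":
        return not token.skrt
    return True


def count_non_words(tokens: Iterable) -> Dict[str, int]:
    tokens = list(tokens)
    return {
        "total_tokens": len(tokens),
        "non_word": sum(is_non_word(t) for t in tokens),
        "non_bo_word": sum(is_non_bo_word(t) for t in tokens),
    }


def tokenize_with_botok(text: str):
    # Imported lazily so the helpers above work without Botok installed.
    from botok import WordTokenizer
    return WordTokenizer().tokenize(text, split_affixes=False)
```

With Botok available, `count_non_words(tokenize_with_botok(test_string))` could then be compared against expected_non_word and expected_non_bo. With hand-made stub tokens, a Sanskrit token tagged NON_WORD and OTHER is counted in neither category:

```python
stubs = [
    StubToken("abdul kalam", chunk_type="LATIN"),
    StubToken("ཀཟོས་", pos="NON_WORD"),
    StubToken("你好吗", chunk_type="CJK"),
    StubToken("ཨོཾ་", pos="NON_WORD", chunk_type="OTHER", skrt=True),
    StubToken("ཁྱེད་", pos="PRON"),
]
print(count_non_words(stubs))
```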
Implementation Steps
[ ] OpenPecha/pechadata_analysis#1 Estimated time: 3 hours Actual time: 2 hours
[ ] OpenPecha/pechadata_analysis#2 Estimated time: 3 hours Actual time: 3 hours
[ ] OpenPecha/pechadata_analysis#3 Estimated time: 3 hours Actual time: 4 hours
[ ] OpenPecha/pechadata_analysis#4 Estimated time: 3 hours Actual time: 3 hours
[ ] OpenPecha/pechadata_analysis#5 Estimated time: 3 hours Actual time: 4 hours
[ ] OpenPecha/pechadata_analysis#6 Estimated time: 3 hours Actual time: 5 hours
[ ] OpenPecha/pechadata_analysis#7 Estimated time: 3 hours Actual time:
Reviewed By @ta4tsering