gangagyatso4364 commented 1 year ago

RFC0068: Using Botok for Analyzing OCR Text Quality in Openpecha-data

Named Concepts

OCR: Optical Character Recognition, a technology to convert scanned or printed text into machine-encoded text.
OpenpechaFormat (opf): The format followed by openpecha-data. Use the openpecha toolkit to parse an opf.
Botok: A Tibetan word tokenizer, used here as the natural language processing library.

Summary

This RFC proposes the use of the Botok library to analyze the quality of OCR text within the openpecha-data repository. Specifically, it aims to identify illegible or unusable text portions within the e-texts that contain non-words.

Dependencies

Botok library
Openpecha-toolkit Library
openpecha-data repository

Infrastructures

Local development environment with access to openpecha-data repository.

Design Illustrations


Explanation: First, the script sets up the necessary environment and takes a list of Pecha IDs. Next, it extracts and processes text from the corresponding GitHub repositories, using the Botok library to tokenize the text and analyze it for non-words and non-Tibetan words. Finally, it compiles the results and writes them to a JSON file, with error handling integrated throughout the process.
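The flow described above could be sketched roughly as follows. This is only an illustration under assumptions: `analyze_pechas`, `write_report`, `load_base_texts` and `analyze_text` are hypothetical names that do not come from the RFC, and the loader and analyzer are passed in as callables so that the openpecha toolkit and Botok calls stay outside the sketch.

```python
import json
from typing import Callable, Dict, Iterable


def analyze_pechas(
    pecha_ids: Iterable[str],
    load_base_texts: Callable[[str], Dict[str, str]],
    analyze_text: Callable[[str], Dict[str, int]],
) -> Dict[str, dict]:
    """Analyze every base text of every pecha, recording failures per pecha."""
    report: Dict[str, dict] = {}
    for pecha_id in pecha_ids:
        try:
            # Hypothetical loader: returns {base text file name: text}.
            base_texts = load_base_texts(pecha_id)
            report[pecha_id] = {
                name: analyze_text(text) for name, text in base_texts.items()
            }
        except Exception as exc:
            # Broad on purpose: one unreadable pecha should not stop the run.
            report[pecha_id] = {"error": f"{type(exc).__name__}: {exc}"}
    return report


def write_report(report: Dict[str, dict], path: str) -> None:
    """Write the compiled results to a JSON file, keeping Tibetan text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(report, f, ensure_ascii=False, indent=2)
```

Keeping the loader and the analyzer as parameters is a design choice of this sketch, not of the RFC: it lets the error handling be exercised with stub functions before the real repositories are involved.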

Justification

The proposed design was selected over alternatives because Botok is a well-established NLP library that can tokenize and analyze text effectively. It provides tools for identifying non-words, which this RFC uses to assess text quality. Reusing an existing, specialized library instead of writing new tooling follows common practice for NLP tasks.

Alternative approaches, such as building custom OCR quality assessment tools, would likely require significant development effort and may not be as accurate or efficient as a specialized library like Botok.

Testing

Desired output, as a dict:

opf_id :
      base_text : base_text_name.txt
      character_start : 1190
      character_end : 2190
      total_word_count : 1000
      non_word_count : 25
      non_bo_word_count : 40
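As a minimal sketch of that structure, the record for one text span could be built and serialized like this. `make_record` is a hypothetical helper, not part of the RFC, and the range check is an added assumption.

```python
import json


def make_record(
    opf_id: str,
    base_text: str,
    character_start: int,
    character_end: int,
    total_word_count: int,
    non_word_count: int,
    non_bo_word_count: int,
) -> dict:
    """Build one report entry using the field names from the desired output."""
    if character_end < character_start:
        # Assumption of this sketch: the span is given as start <= end.
        raise ValueError("character_end must not be smaller than character_start")
    return {
        opf_id: {
            "base_text": base_text,
            "character_start": character_start,
            "character_end": character_end,
            "total_word_count": total_word_count,
            "non_word_count": non_word_count,
            "non_bo_word_count": non_bo_word_count,
        }
    }


# Placeholder values copied from the desired output above.
record = make_record("opf_id", "base_text_name.txt", 1190, 2190, 1000, 25, 40)
print(json.dumps(record, ensure_ascii=False, indent=2))
```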

test string = "abdul kalamའཁྱེད་ལ་སུན་པོ་ཀཟོས་པར་དགོངས་དག་ཞུ་ཨུམ་ཨུམ་་་། ()$%322 ༣༢༢་�� 你好吗 कैसे ཨོཾ་མ་ཎི་པདྨེ་ཧཱུྃ"
expected_non_word = 4
expected_non_bo = 4

Implementation Steps

[ ] OpenPecha/pechadata_analysis#1 Estimated time: 3 hours Actual time: 2 hours
[ ] OpenPecha/pechadata_analysis#2 Estimated time: 3 hours Actual time: 3 hours
[ ] OpenPecha/pechadata_analysis#3 Estimated time: 3 hours Actual time: 4 hours
[ ] OpenPecha/pechadata_analysis#4 Estimated time: 3 hours Actual time: 3 hours
[ ] OpenPecha/pechadata_analysis#5 Estimated time: 3 hours Actual time: 4 hours
[ ] OpenPecha/pechadata_analysis#6 Estimated time: 3 hours Actual time: 5 hours
[ ] OpenPecha/pechadata_analysis#7 Estimated time: 3 hours Actual time:

Reviewed By @ta4tsering

kaldan007 commented 12 months ago

@gangagyatso4364 these are the definitions of non word and non bo word I am proposing:

non_word: a token whose pos tag is NON_WORD and whose skrt attribute is false.
non_bo_word: a token whose chunk_type attribute is LATIN, CJK or OTHER. In the case of OTHER, the skrt attribute of the token must also be false.

Plus, for an RFC like this I would love to see a flowchart, to understand things faster and better.

@ta4tsering and I have gone through the RFC and updated a test case and its expected output. Please go through them as well.
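The non_word and non_bo_word definitions proposed in this comment could be expressed as two predicates. This is a sketch under assumptions: `Tok` is a hypothetical stand-in that carries only the attributes named above (`pos`, `chunk_type`, `skrt`), and `count_tokens` and its output keys are illustrative names borrowed from the desired output in the RFC.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Chunk types treated as non-Tibetan in the proposed definition.
NON_BO_CHUNK_TYPES = {"LATIN", "CJK", "OTHER"}


@dataclass
class Tok:
    """Hypothetical stand-in for a tokenizer token (not the Botok class)."""
    text: str
    pos: str = ""
    chunk_type: str = "TEXT"
    skrt: bool = False


def is_non_word(tok) -> bool:
    # non_word: pos tag is NON_WORD and the token is not flagged as Sanskrit.
    return tok.pos == "NON_WORD" and not tok.skrt


def is_non_bo_word(tok) -> bool:
    # non_bo_word: chunk_type is LATIN, CJK or OTHER ...
    if tok.chunk_type not in NON_BO_CHUNK_TYPES:
        return False
    # ... but an OTHER chunk only counts when skrt is false.
    if tok.chunk_type == "OTHER" and tok.skrt:
        return False
    return True


def count_tokens(tokens: Iterable) -> Dict[str, int]:
    """Count all tokens, non-words and non-Tibetan words in one pass."""
    toks = list(tokens)
    return {
        "total_word_count": len(toks),
        "non_word_count": sum(is_non_word(t) for t in toks),
        "non_bo_word_count": sum(is_non_bo_word(t) for t in toks),
    }
```

The predicates only read attributes, so tokens produced by Botok's word tokenizer should be usable in place of `Tok`, assuming they expose `pos`, `chunk_type` and `skrt` under those names.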

gangagyatso4364 commented 12 months ago

yes will go through it
