RFC0081: A performant(!) script for filtering line duplicates or very similar lines

Named Concepts

Line Similarity Check: In text processing and data preparation for machine learning (ML) models, the computational process of determining how similar two lines of text are to each other. It is a key preprocessing step for ensuring the quality and diversity of the training dataset.

Threshold: In a line similarity check, a predefined value that sets the degree of similarity at which two lines of text are considered similar. It is the key parameter for identifying and filtering out duplicate or near-duplicate lines within a dataset.

Edit distance: A way of quantifying how dissimilar two strings are, measured by counting the minimum number of operations required to transform one string into the other.

Summary

Duplication or near-duplication in training data can affect the learning process of an OCR model. When a model is repeatedly exposed to the same or similar data, it tends to overfit, meaning it performs well on the training data but poorly on new, unseen data. That is why we need to filter the data and remove the duplicates.

Here, each dataset is a collection of text files, each containing only a single line. This package should be able to perform both intra-dataset and inter-dataset similarity checks.
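The two kinds of check can be sketched as follows. This is a minimal illustration, not the package's actual API: the function names, the `.txt` extension, the UTF-8 encoding, and the `is_similar` predicate (any function that decides whether two lines count as similar) are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical predicate type: returns True when two lines count as similar.
Similar = Callable[[str, str], bool]


def load_dataset(directory: Path) -> Dict[str, str]:
    """Map each file name to the single line of text it contains.

    Assumes the files end in .txt and are UTF-8 encoded.
    """
    return {
        path.name: path.read_text(encoding="utf-8").strip()
        for path in sorted(directory.glob("*.txt"))
    }


def filter_intra(dataset: Dict[str, str], is_similar: Similar) -> Dict[str, str]:
    """Intra-dataset check: keep a line only if it is not similar
    to any line already kept from the same dataset."""
    kept: Dict[str, str] = {}
    for name, line in dataset.items():
        if not any(is_similar(line, other) for other in kept.values()):
            kept[name] = line
    return kept


def filter_inter(
    dataset: Dict[str, str], reference: Dict[str, str], is_similar: Similar
) -> Dict[str, str]:
    """Inter-dataset check: keep a line only if it is not similar
    to any line in the reference dataset."""
    return {
        name: line
        for name, line in dataset.items()
        if not any(is_similar(line, ref) for ref in reference.values())
    }
```

Both functions compare lines pairwise, so the cost grows quadratically with dataset size. A performant implementation would likely need to reduce the number of comparisons, for example by skipping pairs whose lengths differ too much to fall under the threshold.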

Damerau-Levenshtein distance will be used to measure the edit distance between two strings. The score is then normalized to a value between 0 and 1:

normalized Damerau-Levenshtein distance = Damerau-Levenshtein distance / max(length of line 1, length of line 2)

A score of 0 means the two lines are identical; scores closer to 1 mean they differ more.
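A minimal pure-Python sketch of the distance and its normalization is shown below. It implements the restricted variant of Damerau-Levenshtein (also called optimal string alignment), which is an assumption: the RFC does not say which variant is intended. The function names, the default threshold of 0.2, and the rule "similar when the normalized distance is at or below the threshold" are likewise illustrative choices, not part of the proposal.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance divided by the length of the longer line, in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty lines are identical
    return osa_distance(a, b) / longest


def is_similar(a: str, b: str, threshold: float = 0.2) -> bool:
    """Hypothetical decision rule: similar when the score is at or below the threshold."""
    return normalized_distance(a, b) <= threshold
```

For example, `normalized_distance("abcd", "abdc")` is 0.25: one transposition divided by a maximum length of 4. Lengths here are counted in Unicode code points, which is how Python's `len` treats strings.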

Dependencies

tensorflow

Infrastructures

None

Design Illustrations

[Figure: edit distance]

Justification

1. Damerau-Levenshtein distance is chosen because it covers all four edit operations:

Insertion: Adding a single character.
Deletion: Removing a single character.
Substitution: Changing one character to another.
Transposition: Swapping two adjacent characters.

2. Calculating Damerau-Levenshtein distance does not require importing packages such as tensorflow or sklearn.
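The four operations listed under point 1 can be written as a recurrence. The form below is a sketch for the restricted (optimal string alignment) variant, which is an assumption since the variant is not specified here. The symbols are illustrative: $a$ and $b$ are the two lines, and $d(i, j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.

```latex
d(i, 0) = i, \qquad d(0, j) = j

d(i, j) = \min
\begin{cases}
d(i-1, j) + 1 & \text{(deletion)} \\
d(i, j-1) + 1 & \text{(insertion)} \\
d(i-1, j-1) + [a_i \neq b_j] & \text{(substitution)} \\
d(i-2, j-2) + 1 & \text{(transposition, if } i, j > 1,\ a_i = b_{j-1},\ a_{i-1} = b_j \text{)}
\end{cases}
```

Here $[a_i \neq b_j]$ is 1 when the characters differ and 0 when they match. The distance between the full lines is $d(|a|, |b|)$.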

Implementation Steps

List all the steps involved during implementation.

[ ] OpenPecha/line-similarity-checker#1 Estimated time: 2 hours Actual time:
[ ] OpenPecha/line-similarity-checker#2 Estimated time: 3 hours Actual time:
[ ] OpenPecha/line-similarity-checker#3 Estimated time: 3 hours Actual time:
[ ] OpenPecha/line-similarity-checker#4 Estimated time: 2 hours Actual time:

Reviewed By

@ta4tsering
