shrivastava95 commented 1 year ago

Description

This issue plans how to break down the task of parsing different types of PDFs, based on which content types are available in each, and lists the problems that need to be addressed first.

Problems

[ ] Text Extraction
- [ ] OCR-based text extraction method with high accuracy
- [ ] substitution-cipher-based solution for PDFs with a text layer available
[ ] Text / Object boundary Detection
- [ ] detection of columns
- [ ] handling of tables and tabular data
- [ ] creation of a small UI for defining and editing text boundaries, so that OCR can be carried out within each boundary
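
For the column-detection item, one common starting point is a vertical projection profile: count the "ink" in each pixel column of a binarised page and treat wide empty runs as gutters between text columns. The sketch below is only an illustration under that assumption; `find_columns`, `min_gutter`, and the 0/1 grid representation are hypothetical names and choices, not part of any existing code in this repository.

```python
def find_columns(page: list[list[int]], min_gutter: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) x-ranges of text columns; end is exclusive.

    `page` is a binarised image as rows of 0 (blank) / 1 (ink).
    `min_gutter` is the assumed minimum width of a blank run that
    separates two columns (hypothetical tuning parameter).
    """
    if not page:
        return []
    width = len(page[0])
    # Vertical projection: amount of ink at each x position.
    profile = [sum(row[x] for row in page) for x in range(width)]

    columns = []
    start = None  # x where the current column began, if any
    gap = 0       # length of the current blank run inside a column
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gutter:
                # The blank run is wide enough to count as a gutter.
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last column, trimming any trailing blank run.
        columns.append((start, width - gap))
    return columns
```

Blank runs narrower than `min_gutter` (for example, spaces between words) are kept inside a column, which is why the threshold matters in practice.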

Solutions

We will organise the solutions by the type of PDF available.

1. Text Layer

Text in a PDF can be rendered in many different ways. Many of the PDFs we are interested in are displayed using two layers: a text layer and an image layer. The image layer contains the rendering of the characters displayed on the screen, while the text layer contains the Unicode symbols that one obtains by copy-pasting. Each unique Unicode symbol in the text layer is assigned its own character to be rendered in the image layer at the corresponding position. Although the characters in the text layer are often mapped to entirely different characters in the image layer, the mapping is reversible: recovering it essentially reduces to solving a substitution cipher.
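
As a minimal sketch of the substitution-cipher idea, assume we have a short sample where the text-layer string is aligned character by character with the correct displayed text (for instance, typed in manually or taken from OCR). The mapping can then be learned by majority vote and applied to the rest of the text layer. The function names `learn_mapping` and `decode` are hypothetical, and the sample uses a simple letter shift only for illustration.

```python
from collections import Counter, defaultdict


def learn_mapping(garbled: str, reference: str) -> dict[str, str]:
    """Infer a text-layer -> displayed-character mapping from an aligned sample."""
    if len(garbled) != len(reference):
        raise ValueError("samples must be aligned character by character")
    votes: dict[str, Counter] = defaultdict(Counter)
    for g, r in zip(garbled, reference):
        votes[g][r] += 1
    # Majority vote tolerates a few mistakes in the reference (e.g. OCR noise).
    return {g: counts.most_common(1)[0][0] for g, counts in votes.items()}


def decode(text: str, mapping: dict[str, str]) -> str:
    """Apply the mapping; symbols never seen in the sample are left unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


# Illustrative sample: the text layer holds "khoor zruog" where the page shows "hello world".
mapping = learn_mapping("khoor zruog", "hello world")
decoded = decode("orug", mapping)
```

Without an aligned sample, the mapping would have to be estimated some other way, such as character-frequency analysis against a corpus in the target language; that is not shown here.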

2. Scanned PDFs

Scanned PDFs do not contain selectable text, so they must be processed using OCR.

3. PDFs in other languages

Our aim is for the techniques above to generalise well across a wide range of supported languages. The pipeline should work in a plug-and-play fashion, so that individual components can be swapped out and replaced when expanding to new languages.
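
One way to picture the plug-and-play design is a small registry that maps a language code to an extractor function, so that a component can be replaced without touching the rest of the pipeline. This is only a sketch under that assumption; `Pipeline`, `register`, `extract`, and the language codes are hypothetical names, and the extractors here are stand-ins for real OCR or text-layer components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# An extractor takes raw page bytes and returns the extracted text.
Extractor = Callable[[bytes], str]


@dataclass
class Pipeline:
    """Hypothetical registry of per-language text extractors."""
    extractors: Dict[str, Extractor] = field(default_factory=dict)

    def register(self, language: str, extractor: Extractor) -> None:
        # Registering again for the same language swaps the component out.
        self.extractors[language] = extractor

    def extract(self, page: bytes, language: str) -> str:
        if language not in self.extractors:
            raise KeyError(f"no extractor registered for language {language!r}")
        return self.extractors[language](page)


# Stand-in extractors; real ones would wrap an OCR engine or a text-layer decoder.
pipeline = Pipeline()
pipeline.register("en", lambda page: page.decode("utf-8"))
pipeline.register("or", lambda page: page.decode("utf-8").upper())
```

Adding a new language then only requires one more `register` call with a suitable extractor.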

TakshPanchal commented 1 year ago

I am interested in this module; how can I help further?

TakshPanchal commented 1 year ago

Hey @GautamR-Samagra, I recently connected with @shrivastava95 about how this issue helps the Dictionary Augmented Transformers project, and I now understand the difficulties faced when parsing the dictionary. Currently, I am looking into ways to:

- fine-tune the Tesseract OCR model for the Odia-English dictionary
- find better ways to parse text; for example, I have tried libraries like Layout Parser to extract text blocks and then ran OCR on the cropped blocks

GautamR-Samagra commented 1 year ago

@TakshPanchal That is great! Any help on this is appreciated. This is required all across the sector. Once we can identify and parse paragraphs and blocks, the next step would be to identify tables as well and find the right tool to parse them.

Samagra-Development / ai-tools