climatepolicyradar / navigator

Policy navigator
BSD 3-Clause "New" or "Revised" License

Document processing > Extract passages from document using PDF parser #53

Closed chrisaballard closed 2 years ago

chrisaballard commented 2 years ago

When adding a new PDF document, the text contained in the PDF document should be directly extracted using a PDF parser.

As a quick initial baseline, implement a basic pipeline to extract text using a PDF parser. This should allow all text on each page in the PDF to be extracted. The output of the parser should be split into passages corresponding to sentences. The output is expected to be noisy but will be an initial baseline which can be improved and built on.
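As a rough illustration, such a baseline could look something like the sketch below, assuming PyMuPDF (`fitz`) as a stand-in parser and a naive regex sentence splitter; both choices are purely illustrative and the actual parser is still open at this point.

```python
import re

import fitz  # PyMuPDF; a stand-in here, the parser choice is still open


def extract_passages(pdf_path: str) -> list[dict]:
    """Extract raw page text and split it into rough sentence-level passages."""
    passages = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text()
            # Naive split on ., ! or ? followed by whitespace; the output will
            # be noisy, which is acceptable for a first baseline.
            sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
            passages.append({"page": page_number, "sentences": sentences})
    return passages
```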

chrisaballard commented 2 years ago

@kdutia and I agreed that the scope of this task should be limited to outputting the extracted text in text files. This will allow Kalyan to evaluate the quality of the extracted text using the research pipeline.

chrisaballard commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @eurolife @kdutia @opyate

chrisaballard commented 2 years ago

@kdutia @opyate here are my thoughts on the approach for the extract passages task. I'd welcome your thoughts on this to sense check my thinking...

Proposed approach

- Process a single PDF file
- Post-process the extracted text
- Process a batch of PDF files

Output

{
   "filename": "pdf_filename.pdf",
   "textBlocks" : [
                {
                     "text": ["some extracted text", "some more extracted text", ...],
                     "blockId": 1,
                     "pageId": 1
                },
                {
                     "text": ["some more extracted text", "some more extracted text", ...],
                     "blockId": 2,
                     "pageId": 1
                },
                {
                     "text": ["some more extracted text", "some more extracted text", ...],
                     "blockId": 1
                     "pageId": 2
                },
               ...
              ]
}
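For reference, the same structure expressed as Python type hints; the `TextBlock`/`ParserOutput` names are illustrative only, not a fixed API.

```python
from typing import TypedDict


class TextBlock(TypedDict):
    text: list[str]  # extracted passages within the block
    blockId: int     # position of the block within its page
    pageId: int      # 1-based page number


class ParserOutput(TypedDict):
    filename: str
    textBlocks: list[TextBlock]
```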

Some questions

kdutia commented 2 years ago

Looks good to me. Is the idea that the blockId encodes natural reading order? I have very little experience with DAG pipelines so happy to go with whatever you both think is best.

It looks like this also covers the work in #335.

opyate commented 2 years ago

> Create a pluggable processor which can be hooked in to process each page - this would allow the PDF parser to be replaced with image export + OCR easily in following sprints

For OCR, I would recommend stitching the pages together and treating the document as one large image, or a similar approach. I've seen PDFs where related content (words, diagrams, tables) is split across page boundaries, so processing each page in isolation would split that content too.

kdutia commented 2 years ago

@opyate good point. Most OCR SaaS offerings provide the ability to upload PDFs directly, and it's opaque whether they use embedded text or adjacent-page features to extract text for each page. (While it's unlikely that they do the latter, some heuristics or ML that we build might use features from surrounding pages.)

I think that stitching the pages together might introduce additional complexity as we'd then need to unstitch them e.g. for header and footer detection.

Based on this I think the interface to the pipeline should be a PDF, with a pre-processing step to extract images for each page if that's what needs to be done for the particular text extraction method.
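As a rough sketch of that interface and the pluggable processor idea from the proposal (names are illustrative only): the pipeline always takes a PDF, and each extraction strategy decides internally whether it needs page images first.

```python
from pathlib import Path
from typing import Protocol


class PageTextExtractor(Protocol):
    """Anything that can turn a PDF into per-page text blocks."""

    def extract(self, pdf_path: Path) -> list[dict]:
        """Return text blocks for every page of the PDF."""
        ...


def process_document(pdf_path: Path, extractor: PageTextExtractor) -> dict:
    """Run whichever extraction strategy is plugged in.

    An embedded-text extractor would read the PDF text layer directly; an
    OCR-based extractor would render each page to an image as its own
    pre-processing step, keeping the pipeline interface as a PDF either way.
    """
    return {"filename": pdf_path.name, "textBlocks": extractor.extract(pdf_path)}
```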

chrisaballard commented 2 years ago

> Looks good to me. Is the idea that the blockId encodes natural reading order?

Yes, that's what I was thinking of using the blockId for @kdutia - to make the order explicit rather than inferred from the position in the list.
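So a downstream consumer could always recover reading order explicitly, e.g. something along these lines (illustrative only):

```python
def blocks_in_reading_order(parser_output: dict) -> list[dict]:
    """Sort text blocks by page, then by block position within the page."""
    return sorted(
        parser_output["textBlocks"],
        key=lambda block: (block["pageId"], block["blockId"]),
    )
```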

> I have very little experience with DAG pipelines so happy to go with whatever you both think is best.

Regarding the pipeline, I'm a little hesitant to commit us to a particular workflow orchestration framework yet, at least until we have the final go/no-go on OCR based on the results and a better idea of which steps need to be applied. For the scope of this task, I think it would be best to write the steps as reusable functions, which can then easily be incorporated into a workflow later.
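To illustrate what I mean by reusable functions (a sketch only, with placeholder bodies; none of these names are final):

```python
from pathlib import Path


# Each step is a plain function with no orchestration framework attached, so an
# orchestrator (or a simple script) can wire them into a workflow later.

def parse_pdf(pdf_path: Path) -> dict:
    """Placeholder for the parser step, returning the block structure above."""
    return {"filename": pdf_path.name, "textBlocks": []}


def post_process(parsed: dict) -> dict:
    """Placeholder post-processing step, e.g. trimming whitespace in passages."""
    for block in parsed["textBlocks"]:
        block["text"] = [passage.strip() for passage in block["text"]]
    return parsed


def run(pdf_path: Path) -> dict:
    """Compose the steps as an ordinary function call chain."""
    return post_process(parse_pdf(pdf_path))
```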

chrisaballard commented 2 years ago

I have tested the PDF parser that is used by GROBID: pdfalto.

This generates an XML representation of the PDF structure in ALTO format.

This seems to have a number of benefits over the other parsers we have tried:

pdfalto handles a variety of cleaning processes: UTF-8 encoding and character composition, the recognition of superscript/subscript style and the robust recognition of line numbers for review manuscripts, the recovery of text order at block level, the detection of columns, etc. The detection of token boundaries, lines and block information are using XY projection and heuristics. pdfalto also extracts embedded bitmap (all converted into PNG) and vector graphics (in SVG), PDF metadata (XMP) and PDF annotations for further usage in GROBID.

I have written some test code to parse the XML output into the above JSON structure - see attached files.
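The attached files aren't reproduced here, but parsing the ALTO output into that structure might look roughly like the sketch below (element names follow the ALTO schema; this is an illustration, not the attached code):

```python
import xml.etree.ElementTree as ET


def alto_to_blocks(alto_xml_path: str, pdf_filename: str) -> dict:
    """Convert ALTO XML (as produced by pdfalto) into the textBlocks structure."""
    tree = ET.parse(alto_xml_path)
    output = {"filename": pdf_filename, "textBlocks": []}
    # "{*}" wildcards the ALTO namespace (Python 3.8+).
    for page_id, page in enumerate(tree.getroot().iterfind(".//{*}Page"), start=1):
        for block_id, block in enumerate(page.iterfind(".//{*}TextBlock"), start=1):
            lines = []
            for line in block.iterfind(".//{*}TextLine"):
                words = [s.get("CONTENT", "") for s in line.iterfind(".//{*}String")]
                lines.append(" ".join(words))
            output["textBlocks"].append(
                {"text": lines, "blockId": block_id, "pageId": page_id}
            )
    return output
```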

chrisaballard commented 2 years ago

I have started a Notion page which summarises the issues that have been observed in the text extracted from pdfalto and the post-processing steps that would be required to resolve those issues.

https://www.notion.so/climatepolicyradar/PDF-parser-issues-and-post-processing-requirements-370090b695104019aabe7ee815a3c55b

chrisaballard commented 2 years ago

Remaining changes for this sprint:

chrisaballard commented 2 years ago

I have added a new pipeline folder to the repo that contains the beginnings of a batch process to extract text from PDF documents.

The initial implementation is a CLI which can be used to extract the text, text blocks and positional information from a set of PDF documents. It uses the pdfalto parser (used by GROBID) to extract the text embedded in each document.
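Conceptually the batch step boils down to something like the sketch below; the basic `pdfalto <input.pdf> <output.xml>` invocation is shown without flags, and /pipeline/README.md describes the exact options the CLI actually uses.

```python
import subprocess
from pathlib import Path


def extract_batch(pdf_dir: Path, out_dir: Path) -> None:
    """Run pdfalto over every PDF in a directory, writing one ALTO XML file each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        xml_path = out_dir / (pdf_path.stem + ".xml")
        # Basic invocation; the real CLI may pass extra pdfalto flags.
        subprocess.run(["pdfalto", str(pdf_path), str(xml_path)], check=True)
```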

A Dockerfile is provided which can be used to install the pdfalto dependencies and run the pipeline. Usage of the CLI and instructions on how to build the Docker image are described in /pipeline/README.md.

Note: currently no unit tests have been defined, because we will need to create a new GitHub Action to run the pipeline tests in the pipeline Docker container. There was insufficient time in this sprint to do that, so I have raised a new issue #349 for this.

Raised PR https://github.com/climatepolicyradar/navigator/pull/350 containing the changes for review by @kdutia.