climatepolicyradar / navigator

Policy navigator
BSD 3-Clause "New" or "Revised" License

Document processing > Extract passages from document using PDF parser #53

Closed chrisaballard closed 2 years ago

chrisaballard commented 2 years ago

When adding a new PDF document, the text contained in the PDF document should be directly extracted using a PDF parser.

As a quick initial baseline, implement a basic pipeline to extract text using a PDF parser. This should allow all text on each page in the PDF to be extracted. The output of the parser should be split into passages corresponding to sentences. The output is expected to be noisy but will be an initial baseline which can be improved and built on.
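As a rough illustration, such a baseline could look something like the sketch below, assuming PyMuPDF (`fitz`) as a stand-in parser and a naive regex sentence splitter; both choices are purely illustrative and the actual parser is still open at this point.

```python
import re

import fitz  # PyMuPDF; a stand-in here, the parser choice is still open


def extract_passages(pdf_path: str) -> list[dict]:
    """Extract raw page text and split it into rough sentence-level passages."""
    passages = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text()
            # Naive split on ., ! or ? followed by whitespace; the output will
            # be noisy, which is acceptable for a first baseline.
            sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
            passages.append({"page": page_number, "sentences": sentences})
    return passages
```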

chrisaballard commented 2 years ago

@kdutia and I agreed that the scope of this task should be limited to outputting the extracted text in text files. This will allow Kalyan to evaluate the quality of the extracted text using the research pipeline.

chrisaballard commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @eurolife @kdutia @opyate

chrisaballard commented 2 years ago

@kdutia @opyate here are my thoughts on the approach for the extract passages task. I'd welcome your thoughts on this to sense check my thinking...

Proposed approach

- Process a single PDF file
- Post-process the extracted text
- Process a batch of PDF files

Output

{
   "filename": "pdf_filename.pdf",
   "textBlocks" : [
                {
                     "text": ["some extracted text", "some more extracted text", ...],
                     "blockId": 1,
                     "pageId": 1
                },
                {
                     "text": ["some more extracted text", "some more extracted text", ...],
                     "blockId": 2,
                     "pageId": 1
                },
                {
                     "text": ["some more extracted text", "some more extracted text", ...],
                     "blockId": 1
                     "pageId": 2
                },
               ...
              ]
}
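For reference, the same structure expressed as Python type hints; the `TextBlock`/`ParserOutput` names are illustrative only, not a fixed API.

```python
from typing import TypedDict


class TextBlock(TypedDict):
    text: list[str]  # extracted passages within the block
    blockId: int     # position of the block within its page
    pageId: int      # 1-based page number


class ParserOutput(TypedDict):
    filename: str
    textBlocks: list[TextBlock]
```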

Some questions

kdutia commented 2 years ago

Looks good to me. Is the idea that the blockId encodes natural reading order? I have very little experience with DAG pipelines so happy to go with whatever you both think is best.

It looks like this also covers the work in #335.

opyate commented 2 years ago

> Create a pluggable processor which can be hooked in to process each page - this would allow the PDF parser to be replaced with image export + OCR easily in following sprints

For OCR, I would recommend stitching the pages together and treating the document as one large image, or a similar approach. I've seen PDFs where related content (words, diagrams, tables) is split across page boundaries, so processing each page in isolation would split that content too.

kdutia commented 2 years ago

@opyate good point. Most OCR SaaS offerings provide the ability to upload PDFs directly, and it's opaque whether they use embedded text or adjacent-page features to extract text for each page. (While it's unlikely that they do the latter, some heuristics or ML that we build might use features from surrounding pages.)

I think that stitching the pages together might introduce additional complexity as we'd then need to unstitch them e.g. for header and footer detection.

Based on this I think the interface to the pipeline should be a PDF, with a pre-processing step to extract images for each page if that's what needs to be done for the particular text extraction method.
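As a rough sketch of that interface and the pluggable processor idea from the proposal (names are illustrative only): the pipeline always takes a PDF, and each extraction strategy decides internally whether it needs page images first.

```python
from pathlib import Path
from typing import Protocol


class PageTextExtractor(Protocol):
    """Anything that can turn a PDF into per-page text blocks."""

    def extract(self, pdf_path: Path) -> list[dict]:
        """Return text blocks for every page of the PDF."""
        ...


def process_document(pdf_path: Path, extractor: PageTextExtractor) -> dict:
    """Run whichever extraction strategy is plugged in.

    An embedded-text extractor would read the PDF text layer directly; an
    OCR-based extractor would render each page to an image as its own
    pre-processing step, keeping the pipeline interface as a PDF either way.
    """
    return {"filename": pdf_path.name, "textBlocks": extractor.extract(pdf_path)}
```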

chrisaballard commented 2 years ago

> Looks good to me. Is the idea that the blockId encodes natural reading order?

Yes, that's what I was thinking of using the blockId for @kdutia - to make the order explicit rather than inferred from the position in the list.
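So a downstream consumer could always recover reading order explicitly, e.g. something along these lines (illustrative only):

```python
def blocks_in_reading_order(parser_output: dict) -> list[dict]:
    """Sort text blocks by page, then by block position within the page."""
    return sorted(
        parser_output["textBlocks"],
        key=lambda block: (block["pageId"], block["blockId"]),
    )
```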

> I have very little experience with DAG pipelines so happy to go with whatever you both think is best.

Regarding the pipeline, I'm a little hesitant to commit us to a particular workflow orchestration framework yet, at least until we have the final go/no-go on OCR based on the results and a better idea of which steps need to be applied. For the scope of this task, I think it would be best to write the steps as reusable functions, which can then easily be incorporated into a workflow later.
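To illustrate what I mean by reusable functions (a sketch only, with placeholder bodies; none of these names are final):

```python
from pathlib import Path


# Each step is a plain function with no orchestration framework attached, so an
# orchestrator (or a simple script) can wire them into a workflow later.

def parse_pdf(pdf_path: Path) -> dict:
    """Placeholder for the parser step, returning the block structure above."""
    return {"filename": pdf_path.name, "textBlocks": []}


def post_process(parsed: dict) -> dict:
    """Placeholder post-processing step, e.g. trimming whitespace in passages."""
    for block in parsed["textBlocks"]:
        block["text"] = [passage.strip() for passage in block["text"]]
    return parsed


def run(pdf_path: Path) -> dict:
    """Compose the steps as an ordinary function call chain."""
    return post_process(parse_pdf(pdf_path))
```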

chrisaballard commented 2 years ago

I have tested the PDF parser that is used by GROBID: pdfalto.

This generates an XML representation of the PDF structure in ALTO format.

This seems to have a number of benefits over the other parsers we have tried:

pdfalto handles a variety of cleaning processes: UTF-8 encoding and character composition, the recognition of superscript/subscript style and the robust recognition of line numbers for review manuscripts, the recovery of text order at block level, the detection of columns, etc. The detection of token boundaries, lines and block information are using XY projection and heuristics. pdfalto also extracts embedded bitmap (all converted into PNG) and vector graphics (in SVG), PDF metadata (XMP) and PDF annotations for further usage in GROBID.

I have written some test code to parse the XML output into the above JSON structure - see attached files.
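The attached files aren't reproduced here, but parsing the ALTO output into that structure might look roughly like the sketch below (element names follow the ALTO schema; this is an illustration, not the attached code):

```python
import xml.etree.ElementTree as ET


def alto_to_blocks(alto_xml_path: str, pdf_filename: str) -> dict:
    """Convert ALTO XML (as produced by pdfalto) into the textBlocks structure."""
    tree = ET.parse(alto_xml_path)
    output = {"filename": pdf_filename, "textBlocks": []}
    # "{*}" wildcards the ALTO namespace (Python 3.8+).
    for page_id, page in enumerate(tree.getroot().iterfind(".//{*}Page"), start=1):
        for block_id, block in enumerate(page.iterfind(".//{*}TextBlock"), start=1):
            lines = []
            for line in block.iterfind(".//{*}TextLine"):
                words = [s.get("CONTENT", "") for s in line.iterfind(".//{*}String")]
                lines.append(" ".join(words))
            output["textBlocks"].append(
                {"text": lines, "blockId": block_id, "pageId": page_id}
            )
    return output
```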

chrisaballard commented 2 years ago

I have started a Notion page which summarises the issues that have been observed in the text extracted from pdfalto and the post-processing steps that would be required to resolve those issues.

https://www.notion.so/climatepolicyradar/PDF-parser-issues-and-post-processing-requirements-370090b695104019aabe7ee815a3c55b

chrisaballard commented 2 years ago

Remaining changes for this sprint:

chrisaballard commented 2 years ago

I have added a new pipeline folder to the repo that contains the beginnings of a batch process to extract text from PDF documents.

The initial implementation is a CLI which can be used to extract the text, text blocks and positional information from a set of PDF documents. It uses the pdfalto parser (used by GROBID) to extract the text embedded in each document.
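Conceptually the batch step boils down to something like the sketch below; the basic `pdfalto <input.pdf> <output.xml>` invocation is shown without flags, and /pipeline/README.md describes the exact options the CLI actually uses.

```python
import subprocess
from pathlib import Path


def extract_batch(pdf_dir: Path, out_dir: Path) -> None:
    """Run pdfalto over every PDF in a directory, writing one ALTO XML file each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        xml_path = out_dir / (pdf_path.stem + ".xml")
        # Basic invocation; the real CLI may pass extra pdfalto flags.
        subprocess.run(["pdfalto", str(pdf_path), str(xml_path)], check=True)
```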

A Dockerfile is provided which can be used to install the pdfalto dependencies and run the pipeline. Usage of the CLI and instructions on how to build the Docker image are described in /pipeline/README.md.

Note: currently no unit tests have been defined, because we will need to create a new GitHub Action to run the pipeline tests in the pipeline Docker container. There was insufficient time in this sprint to do that, so I have raised a new issue #349 for this.

Raised PR https://github.com/climatepolicyradar/navigator/pull/350 containing the changes for review by @kdutia.