Filimoa / open-parse

Improved file parsing for LLMs
https://filimoa.github.io/open-parse/
MIT License
2.54k stars, 99 forks

Fixes LTAnno objects being skipped, which contain the needed whitespace for some PDFs #48

Closed: cipherCOM closed this 5 months ago

cipherCOM commented 6 months ago

I stumbled over the problem that the text extracted from some PDFs had no whitespace at all. From these Stack Overflow answers [1] [2] I understood that some PDF producers don't write whitespace as real characters, so it shows up only as LTAnno objects rather than LTChar objects.

This PR mitigates the problem by also taking LTAnno objects into account, so that the complete text is extracted from these PDFs.
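To illustrate the idea, here is a minimal sketch of why skipping LTAnno loses whitespace. It is not the code from this PR: `Char` and `Anno` are hypothetical stand-ins for pdfminer's `LTChar` and `LTAnno` (both of which expose `get_text()`), and `line_text` is a made-up helper name.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass
class Char:
    """Stand-in for pdfminer.layout.LTChar (a glyph drawn on the page)."""
    text: str

    def get_text(self) -> str:
        return self.text


@dataclass
class Anno:
    """Stand-in for pdfminer.layout.LTAnno (text inferred by layout analysis)."""
    text: str

    def get_text(self) -> str:
        return self.text


def line_text(items: Iterable[Union[Char, Anno]], keep_anno: bool = True) -> str:
    """Join the text of a line, optionally skipping Anno items."""
    return "".join(
        item.get_text()
        for item in items
        if keep_anno or not isinstance(item, Anno)
    )


line = [Char("t"), Char("h"), Char("e"), Anno(" "), Char("A"), Char("u"), Anno("\n")]
print(repr(line_text(line, keep_anno=False)))  # 'theAu'
print(repr(line_text(line)))                   # 'the Au\n'
```

With real pdfminer objects the same loop would iterate over an `LTTextLineHorizontal` and check `isinstance(item, (LTChar, LTAnno))` instead of the stand-ins.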

Filimoa commented 5 months ago

Fantastic! Do you have a sample doc for this? Just trying to understand the issue a little better.

cipherCOM commented 5 months ago

Hi Filimoa, sadly I can't share these documents as they contain copyrighted material. I've seen this on a few of them but not all, so it really seems to boil down to some kind of setting during the export process.

Maybe some of this metadata helps:

[ExifTool]      ExifTool Version Number         : 12.76
[System]        File Size                       : 10 MB
[File]          File Type                       : PDF
[File]          File Type Extension             : pdf
[File]          MIME Type                       : application/pdf
[PDF]           PDF Version                     : 1.4
[PDF]           Linearized                      : No
[PDF]           Page Count                      : 346
[PDF]           Create Date                     : 2022:11:09 15:02:52+10:00
[PDF]           Creator                         : Serif Affinity Publisher 1.10.5
[PDF]           GTS PDFX Version                : PDF/X-1a:2003
[PDF]           Modify Date                     : 2023:03:07 15:19:40+11:00
[PDF]           Producer                        : PDFlib+PDI 9.3.1-i (macOS (x86_64))
[PDF]           Trapped                         : False
[PDF]           Trapped                         : false
[XMP-x]         XMP Toolkit                     : Adobe XMP Core 9.0-c000 79.cca54b0, 2022/11/26-09:29:55
[XMP-xmpMM]     Version ID                      : 1
[XMP-xmpMM]     Rendition Class                 : default
[XMP-pdf]       Trapped                         : False
[XMP-pdf]       Producer                        : PDFlib+PDI 9.3.1-i (macOS (x86_64))
[XMP-pdfxid]    GTS PDFX Version                : PDF/X-1a:2003
[XMP-xmp]       Metadata Date                   : 2023:03:07 15:19:40+11:00
[XMP-xmp]       Create Date                     : 2022:11:09 15:02:52+10:00
[XMP-xmp]       Modify Date                     : 2023:03:07 15:19:40+11:00
[XMP-xmp]       Creator Tool                    : Serif Affinity Publisher 1.10.5
[XMP-pdfx]      Trapped                         : false
[XMP-dc]        Format                          : application/pdf

But I can at least give you this layout dump of one text line:

<LTTextLineHorizontal 165.455,405.630,239.040,415.630 'About the Author\n'>
<LTChar 165.455,405.630,172.335,415.630 matrix=[1.00,0.00,0.00,1.00, (165.46,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=6.880000000000001 text='A'>
<LTChar 172.389,405.630,177.519,415.630 matrix=[1.00,0.00,0.00,1.00, (172.39,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.13 text='b'>
<LTChar 177.519,405.630,182.549,415.630 matrix=[1.00,0.00,0.00,1.00, (177.52,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.03 text='o'>
<LTChar 182.549,405.630,188.069,415.630 matrix=[1.00,0.00,0.00,1.00, (182.55,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='u'>
<LTChar 188.069,405.630,191.279,415.630 matrix=[1.00,0.00,0.00,1.00, (188.07,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.21 text='t'>
<LTAnno ' '>
<LTChar 193.782,405.630,196.992,415.630 matrix=[1.00,0.00,0.00,1.00, (193.78,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.21 text='t'>
<LTChar 196.992,405.630,202.512,415.630 matrix=[1.00,0.00,0.00,1.00, (196.99,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='h'>
<LTChar 202.512,405.630,206.772,415.630 matrix=[1.00,0.00,0.00,1.00, (202.51,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=4.26 text='e'>
<LTAnno ' '>
<LTChar 209.275,405.630,216.155,415.630 matrix=[1.00,0.00,0.00,1.00, (209.28,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=6.880000000000001 text='A'>
<LTChar 215.940,405.630,221.460,415.630 matrix=[1.00,0.00,0.00,1.00, (215.94,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='u'>
<LTChar 221.460,405.630,224.670,415.630 matrix=[1.00,0.00,0.00,1.00, (221.46,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.21 text='t'>
<LTChar 224.670,405.630,230.190,415.630 matrix=[1.00,0.00,0.00,1.00, (224.67,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='h'>
<LTChar 230.190,405.630,235.220,415.630 matrix=[1.00,0.00,0.00,1.00, (230.19,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.03 text='o'>
<LTChar 235.220,405.630,239.040,415.630 matrix=[1.00,0.00,0.00,1.00, (235.22,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.8200000000000003 text='r'>
<LTAnno '\n'>
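
The dump also hints at where the LTAnno spaces come from. As far as I understand pdfminer.six's layout analysis, a space is inserted before a character when the horizontal gap to the previous character exceeds `word_margin * max(width, height)`, with `word_margin` defaulting to 0.1. The sketch below replays that rule on the x-coordinates from the dump. The rule as written here and the helper name `rebuild_line` are my assumptions, not code taken from pdfminer or open-parse.

```python
from typing import List, Tuple

# (text, x0, x1) copied from the LTChar lines in the dump above.
CHARS: List[Tuple[str, float, float]] = [
    ("A", 165.455, 172.335), ("b", 172.389, 177.519), ("o", 177.519, 182.549),
    ("u", 182.549, 188.069), ("t", 188.069, 191.279), ("t", 193.782, 196.992),
    ("h", 196.992, 202.512), ("e", 202.512, 206.772), ("A", 209.275, 216.155),
    ("u", 215.940, 221.460), ("t", 221.460, 224.670), ("h", 224.670, 230.190),
    ("o", 230.190, 235.220), ("r", 235.220, 239.040),
]
LINE_HEIGHT = 10.0  # 415.630 - 405.630, identical for every character here


def rebuild_line(chars, height: float, word_margin: float = 0.1) -> str:
    """Insert a space wherever the gap to the previous char exceeds the margin."""
    out = []
    prev_x1 = None
    for text, x0, x1 in chars:
        margin = word_margin * max(x1 - x0, height)
        if prev_x1 is not None and prev_x1 < x0 - margin:
            out.append(" ")  # this is where the dump shows <LTAnno ' '>
        out.append(text)
        prev_x1 = x1
    return "".join(out)


print(rebuild_line(CHARS, LINE_HEIGHT))  # About the Author
```

The two gaps of about 2.5 units (after "t" and after "e") exceed the 1.0 margin, which matches the two `<LTAnno ' '>` entries; all other gaps are close to zero.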

Filimoa commented 5 months ago

OK, I looked into this a little more. Any chance you could write a quick test for _extract_chars in src/tests/text/pdf_miner/test_core.py, covering your example plus a regular one, so we can merge?

cipherCOM commented 5 months ago

Sorry, but I can't help further at the moment. We have already removed our dependency on open-parse, but I wanted to at least share this fix and these findings with everyone in the hope that they help.

Filimoa commented 5 months ago

Appreciate the help on this a ton. I added a test and merged this in #51.