cipherCOM closed 5 months ago
Fantastic! Do you have a sample doc for this? Just trying to understand the issue a little better.
Hi Filimoa, sadly I can't share these documents as they contain copyrighted material. I've seen this on a few of them but not all, so it seems to boil down to some kind of setting during the export process.
Maybe some of this info helps:
[ExifTool] ExifTool Version Number : 12.76
[System] File Size : 10 MB
[File] File Type : PDF
[File] File Type Extension : pdf
[File] MIME Type : application/pdf
[PDF] PDF Version : 1.4
[PDF] Linearized : No
[PDF] Page Count : 346
[PDF] Create Date : 2022:11:09 15:02:52+10:00
[PDF] Creator : Serif Affinity Publisher 1.10.5
[PDF] GTS PDFX Version : PDF/X-1a:2003
[PDF] Modify Date : 2023:03:07 15:19:40+11:00
[PDF] Producer : PDFlib+PDI 9.3.1-i (macOS (x86_64))
[PDF] Trapped : False
[PDF] Trapped : false
[XMP-x] XMP Toolkit : Adobe XMP Core 9.0-c000 79.cca54b0, 2022/11/26-09:29:55
[XMP-xmpMM] Version ID : 1
[XMP-xmpMM] Rendition Class : default
[XMP-pdf] Trapped : False
[XMP-pdf] Producer : PDFlib+PDI 9.3.1-i (macOS (x86_64))
[XMP-pdfxid] GTS PDFX Version : PDF/X-1a:2003
[XMP-xmp] Metadata Date : 2023:03:07 15:19:40+11:00
[XMP-xmp] Create Date : 2022:11:09 15:02:52+10:00
[XMP-xmp] Modify Date : 2023:03:07 15:19:40+11:00
[XMP-xmp] Creator Tool : Serif Affinity Publisher 1.10.5
[XMP-pdfx] Trapped : false
[XMP-dc] Format : application/pdf
But I can give you this at least:
<LTTextLineHorizontal 165.455,405.630,239.040,415.630 'About the Author\n'>
<LTChar 165.455,405.630,172.335,415.630 matrix=[1.00,0.00,0.00,1.00, (165.46,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=6.880000000000001 text='A'>
<LTChar 172.389,405.630,177.519,415.630 matrix=[1.00,0.00,0.00,1.00, (172.39,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.13 text='b'>
<LTChar 177.519,405.630,182.549,415.630 matrix=[1.00,0.00,0.00,1.00, (177.52,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.03 text='o'>
<LTChar 182.549,405.630,188.069,415.630 matrix=[1.00,0.00,0.00,1.00, (182.55,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='u'>
<LTChar 188.069,405.630,191.279,415.630 matrix=[1.00,0.00,0.00,1.00, (188.07,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.21 text='t'>
<LTAnno ' '>
<LTChar 193.782,405.630,196.992,415.630 matrix=[1.00,0.00,0.00,1.00, (193.78,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.21 text='t'>
<LTChar 196.992,405.630,202.512,415.630 matrix=[1.00,0.00,0.00,1.00, (196.99,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='h'>
<LTChar 202.512,405.630,206.772,415.630 matrix=[1.00,0.00,0.00,1.00, (202.51,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=4.26 text='e'>
<LTAnno ' '>
<LTChar 209.275,405.630,216.155,415.630 matrix=[1.00,0.00,0.00,1.00, (209.28,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=6.880000000000001 text='A'>
<LTChar 215.940,405.630,221.460,415.630 matrix=[1.00,0.00,0.00,1.00, (215.94,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='u'>
<LTChar 221.460,405.630,224.670,415.630 matrix=[1.00,0.00,0.00,1.00, (221.46,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.21 text='t'>
<LTChar 224.670,405.630,230.190,415.630 matrix=[1.00,0.00,0.00,1.00, (224.67,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.5200000000000005 text='h'>
<LTChar 230.190,405.630,235.220,415.630 matrix=[1.00,0.00,0.00,1.00, (230.19,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=5.03 text='o'>
<LTChar 235.220,405.630,239.040,415.630 matrix=[1.00,0.00,0.00,1.00, (235.22,408.00)] font='RXHPNE+GoudyOldStyleT-Bold' adv=3.8200000000000003 text='r'>
<LTAnno '\n'>
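To make the dump above easier to read: the two `<LTAnno ' '>` entries sit exactly where there is a horizontal gap of about 2.5 pt between neighbouring glyphs, and there are no `LTChar` objects for the spaces. Below is a minimal sketch of how such a gap can be turned back into a space. It assumes a heuristic similar to pdfminer's `word_margin` (a gap wider than `word_margin` times the larger of the glyph's width and height counts as a word break). The coordinates are copied from the dump; `rebuild_line` and the constants are hypothetical names, not part of pdfminer or open-parse.

```python
# (x0, x1, text) for each LTChar in the dump above; all share y = 405.630..415.630.
CHARS = [
    (165.455, 172.335, "A"), (172.389, 177.519, "b"), (177.519, 182.549, "o"),
    (182.549, 188.069, "u"), (188.069, 191.279, "t"),
    (193.782, 196.992, "t"), (196.992, 202.512, "h"), (202.512, 206.772, "e"),
    (209.275, 216.155, "A"), (215.940, 221.460, "u"), (221.460, 224.670, "t"),
    (224.670, 230.190, "h"), (230.190, 235.220, "o"), (235.220, 239.040, "r"),
]
HEIGHT = 415.630 - 405.630  # glyph box height taken from the dump
WORD_MARGIN = 0.1           # assumption: mirrors pdfminer's default LAParams.word_margin


def rebuild_line(chars, height, word_margin=WORD_MARGIN):
    """Join glyph texts, inserting a space wherever the horizontal gap is large."""
    out = []
    prev_x1 = None
    for x0, x1, text in chars:
        # The allowed gap scales with the glyph's larger dimension.
        margin = word_margin * max(x1 - x0, height)
        if prev_x1 is not None and prev_x1 < x0 - margin:
            out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)


naive = "".join(text for _, _, text in CHARS)  # what you get from the glyphs alone
rebuilt = rebuild_line(CHARS, HEIGHT)
```

Here `naive` has no spaces at all, while `rebuilt` has a space at each of the two gaps, matching the positions of the `LTAnno` entries in the dump.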
OK, I looked into this a little more. Any chance you could write a quick test for `_extract_chars` with your example plus a regular example in `src/tests/text/pdf_miner/test_core.py`, and then we can merge?
Sorry, but I can't help further at the moment. We already removed our dependency on open-parse, but I wanted to at least share this fix and these findings with everyone in the hope that they help.
Appreciate the help on this a ton - added a test and merged in #51
I stumbled over the problem that text extracted from some PDFs had no whitespace at all. I understood from these Stack Overflow answers [1] [2] that some PDF processors optimize whitespace away, so that it shows up not as an `LTChar` but only as an `LTAnno`. This PR mitigates the problem by also taking `LTAnno` into account, so that the complete text is extracted from these PDFs.
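For readers who want the idea without opening the diff, here is a rough sketch of the difference between collecting only character objects and collecting annotations as well. It is not the code from this PR: `FakeChar` and `FakeAnno` are stand-ins that mimic only the `get_text()` method of pdfminer's `LTChar` and `LTAnno`, so the snippet runs without pdfminer installed, and `extract_text` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass
class FakeChar:
    """Stand-in for pdfminer's LTChar: a real glyph with a font."""
    text: str
    fontname: str = "RXHPNE+GoudyOldStyleT-Bold"

    def get_text(self) -> str:
        return self.text


@dataclass
class FakeAnno:
    """Stand-in for pdfminer's LTAnno: inferred whitespace with no font or bbox."""
    text: str

    def get_text(self) -> str:
        return self.text


def extract_text(line: Iterable[Union[FakeChar, FakeAnno]], include_anno: bool = True) -> str:
    """Concatenate the text of a line, optionally keeping annotation objects."""
    parts = []
    for obj in line:
        if isinstance(obj, FakeChar) or (include_anno and isinstance(obj, FakeAnno)):
            parts.append(obj.get_text())
    return "".join(parts)


# A shortened version of the 'About the Author' line from the thread.
line = [
    *(FakeChar(c) for c in "About"), FakeAnno(" "),
    *(FakeChar(c) for c in "the"), FakeAnno(" "),
    *(FakeChar(c) for c in "Author"), FakeAnno("\n"),
]

chars_only = extract_text(line, include_anno=False)
with_anno = extract_text(line, include_anno=True)
```

With real pdfminer objects the `isinstance` checks would target `pdfminer.layout.LTChar` and `pdfminer.layout.LTAnno` instead of the stand-ins. Since an annotation carries no font information of its own, a caller that needs per-character styling has to decide where that comes from, for example by reusing the style of the preceding character.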