euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Newline inserted after every character #90

Open jbrockmendel opened 9 years ago

jbrockmendel commented 9 years ago

When converting older PDFs (from 1998 and 1974 in the examples referenced below) to text, I get output with a newline inserted after every character. A string that should ideally look like:

Abstract
\tThis paper contends [...]

is instead returned as:

A
b
s
t
r
a
c
t

T
h
i
s

p
a
p
e
r

c
o
n
t
e
n
d
s

While trying to trace the problem, I looked at the intermediate XML representation. Each character appears to be assigned size="0.000", with a bbox that collapses to a single point, e.g.:

<textline bbox="170.698,584.640,170.698,584.640">
<text font="unknown" bbox="170.698,584.640,170.698,584.640" size="0.000"> </text>
<text>
</text>
</textline>
<textline bbox="179.760,584.640,179.760,584.640">
<text font="unknown" bbox="179.760,584.640,179.760,584.640" size="0.000">F</text>
<text>
</text>

In the other file, the first few paragraphs are extracted correctly, and the XML contains entries like:

<text font="CenturyGothic,Bold" bbox="64.320,105.838,68.938,113.039" size="7.201">A</text>

After those first few paragraphs, the output reverts to font="unknown" with size="0.000", and the newline issue appears again.
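
The symptom described here (a bbox whose two corners coincide) can be checked for mechanically. The following is a minimal sketch, not part of pdfminer: it assumes the XML has the shape shown in the excerpts above, wrapped in a single root element, and the function name `zero_size_glyphs` and the `SAMPLE` string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def zero_size_glyphs(xml_text):
    """Return (degenerate, total) counts of positioned <text> glyphs."""
    root = ET.fromstring(xml_text)
    degenerate = total = 0
    for node in root.iter("text"):
        bbox = node.get("bbox")
        if bbox is None:
            continue  # bare <text> elements hold only the line-ending newline
        x0, y0, x1, y1 = (float(v) for v in bbox.split(","))
        total += 1
        # A glyph is degenerate when its box collapses to a single point.
        if x0 == x1 and y0 == y1:
            degenerate += 1
    return degenerate, total


# Hypothetical sample built from the two kinds of entries quoted above.
SAMPLE = """<page>
<textline bbox="179.760,584.640,179.760,584.640">
<text font="unknown" bbox="179.760,584.640,179.760,584.640" size="0.000">F</text>
<text>
</text>
</textline>
<textline bbox="64.320,105.838,68.938,113.039">
<text font="CenturyGothic,Bold" bbox="64.320,105.838,68.938,113.039" size="7.201">A</text>
</textline>
</page>"""

print(zero_size_glyphs(SAMPLE))
```

A high ratio of degenerate glyphs on a page would suggest the font metrics were not resolved for that page, which matches the pattern seen in both example files.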

The two example files are linked below.

https://www.dropbox.com/s/cyu0vee584odhbh/Benston_Determinants-of-bid-asked-spreads-in-the-over-the-counter-market_1974.pdf?dl=0

https://www.dropbox.com/s/pqqu3aeu96pduai/Mayers_Why-firms-issue-convertible-bonds-The-matching-of-financial-and-real-investment-options_1998.pdf?dl=0

jsvine commented 7 years ago

I'm running into the same issue (font="unknown" and size="0.000") with this file:

U
.
S
.

D
e
p
a
r
t
m
e
n
t
[...]

FWIW, poppler-utils' pdftotext and pdftohtml tools seem to handle this file fine and recognize the fonts:

[...]
<page number="1" position="absolute" top="0" left="0" height="918" width="1188">
    <fontspec id="0" size="16" family="Times" color="#000000"/>
    <fontspec id="1" size="18" family="Times" color="#000000"/>
    <fontspec id="2" size="61" family="Times" color="#000000"/>
    <fontspec id="3" size="16" family="Times" color="#000000"/>
    <fontspec id="4" size="7" family="Times" color="#000000"/>
    <fontspec id="5" size="4" family="Times" color="#000000"/>
    <fontspec id="6" size="18" family="Times" color="#000000"/>
    <fontspec id="7" size="16" family="Times" color="#000000"/>
[...]
<text top="109" left="147" width="8" height="16" font="0"> </text>
<text top="126" left="54" width="8" height="16" font="0"> </text>
<text top="147" left="54" width="166" height="19" font="1">U.S. Department </text>
<text top="171" left="54" width="168" height="19" font="1">of Transportation </text>
[...]
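
As a possible stopgap on the pdfminer side: in the XML excerpt from the original report, the point-sized boxes still carry usable x/y positions (the two glyphs shown share y = 584.640). The sketch below regroups glyphs into lines by bucketing on that y value. It is an illustration only, not pdfminer code: `regroup_lines`, `y_tolerance`, and the `SAMPLE` data are hypothetical, and simple bucketing can split a line whose baseline falls on a bucket boundary.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def regroup_lines(xml_text, y_tolerance=1.0):
    """Rebuild text lines from per-glyph <text> elements using bbox positions."""
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for node in root.iter("text"):
        bbox = node.get("bbox")
        if bbox is None or not node.text:
            continue  # skip the bare <text> elements that hold only newlines
        x0, y0, _, _ = (float(v) for v in bbox.split(","))
        # Glyphs whose y0 falls in the same bucket are treated as one line.
        rows[round(y0 / y_tolerance)].append((x0, node.text))
    lines = []
    # PDF y coordinates grow upward, so larger y means nearer the page top.
    for key in sorted(rows, reverse=True):
        lines.append("".join(ch for _, ch in sorted(rows[key])))
    return lines


# Hypothetical input: four point-sized glyphs, deliberately out of order.
SAMPLE = """<page>
<textline bbox="179.760,584.640,179.760,584.640">
<text font="unknown" bbox="179.760,584.640,179.760,584.640" size="0.000">i</text>
</textline>
<textline bbox="170.698,570.000,170.698,570.000">
<text font="unknown" bbox="170.698,570.000,170.698,570.000" size="0.000">Y</text>
</textline>
<textline bbox="170.698,584.640,170.698,584.640">
<text font="unknown" bbox="170.698,584.640,170.698,584.640" size="0.000">H</text>
</textline>
<textline bbox="179.760,570.000,179.760,570.000">
<text font="unknown" bbox="179.760,570.000,179.760,570.000" size="0.000">o</text>
</textline>
</page>"""

print(regroup_lines(SAMPLE))  # ['Hi', 'Yo']
```

This only works around the symptom; the underlying problem of the fonts being reported as "unknown" remains.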