Open jbrockmendel opened 9 years ago
I'm running into the same issue (font="unknown" and size="0.000") with this file:
U
.
S
.
D
e
p
a
r
t
m
e
n
t
[...]
FWIW, it seems that poppler-utils' pdftotext and pdftohtml tools handle this test fine and recognize the fonts:
[...]
<page number="1" position="absolute" top="0" left="0" height="918" width="1188">
<fontspec id="0" size="16" family="Times" color="#000000"/>
<fontspec id="1" size="18" family="Times" color="#000000"/>
<fontspec id="2" size="61" family="Times" color="#000000"/>
<fontspec id="3" size="16" family="Times" color="#000000"/>
<fontspec id="4" size="7" family="Times" color="#000000"/>
<fontspec id="5" size="4" family="Times" color="#000000"/>
<fontspec id="6" size="18" family="Times" color="#000000"/>
<fontspec id="7" size="16" family="Times" color="#000000"/>
[...]
<text top="109" left="147" width="8" height="16" font="0"> </text>
<text top="126" left="54" width="8" height="16" font="0"> </text>
<text top="147" left="54" width="166" height="19" font="1">U.S. Department </text>
<text top="171" left="54" width="168" height="19" font="1">of Transportation </text>
[...]
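For anyone who wants to compare against the poppler output programmatically, here is a minimal sketch that resolves each text element's font id against the fontspec entries. The sample is trimmed from the output above and closed with a `</page>` tag so it parses on its own; the function name `text_with_fonts` is just an illustration, not part of any tool.

```python
import xml.etree.ElementTree as ET

# Trimmed from the pdftohtml output quoted above; the closing </page>
# tag is added here so the fragment is well-formed on its own.
SAMPLE = """<page number="1" position="absolute" top="0" left="0" height="918" width="1188">
<fontspec id="0" size="16" family="Times" color="#000000"/>
<fontspec id="1" size="18" family="Times" color="#000000"/>
<text top="109" left="147" width="8" height="16" font="0"> </text>
<text top="147" left="54" width="166" height="19" font="1">U.S. Department </text>
<text top="171" left="54" width="168" height="19" font="1">of Transportation </text>
</page>"""


def text_with_fonts(page_xml):
    """Return (text, family, size) for every non-blank text element."""
    page = ET.fromstring(page_xml)
    # Map fontspec id -> (family, size) so text elements can be resolved.
    fonts = {
        spec.get("id"): (spec.get("family"), int(spec.get("size")))
        for spec in page.iter("fontspec")
    }
    rows = []
    for node in page.iter("text"):
        # itertext() also picks up text nested in inline children.
        content = "".join(node.itertext()).strip()
        if not content:
            continue  # skip whitespace-only elements
        family, size = fonts.get(node.get("font"), ("unknown", 0))
        rows.append((content, family, size))
    return rows


if __name__ == "__main__":
    for row in text_with_fonts(SAMPLE):
        print(row)
```

On the sample, both "U.S. Department" and "of Transportation" resolve to Times at size 18, which is the font information missing from the XML described in this issue.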
In converting older PDFs (from 1998 and 1974 in the examples I'll reference below) to text, I am getting output with newlines inserted after every character. A string that ideally should look like:
is instead returned as:
In trying to trace the problem, I looked at the intermediate XML representation, and it appears that each character is being assigned size="0.000", with bbox values spanning only one point, e.g.:
In the other file, it gets the first few paragraphs right, and I see entries in the XML like:
After those first few paragraphs it reverts to font="unknown" with size="0.000" and the newline issue appears again.
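To find where a file switches from good to bad entries, a rough sketch like the following can count the affected characters in the intermediate XML. It assumes the per-character elements are named `text` and carry `font` and `size` attributes, as in the values quoted above; the sample fragment, the `textline` wrapper, the font name, and the function name `font_summary` are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment for illustration only. The attribute values
# font="unknown" and size="0.000" are the ones reported in this issue;
# everything else (wrapper element, font name, sizes) is assumed.
SAMPLE = """<textline>
<text font="Times-Roman" size="9.960">A</text>
<text font="unknown" size="0.000">U</text>
<text font="unknown" size="0.000">.</text>
<text>
</text>
</textline>"""


def font_summary(xml_text):
    """Count characters with usable font info vs. degenerate ones."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.iter("text"):
        font = node.get("font")
        if font is None:
            continue  # elements without font info (e.g. line breaks)
        # Treat an unknown font or a zero size as degenerate.
        degenerate = font == "unknown" or float(node.get("size", "0")) == 0.0
        counts["degenerate" if degenerate else "ok"] += 1
    return counts


if __name__ == "__main__":
    print(dict(font_summary(SAMPLE)))
```

Running this per page rather than per file should show the point at which the output starts to degrade.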
The two example files are linked below.
https://www.dropbox.com/s/cyu0vee584odhbh/Benston_Determinants-of-bid-asked-spreads-in-the-over-the-counter-market_1974.pdf?dl=0
https://www.dropbox.com/s/pqqu3aeu96pduai/Mayers_Why-firms-issue-convertible-bonds-The-matching-of-financial-and-real-investment-options_1998.pdf?dl=0