euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

What's wrong with this pdf to text conversion? #241

Open echan00 opened 5 years ago

echan00 commented 5 years ago

I've been using the pdf2txt tool to convert many English and Chinese PDFs to TXT format. A number of files do not convert as expected. For example:

PDF file to be converted: 0.pdf
Resulting TXT file: 0.txt

I would be super grateful if someone could tell me what is wrong or point me toward a fix.

echan00 commented 5 years ago

If it helps, here is info about the fonts in the PDF:

name                                         type              emb sub uni prob object ID
-------------------------------------------- ----------------- --- --- --- ---- ---------
ALPMFJ+Times-Roman                           Type 1C           yes yes yes         243  0
ALPMJK+Times-Italic                          Type 1C           yes yes yes         246  0
MHei-Bold-ETen-B5-H-Identity-H               CID Type 0C       yes no  no          249  0
MSung-Light-ETen-B5-H-Identity-H             CID Type 0C       yes no  no          252  0
Microsoft JhengHei,Bold-ETen-B5-H-Identity-H CID Type 0C       yes no  no   X      857  0
Microsoft JhengHei-ETen-B5-H-Identity-H      CID Type 0C       yes no  no   X      852  0
AMACHH+TimesNewRomanPSMT                     Type 1C           yes yes no          866  0
AMACNG+ArialMT                               Type 1C           yes yes no          861  0
AMADCF+TimesNewRomanPS-BoldMT                Type 1C           yes yes no          859  0
PMingLiU-ETen-B5-H-Identity-H                CID Type 0C       yes no  no   X      865  0
AMADJE+TimesNewRomanPS-ItalicMT              Type 1C           yes yes no          868  0
MS Mincho-KSCms-UHC-H-Identity-H             CID Type 0C       yes no  no   X      869  0
AMCGMO+Calibri,Italic                        Type 1C           yes yes yes         872  0
AMCPNE+Calibri,Bold                          Type 1C           yes yes yes         873  0
Microsoft YaHei,Bold-GBK-EUC-H-Identity-H    CID Type 0C       yes no  no   X      876  0
Microsoft YaHei-GBK-EUC-H-Identity-H         CID Type 0C       yes no  no   X      877  0
AMDONF+Calibri                               Type 1C           yes yes yes         878  0
AMDPFD+Symbol                                Type 1C           yes yes no   X      879  0
AMEAHN+Arial                                 Type 1C           yes yes yes         882  0
SimSun-GBK-EUC-H-Identity-H                  CID Type 0C       yes no  no   X      887  0
AMECKA+Wingdings                             Type 1C           yes yes no          888  0
MingLiU-ETen-B5-H-Identity-H                 CID Type 0C       yes no  no   X      892  0
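
For anyone triaging a listing like the one above: in pdffonts output, the uni column reports whether a font carries a ToUnicode map. Below is a rough sketch, assuming the column layout shown above, that lists the fonts whose uni column is "no". The function name `fonts_without_tounicode` and the regex are illustrative assumptions, not part of pdfminer or pdffonts. Whether missing ToUnicode maps actually explain the bad conversion in this issue is not established here.

```python
import re

# One data row of a pdffonts-style listing (layout assumed from the table above):
# name, type, emb, sub, uni, optional "X" in the prob column, object number, generation.
ROW = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<type>(?:CID )?(?:Type \w+|TrueType)(?: \(OT\))?)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\s+"
    r"(?P<prob>X\s+)?(?P<obj>\d+)\s+(?P<gen>\d+)\s*$"
)


def fonts_without_tounicode(report: str) -> list[str]:
    """Return font names whose 'uni' column is 'no' (hypothetical helper)."""
    names = []
    for line in report.splitlines():
        match = ROW.match(line)
        # Header and separator lines do not match the pattern and are skipped.
        if match and match.group("uni") == "no":
            names.append(match.group("name"))
    return names


if __name__ == "__main__":
    # Two rows copied from the listing above, used only as sample input.
    sample = (
        "ALPMFJ+Times-Roman                   Type 1C           yes yes yes         243  0\n"
        "SimSun-GBK-EUC-H-Identity-H          CID Type 0C       yes no  no   X      887  0\n"
    )
    for name in fonts_without_tounicode(sample):
        print(name)
```

Font names may contain spaces (for example "MS Mincho-..."), so the sketch anchors on the type and yes/no columns rather than splitting on whitespace.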