euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Still have issues with CID Characters #39

Open JSB97 opened 10 years ago

JSB97 commented 10 years ago

I am trying to extract information from this file: http://www.kantei.go.jp/jp/singi/tiiki/siryou/pdf/h25yosan2.pdf

Following the example code on the pdfminer website, I put together this simple script, which tries to extract text using the LTTextBoxHorizontal class. The output I get is

(cid:5561)(cid:6210)(cid:18446)(cid:18449)(cid:5562)(cid:2979)(cid:10220)(cid:6715)(cid:5587)(cid:7244)(cid:18171)(cid:9490)(cid:18202)(cid:13240)(cid:18190)(cid:18204)(cid:18159)(cid:4485)(cid:4582)(cid:8049)(cid:5878)(cid:3820)(cid:6795)(cid:10183)

and not the Japanese Unicode characters. I get similar results when using the pdf2txt.py tool.

Could someone suggest what I should do to resolve this? Thank you in advance.

Code

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LTTextBoxHorizontal
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

# Open a PDF file.
fp = open('/Users/Documents/h25yosan2.pdf', 'rb')
password=''
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
document = PDFDocument(parser)
# Supply the password for initialization.
# (If no password is set, give an empty string.)
document.initialize(password)
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # receive the LTPage object for the page.
    layout = device.get_result()
    # Iterate over the layout objects and print each horizontal text box.
    for obj in layout:
        if isinstance(obj, LTTextBoxHorizontal):
            print "text box: %s" % obj.get_text().encode('utf-8')
euske commented 10 years ago

Your sample code worked fine for me. You just need the CMap data for extracting non-ASCII text (e.g. Japanese). Try running $ make cmap in the pdfminer directory. If you already have an old version of the cmap data, delete it and rebuild. I think at one point there was an issue in the cmap generation, and that might be the cause here. Sorry for not being clear about this.

JSB97 commented 10 years ago

Thank you @euske, but I'm still not having any luck with this. I deleted the old pdfminer/cmap folder and ran $ make cmap

again, which gives output like: writing: 'pdfminer/cmap/KSCpc-EUC-H.pickle.gz'... writing: 'pdfminer/cmap/UniKS-UTF16-V.pickle.gz'...

etc. I then ran $ ./pdfminer/tools/pdf2txt.py -p 1 -o /Users/Documents/h25yosan2.html /Users/Documents/h25yosan2.pdf but I still get only CIDs and no Japanese text.

Is there anything else you can point me to that might help resolve this? FYI, the version of Python I am using is below: $ python Python 2.6.6 (r266:84374, Aug 31 2010, 11:00:51) [GCC 4.0.1 (Apple Inc. build 5493)] on darwin

euske commented 10 years ago

What's in your $PYTHONPATH variable? There could be a previous version of pdfminer on your system that is responding incorrectly.

JSB97 commented 10 years ago

I checked this; there were older versions of pdfminer that I had to remove, and I reinstalled from scratch. I think I need to debug this by going into the source code… can you suggest where I might start?

tastyminerals commented 10 years ago

Hi, I am having the same issue with my PDF file.

The examination of news coverage related to the accession of (cid:51)(cid:82)(cid:79)(cid:68)(cid:81)(cid:71)(cid:15)...

I tried to convert the same document with the poppler pdftotext tool and got this: The examination of news coverage related to the accession of 3RODQG &]HFK 5HSXEOLF %XOJDULD

So it looks like something is wrong with the PDF itself. However, I've noticed numerous reports of this behavior from other users, so I wonder if there is a reliable workaround. I have pdfminer installed on a Linux machine via pip.
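Interestingly, the two outputs above are consistent: each (cid:N) value is exactly 29 less than the ASCII code of the intended letter (51 + 29 = 80 = 'P', spelling "Poland"), which matches poppler's shifted "3RODQG &]HFK". For a font with this kind of shifted glyph encoding, a throwaway decoder can be sketched as below; note that the offset of 29 is specific to this particular document's font and would have to be rediscovered for any other PDF.

```python
import re

def decode_cids(text, offset=29):
    """Replace (cid:N) tokens with chr(N + offset).

    offset=29 happens to fit the font in this particular PDF;
    other fonts use entirely different glyph-ID orderings.
    """
    return re.sub(r"\(cid:(\d+)\)",
                  lambda m: chr(int(m.group(1)) + offset),
                  text)

print(decode_cids("(cid:51)(cid:82)(cid:79)(cid:68)(cid:81)(cid:71)"))  # Poland
```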

euske commented 10 years ago

I don't think there's any reliable workaround. The thing is that not everything that looks like text in a PDF can be easily converted to actual text. The prime example would be math symbols that are rendered using math fonts. Converting PDF to text is always a "best effort" approach in the end.

tastyminerals commented 10 years ago

I see; that's what I thought. I think the best one can do is use an Acrobat Pro trial to manually extract the problematic text or, as a last resort, tesseract-ocr. PDF is evil.

deanmalmgren commented 10 years ago

I, too, am having an issue with CID font codes. This PDF, for example, has a line ("metacities — defined by UN Habitat as cities with more than 10") that is parsed as a bunch of (cid:%d) characters instead of the actual characters when running pdf2txt.py. Here is the pdf2txt.py output from the 20140328 release after running make cmap.

The font in this PDF is apparently an Adobe font whose CMap isn't included in your database. How can I extend PDFMiner to handle this situation?

deanmalmgren commented 10 years ago

On further inspection, it looks like my issue with the CID font codes is likely caused by ligatures: when the letters f and i appear next to each other, they can be combined into a single glyph. What would be the best way to address this in pdfminer?

euske commented 10 years ago

Sadly, there's no standard way to address this issue, because the way ligatures are handled is PDF-specific. Unicode actually has characters for ligatures, but they exist only for backward compatibility, and there's no guarantee that a PDF uses them. Sometimes a ligature is rendered using a special embedded font whose information is not available to PDFMiner.
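In the lucky case where the extractor does map a ligature glyph to one of Unicode's compatibility characters (e.g. U+FB01 LATIN SMALL LIGATURE FI), it can be expanded afterwards with NFKC normalization; a minimal sketch:

```python
import unicodedata

def expand_ligatures(text):
    # NFKC compatibility normalization decomposes presentation forms
    # such as U+FB01 (fi) and U+FB03 (ffi) into plain letters.
    return unicodedata.normalize("NFKC", text)

print(expand_ligatures("e\ufb03cient"))  # efficient
```

This only helps when the PDF's ToUnicode mapping actually yields the ligature codepoint; when the ligature lives in an embedded font with no mapping, it still comes out as (cid:N).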

deanmalmgren commented 10 years ago

@euske thanks for the heads up. I've got a hacky workaround to deal with the rest of the line that appears to work across a few different PDFs. If it ends up being something useful that can systematically decode the majority of these lines (with the exception of the ligatures), is there a good place I can stick it in the source code?

VladimirStarostenkov commented 6 years ago

@deanmalmgren did you manage to publish the workaround?

deanmalmgren commented 6 years ago

This was ages ago. Forgive me, but I don't think I published a workaround, and I don't recall what I did or which project it was for. Eeeee... not very helpful, I'm afraid, @VladimirStarostenkov

VladimirStarostenkov commented 6 years ago

@deanmalmgren no worries :) What about textract? It pushed me to the idea that one can do "pdf -> image -> tesseract -> text", which is a kind of neural-network brute-force workaround...

mooncrater31 commented 6 years ago

@VladimirStarostenkov Why would one use OCR instead of directly extracting text from a PDF? I'm using Tesseract on a project and am trying to get rid of it for PDFs that have extractable text, because cropping out the meaningful pieces and OCR'ing them is slow, and the OCR is sometimes inaccurate.

So, is the extraction process so unreliable that we have to depend on OCR technology?

VladimirStarostenkov commented 6 years ago

@mooncrater31 The short answer would be "yes, for some documents it really is!" In our project we decided to skip documents where cid characters become dominant, and to skip plenty of other documents as well. In the PDF collection we have, there are dozens of weird issues. Just one example: each page contains not only the text on the current page, but also all the text from the next page (printed outside the visible page area), and pdfminer does not handle it.

PDF is evil. One has to do lots of sanity checks, deduplication, spell checking, etc. to get healthy textual output. Even then, we are not immune to the creativity of document creators: a "fancy" page layout can completely destroy word and even character order (and, unfortunately, out-of-the-box OCR won't help here).
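The "skip documents where cid characters dominate" check could be sketched as a simple ratio over the extracted text; the 20% threshold below is an arbitrary illustration, not a value taken from that project:

```python
import re

CID_TOKEN = re.compile(r"\(cid:\d+\)")

def cid_ratio(text):
    """Fraction of glyphs that came out as unresolved (cid:N) tokens."""
    n_cids = len(CID_TOKEN.findall(text))
    n_plain = len(CID_TOKEN.sub("", text).replace(" ", ""))
    total = n_cids + n_plain
    return n_cids / total if total else 0.0

def worth_keeping(text, threshold=0.2):
    # Skip documents whose extraction is dominated by cid tokens.
    return cid_ratio(text) < threshold
```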

wanghaisheng commented 6 years ago

Perhaps pdf.js may be helpful.

raresmosescu commented 3 years ago

I'm having the same issue, and I think the only way to minimize errors is to extract using both pdfminer and OCR; then, when a cid character shows up in the pdfminer output, you cross-check it against the OCR. This only works with alphanumeric characters.
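A naive word-level version of that cross-check might look like the sketch below. It assumes (my assumption, not something established in this thread) that pdfminer and the OCR engine split the line into the same number of words, which real layouts often violate:

```python
def patch_with_ocr(miner_line, ocr_line):
    """Fall back to the OCR word wherever pdfminer emitted (cid:N) tokens.

    Naive: aligns by word position, so it only works when pdfminer
    and the OCR engine tokenize the line identically.
    """
    ocr_words = ocr_line.split()
    out = []
    for i, word in enumerate(miner_line.split()):
        if "(cid:" in word and i < len(ocr_words):
            out.append(ocr_words[i])  # trust the OCR for this word
        else:
            out.append(word)          # trust pdfminer otherwise
    return " ".join(out)
```

A more robust variant would align the two outputs with a sequence matcher (e.g. difflib) instead of by raw word index.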