Open JSB97 opened 10 years ago
Your sample code worked fine for me. You just need CMap to extract non-ASCII text (e.g. Japanese). Try running $ make cmap in the pdfminer directory. If you already have an old version of cmap, delete and rebuild it. I think at one point there was some issue in cmap generation, and that might be the cause. Sorry for not being clear about this.
Thank you Eusuke, but I am still not having luck with this. I've deleted the old pdfminer/cmap folder and ran $ make cmap again, which gives this sort of output: writing: 'pdfminer/cmap/KSCpc-EUC-H.pickle.gz'... writing: 'pdfminer/cmap/UniKS-UTF16-V.pickle.gz'... etc. I then run the following: $ ./pdfminer/tools/pdf2txt.py -p 1 -o /Users/Documents/h25yosan2.html /Users/Documents/h25yosan2.pdf but I still get only CIDs and no Japanese text.
Is there anything else you can point to that would help resolve this? FYI, the version of Python I am using is shown below: $ python Python 2.6.6 (r266:84374, Aug 31 2010, 11:00:51) [GCC 4.0.1 (Apple Inc. build 5493)] on darwin
What's in your $PYTHONPATH variable? There could be a previous version of pdfminer in your system which is responding incorrectly.
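One way to check for a stale copy is to list every place on the import path where a package with that name exists. The following is a minimal sketch, not part of pdfminer: the helper name `find_module_copies` is hypothetical, and it is written for Python 3 rather than the Python 2.6 used in this thread.

```python
import os
import sys


def find_module_copies(name, search_path=None):
    """Return every path on the import search path that holds a package
    directory or a .py file called `name` (hypothetical helper)."""
    paths = sys.path if search_path is None else search_path
    hits = []
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        base = entry or os.getcwd()
        for candidate in (os.path.join(base, name),
                          os.path.join(base, name + ".py")):
            if os.path.exists(candidate) and candidate not in hits:
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    # More than one line of output suggests an older copy may shadow the new one.
    for location in find_module_copies("pdfminer"):
        print(location)
```

The first entry printed is the one Python would normally import, since `sys.path` is searched in order.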
I checked this, there were older versions of pdfminer that I had to remove, and reinstalled from scratch. I think I need to debug this going into the source code… can you provide any suggestions where I might start?
Hi, I am having the same issue with my pdf file.
The examination of news coverage related to the accession of (cid:51)(cid:82)(cid:79)(cid:68)(cid:81)(cid:71)(cid:15)...
I tried to convert the same document with the poppler pdftotext tool and also got this:
The examination of news coverage related to the accession of 3RODQG &]HFK 5HSXEOLF %XOJDULD
So it looks like something is wrong with the PDF itself. However, I've noticed numerous reports of such behavior from other users. I wonder if there is a reliable workaround. I have pdfminer installed on a Linux machine via pip.
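For this particular sample there seems to be a pattern: comparing the pdftotext output with the words it should contain (3RODQG where "Poland" is expected), every character code looks shifted by 29. The sketch below assumes that fixed offset. The offset is read off this one document and will not hold for PDFs in general, and the function names are hypothetical, not pdfminer APIs.

```python
import re

# Matches the "(cid:NN)" placeholders that pdf2txt.py emits.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cids(text, offset=29):
    """Replace each (cid:N) token with chr(N + offset).

    The default offset of 29 is an assumption inferred from this sample only.
    """
    return CID_TOKEN.sub(lambda m: chr(int(m.group(1)) + offset), text)


def shift_text(text, offset=29):
    """Apply the same shift to already-garbled characters (e.g. pdftotext
    output), leaving whitespace untouched."""
    return "".join(ch if ch.isspace() else chr(ord(ch) + offset) for ch in text)


if __name__ == "__main__":
    print(decode_cids("(cid:51)(cid:82)(cid:79)(cid:68)(cid:81)(cid:71)(cid:15)"))
    print(shift_text("3RODQG &]HFK 5HSXEOLF %XOJDULD"))
```

If the decoded text is still gibberish for another document, the font probably uses a different or non-uniform mapping, and a fixed shift will not help.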
I don't think there's any reliable workaround. The thing is that not everything that looks like text in a PDF can be easily converted to actual text. The prime example would be math symbols that are rendered using math fonts. Converting PDF to text is always a "best effort" approach in the end.
I see, as I thought. I think the best one can do is use the Acrobat Pro trial version to manually extract problematic text or, as a last resort, tesseract-ocr. PDF is evil.
I, too, am having an issue with CID font codes. This PDF, for example, has a line ("metacities — defined by UN Habitat as cities with more than 10") that is parsed as a bunch of (cid:%d) characters instead of the actual characters when running pdf2txt.py. Here is the pdf2txt.py output from the 20140328 release after running make cmap.
The font in this PDF is apparently an Adobe font whose CMap isn't included in your database. How can I extend PDFMiner to handle this situation?
On further inspection, it looks like my issue with the CID font codes is likely caused by ligatures: when the letters f and i appear next to each other, they can be combined into a single glyph. What would be the best way to address this challenge in pdfminer?
Sadly there's no standard way to address this issue, because the way ligatures are handled is PDF-specific. Unicode actually has characters for ligatures, but they exist only for backward compatibility and there's no guarantee that a PDF uses them. Sometimes a ligature is rendered using a special embedded font, whose information is not available to PDFMiner.
@euske thanks for the heads up. I've got a hacky workaround to deal with the rest of the line that appears to work across a few different PDFs. If it ends up being something useful that can systematically decode the majority of these lines (with the exception of the ligatures), is there a good place I can stick it in the source code?
@deanmalmgren did you manage to publish the workaround?
This was ages ago. Forgive me but I don't think I published a workaround and I don't recall what I did to do it or which project it was. eeeee... not very helpful I'm afraid, @VladimirStarostenkov
@deanmalmgren no worries :) What about textract? It pushed me to the idea that one can do "pdf -> image -> tesseract -> text" Which is a kind of neural network brute force workaround...
@VladimirStarostenkov Why would one use OCR instead of directly extracting text from a PDF? I'm using Tesseract on a project, and am trying to get rid of it for the PDFs which have extractable text, because cropping out the meaningful pieces and OCR'ing them is slow, and the OCR is sometimes inaccurate.
So, is the extraction process so unreliable that we have to depend on OCR technology?
@mooncrater31 The short answer would be "yes, for some documents it really is!" In our project we decided to skip documents where cid chars dominate, and to skip plenty of other docs too. The PDF collection we have contains dozens of weird issues. Just one example: each page contains not only the text of the current page, but also all the text from the next page (printed outside of the visible page). pdfminer does not handle it. PDF is evil. One has to do lots of sanity checks, deduplication, spell checking etc. to get a healthy textual output! Even then, we are not immune from the creativity of the document creators, as a "fancy" page layout can completely destroy word and even character order (unfortunately, out-of-the-box OCR won't help here).
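The "skip documents where cid chars dominate" rule can be expressed as a simple ratio check. This is a minimal sketch under assumptions: the function names and the 0.3 threshold are hypothetical and would need tuning on a real collection.

```python
import re

# Matches the "(cid:NN)" placeholders that pdf2txt.py emits.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Fraction of non-whitespace glyphs that came out as (cid:N) tokens."""
    n_cids = len(CID_TOKEN.findall(text))
    remainder = CID_TOKEN.sub("", text)
    n_other = sum(1 for ch in remainder if not ch.isspace())
    total = n_cids + n_other
    return n_cids / total if total else 0.0


def should_skip(text, threshold=0.3):
    """True when unresolved CIDs make up more than `threshold` of the text.

    The default threshold is an arbitrary starting point, not a tested value.
    """
    return cid_ratio(text) > threshold
```

Each token counts as one glyph, so a page of mostly readable text with a few unresolved ligatures stays well below the threshold.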
perhaps pdf.js may be helpful
I'm having the same issue and I think the only way to minimize error is to extract using both pdfminer and ocr and then when a cid character shows up in the pdfminer output, you cross-check with the ocr. This only works with alphanumerical characters.
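The cross-check described above could be sketched roughly as follows, assuming the OCR text for the same line is already available as a string. It aligns the two strings with the standard library's `difflib.SequenceMatcher` and fills a run of (cid:N) tokens only when the aligned OCR text is alphanumeric. The function name is hypothetical, and real pages would need line-by-line alignment first.

```python
import re
from difflib import SequenceMatcher

CID_TOKEN = re.compile(r"\(cid:\d+\)")
PLACEHOLDER = "\x00"  # stands in for one unresolved glyph during alignment


def fill_cids_from_ocr(pdf_text, ocr_text):
    """Fill (cid:N) tokens in pdfminer output using aligned OCR text."""
    tokens = iter(CID_TOKEN.findall(pdf_text))
    masked = CID_TOKEN.sub(PLACEHOLDER, pdf_text)
    matcher = SequenceMatcher(None, masked, ocr_text, autojunk=False)
    out = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        segment = masked[a0:a1]
        candidate = ocr_text[b0:b1]
        only_cids = bool(segment) and set(segment) == {PLACEHOLDER}
        if tag == "replace" and only_cids and candidate.isalnum():
            # Trust the OCR here; consume the tokens it replaces.
            for _ in segment:
                next(tokens)
            out.append(candidate)
        else:
            # Keep the pdfminer text, restoring any original tokens.
            for ch in segment:
                out.append(next(tokens) if ch == PLACEHOLDER else ch)
    return "".join(out)
```

Text that appears only in the OCR output is ignored, and everything pdfminer decoded on its own is kept, so OCR errors can only affect the positions that were unresolved anyway.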
I am trying to extract information from this file; http://www.kantei.go.jp/jp/singi/tiiki/siryou/pdf/h25yosan2.pdf
Following the example code on the pdfminer website, I put together this simple code, which tries to extract text using the LTTextBoxHorizontal class. I get the output as
and not the Japanese unicode characters. I get similar results when using the pdf2txt.py tool.
Could someone suggest what I should do to resolve this? Thank you in advance.