euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

pdf2txt.py get (cid:%d) unknown char #102

Open yu-liang-kono opened 9 years ago

yu-liang-kono commented 9 years ago

I'm trying to extract text from a Japanese-language PDF with the following command:

python pdf2txt.py -p 1 -o 1.xml -t xml -V -A pdf

The output XML file contains lots of (cid:%d) unknown characters, so I traced the source code to see what happens. When PDFCIDFont cannot find the cid in its unicode_map, it raises an exception, and that is why I see (cid:%d).

In pdffont.py, there is a piece of code that builds the unicode_map.

self.unicode_map = CMapDB.get_unicode_map(self.cidcoding, self.cmap.is_vertical())

When I force it to vertical mode, the (cid:%d) unknown character problem is solved.

self.unicode_map = CMapDB.get_unicode_map(self.cidcoding, True)

I have no idea how this parameter works, and I'm not sure what the side effects of setting it to True are. Any ideas?

jaspoor commented 9 years ago

Did you manage to find a solution? I am having the same extraction issue with a Dutch-language PDF file. Explicitly setting the vertical mode to True didn't work for me, unfortunately.

yu-liang-kono commented 9 years ago

I ran into this issue with Japanese PDFs. The solution I ended up with was to build a cmap myself from Adobe's spec and post-process the output of pdf2txt.py. I don't think it is a universal solution, but it works for me.
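The post-processing step described above could look roughly like the sketch below. It assumes a hypothetical CID_TO_UNICODE table that you would fill in from the Adobe CMap resources for the font's character collection. The two entries shown are placeholders rather than real Adobe values, and replace_cids is an illustrative name, not part of pdfminer.

```python
import re

# Hypothetical table: CID -> Unicode character. Placeholder values only;
# a real table would be built from the Adobe CMap resources that match
# the font's character collection.
CID_TO_UNICODE = {
    1001: "\u3042",  # placeholder entry (HIRAGANA LETTER A)
    1002: "\u3044",  # placeholder entry (HIRAGANA LETTER I)
}

# Matches the "(cid:N)" tokens that pdf2txt.py emits for undefined characters.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, table=CID_TO_UNICODE):
    """Replace each (cid:N) token with table[N]; leave unknown CIDs untouched."""
    def _sub(match):
        cid = int(match.group(1))
        return table.get(cid, match.group(0))
    return CID_TOKEN.sub(_sub, text)
```

CIDs missing from the table are left as (cid:N) so that gaps in the mapping stay visible in the output instead of being silently replaced with a wrong character.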

rkargon commented 9 years ago

While extracting text from some PDFs that contain only Latin (English) characters, I am still getting (cid:###) output. For instance, I get (cid:32)(cid:76)(cid:97)(cid:116)(cid:101) (cid:80)(cid:97)(cid:121)(cid:109)(cid:101)(cid:110)(cid:116) (cid:67)(cid:104)(cid:97)(cid:114)(cid:103)(cid:101) (cid:79)(cid:110) (cid:71)(cid:97)(cid:115), and when I convert each number to ASCII I get "Late Payment Charge On Gas", which is the correct output.

I ended up solving the problem by creating a class that inherits from TextConverter and overrides handle_undefined_char(self, font, cid) so that it converts the cid to ASCII and returns it. Is there an easier way to avoid (cid:###) output if I know the PDF contains only ASCII characters? Using the -c flag in pdf2txt.py doesn't seem to help.
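A minimal sketch of that override, written as a mixin so it can be read and tested without pdfminer installed. AsciiFallbackMixin is an illustrative name, and the restriction to the printable ASCII range is an assumption added here, not something stated in the comment above.

```python
class AsciiFallbackMixin:
    """Overrides handle_undefined_char for a pdfminer converter.

    Assumption: an undefined CID in the printable ASCII range (32..126)
    is really an ASCII code. This holds only for some PDFs.
    """

    def handle_undefined_char(self, font, cid):
        if 32 <= cid <= 126:
            return chr(cid)
        # Outside the assumed range, keep pdfminer's usual placeholder.
        return "(cid:%d)" % cid


# With pdfminer installed, the mixin could be combined with TextConverter:
#   from pdfminer.converter import TextConverter
#   class AsciiTextConverter(AsciiFallbackMixin, TextConverter):
#       pass
```

Listing the mixin first in the base classes means its handle_undefined_char takes precedence over the one TextConverter inherits.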

lucanaso commented 8 years ago

This is not a solution, but you might find useful information in this comment.

Mahendra114027 commented 5 years ago

I am using this workaround in Python (a temporary fix so I can continue my work) for the issue mentioned above:

def cid_to_char(text_str):
    # Example input: text_str = '(cid:76)'
    if 'cid' in text_str.lower():
        text_str = text_str.strip('(')
        text_str = text_str.strip(')')
        ascii_num = text_str.split(':')[-1]
        ascii_num = int(ascii_num)
        text_val = chr(ascii_num)
        return text_val
    return text_str

If anyone has a better way of solving this, kindly share it here. Thanks 👍

royjohal commented 4 years ago

This is hardly a fix. The mapping text_val = chr(ascii_num) is a false assumption.

Mahendra114027 commented 4 years ago

This works for my use case, so I am using it for now. I do agree it's hardly a fix, as it doesn't address the root problem.

Also, can you share some instances where this doesn't work?