Open yu-liang-kono opened 9 years ago
Did you manage to find a solution? I am having the same extraction issue with a Dutch-language PDF file. Unfortunately, explicitly setting the vertical mode to True didn't work for me.
I have this issue when dealing with Japanese PDFs. The solution I ended up with is to build a cmap myself from Adobe's spec and post-process the output of pdf2txt.py. I don't think it is a universal solution, but it works for me.
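A post-processing step of this kind might look roughly like the sketch below. It is only an illustration, not the poster's actual code: CID_TO_UNICODE and replace_cids are hypothetical names, and the two table entries are placeholders rather than values taken from Adobe's specification. A real table would have to be built from the mapping data for the font's character collection.

```python
import re

# Hypothetical, hand-built CID -> Unicode table. The entries are placeholders
# for illustration only; real values must come from the mapping data of the
# character collection the font uses.
CID_TO_UNICODE = {
    1001: "\u3042",  # placeholder entry
    1002: "\u3044",  # placeholder entry
}

# Matches the "(cid:123)" placeholders that appear in the extracted text.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, table):
    """Replace each (cid:N) token with table[N]; leave unknown tokens as-is."""
    def lookup(match):
        cid = int(match.group(1))
        return table.get(cid, match.group(0))
    return CID_TOKEN.sub(lookup, text)
```

Tokens without a table entry are left untouched, so anything still unmapped stays visible in the output instead of being silently replaced by a wrong character.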
While extracting text from some PDFs that contain only Latin (English) characters, I am still getting (cid:###) outputs.
For instance, I get (cid:32)(cid:76)(cid:97)(cid:116)(cid:101) (cid:80)(cid:97)(cid:121)(cid:109)(cid:101)(cid:110)(cid:116) (cid:67)(cid:104)(cid:97)(cid:114)(cid:103)(cid:101) (cid:79)(cid:110) (cid:71)(cid:97)(cid:115), and when I convert each number to ASCII I get "Late Payment Charge On Gas", which is the correct output.
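The conversion described above can be reproduced with a short script. This is a minimal sketch that assumes, as in this particular PDF, that each CID happens to equal the ASCII code of the character; decode_ascii_cids is a hypothetical helper name, and the assumption does not hold for fonts in general.

```python
import re

# Matches the "(cid:123)" placeholders that appear in the extracted text.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_ascii_cids(text):
    """Replace (cid:N) with chr(N) when N is printable ASCII (32..126)."""
    def to_char(match):
        code = int(match.group(1))
        if 32 <= code <= 126:
            return chr(code)
        # Outside printable ASCII: keep the token rather than guess.
        return match.group(0)
    return CID_TOKEN.sub(to_char, text)


sample = (
    "(cid:32)(cid:76)(cid:97)(cid:116)(cid:101) "
    "(cid:80)(cid:97)(cid:121)(cid:109)(cid:101)(cid:110)(cid:116)"
)
print(repr(decode_ascii_cids(sample)))  # ' Late Payment'
```

Restricting the conversion to the printable ASCII range keeps the script from emitting control characters or unrelated code points when a CID falls outside that range.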
I ended up solving the problem by creating a class that inherits from TextConverter and overrides handle_undefined_char(self, font, cid) so that it converts the cid to ASCII and returns it. Is there an easier way to avoid the (cid:###) output if I know the PDF contains ASCII characters?
Using the -c flag in pdf2txt.py doesn't seem to help.
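The subclassing approach described above could be sketched as follows. The override is written as a mix-in so the snippet runs without pdfminer installed; AsciiFallbackMixin and AsciiTextConverter are hypothetical names, and the sketch assumes the converter calls handle_undefined_char(font, cid) and uses the returned string as the character's text, as the comment describes.

```python
class AsciiFallbackMixin:
    """Mix-in that turns an unmapped CID into chr(cid) when it is printable ASCII."""

    def handle_undefined_char(self, font, cid):
        # Assumption: the CID happens to equal the ASCII code of the character.
        if 32 <= cid <= 126:
            return chr(cid)
        # Otherwise keep the placeholder format seen in the extracted output.
        return "(cid:%d)" % cid


# Intended use (requires pdfminer, so it is left as a comment here):
#   from pdfminer.converter import TextConverter
#
#   class AsciiTextConverter(AsciiFallbackMixin, TextConverter):
#       pass
```

Listing the mix-in before TextConverter in the base classes makes its handle_undefined_char take precedence in the method resolution order.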
This is not a solution, but you might find useful information in this comment.
I am using this workaround (a temporary fix so I can continue my work) in Python for the issue mentioned above:

    def cid_to_char(text_str):
        # Turn a single '(cid:76)' token into a character, assuming the
        # number is an ASCII code; anything else is returned unchanged.
        if 'cid' in text_str.lower():
            ascii_num = int(text_str.strip('()').split(':')[-1])
            return chr(ascii_num)
        return text_str

    text_val = cid_to_char('(cid:76)')  # 'L'

If anyone has a better way of solving this, kindly share it here. Thanks 👍
This is hardly a fix. The mapping text_val = chr(ascii_num) is a false assumption.
This works for my use case, so I am using it for now. I agree it's hardly a fix, since it doesn't address the root problem.
Also, can you share some instances where this doesn't work?
I'm trying to extract text from a PDF that is in Japanese, using the following command.
The output XML file contains lots of (cid:%d) unknown characters, so I traced the source code to see what happened. When PDFCIDFont cannot find the cid in its unicode_map, it raises an exception, and that's why I see the (cid:%d).
In pdffont.py, there is a piece of code that builds the unicode_map.
When I force it to vertical mode, the (cid:%d) unknown character problem is solved. I have no idea how the parameter works, and I'm not sure what the side effects are if I set it to True. Any idea?