JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

pdfrw now decodes Unicode strings in Python 3 #11

Closed by tridemax 5 years ago

tridemax commented 6 years ago

Since pdfrw now uses Unicode for PdfString (https://github.com/pmaupin/pdfrw/commit/d8a9292ad651dfdfc674f38121198cf1bc10240d), pdf-redactor fails with the following error on the new version:

"pdf_redactor.py", line 676, in toUnicode
    string = string.encode("Latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 4-7: ordinal not in range(256)
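
The same error can be reproduced outside pdf-redactor. The snippet below is a minimal sketch, assuming the value reaching `toUnicode` is already a decoded `str` containing characters above U+00FF. The sample string and the helper name `encode_latin1_or_none` are hypothetical and not part of pdf-redactor or pdfrw.

```python
def encode_latin1_or_none(text):
    """Return text encoded as Latin-1, or None if it cannot be encoded."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        # Raised when any character has a code point above 255.
        return None


# Hypothetical stand-in for a string that pdfrw has already decoded.
decoded = "Name\u4e2d\u6587\u5b57\u7b26"

print(encode_latin1_or_none("caf\u00e9"))  # every character fits in Latin-1
print(encode_latin1_or_none(decoded))      # characters 4-7 do not fit
```

Latin-1 only covers code points 0 to 255, so the old round trip through `encode("Latin-1")` stops working once the library hands back real Unicode text.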

JoshData commented 6 years ago

Thanks. That should simplify things. But it will take some testing to update this library. Would be glad to have some help.

tridemax commented 6 years ago

No problem. I tried to fix it right away but ran into some issues, probably because of my loose understanding of how Unicode handling differs between Python 2 and 3. Testing and fixing should be a bit easier if you can point me in a general direction.

JoshData commented 6 years ago

There are a few places where I had to work around weird Latin-1 encoding:

https://github.com/JoshData/pdf-redactor/search?utf8=%E2%9C%93&q=latin&type=

Those parts might no longer be necessary, which would be great.

Other than that, offhand I don't really know. :)
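
One possible direction, offered only as a sketch: a small helper that accepts either form of string, so code can tolerate both the older pdfrw behavior (Latin-1 bytes) and the newer one (already-decoded text). The name `to_text` is hypothetical and does not exist in pdf-redactor or pdfrw.

```python
def to_text(value, encoding="latin-1"):
    """Return value as str, decoding it only if it is still bytes."""
    if isinstance(value, bytes):
        # Latin-1 maps every byte 0-255 to a code point, so this never fails.
        return value.decode(encoding)
    return value


print(to_text(b"caf\xe9"))   # bytes are decoded
print(to_text("\u4e2d\u6587"))  # str passes through unchanged
```

Whether the workarounds can be dropped entirely rather than guarded like this depends on which pdfrw versions need to be supported.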

anilkumar-pcs commented 6 years ago

Hey @JoshData

I am getting an error when running example.py with the test PDF, /tests/test-ssns.pdf. Error:

Traceback (most recent call last):
  File ".\example.py", line 47, in <module>
    pdf_redactor.redactor(options)
  File "E:\BankerBay\Python\PDFEdit\pdf-redactor-master\pdf_redactor.py", line 110, in redactor
    text_layer = build_text_layer(document, options)
  File "E:\BankerBay\Python\PDFEdit\pdf-redactor-master\pdf_redactor.py", line 488, in build_text_layer
    prev_token[i] = make_mutable_string_token(prev_token[i])
  File "E:\BankerBay\Python\PDFEdit\pdf-redactor-master\pdf_redactor.py", line 460, in make_mutable_string_token
    token = TextToken(token.decode(), current_font)
  File "E:\BankerBay\Python\PDFEdit\pdf-redactor-master\pdf_redactor.py", line 410, in __init__
    self.original_value = toUnicode(value, font, fontcache)
  File "E:\BankerBay\Python\PDFEdit\pdf-redactor-master\pdf_redactor.py", line 699, in toUnicode
    fontcache[font.ToUnicode.stream] = CMap(font.ToUnicode)
  File "E:\BankerBay\Python\PDFEdit\pdf-redactor-master\pdf_redactor.py", line 639, in __init__
    add_mapping(code_to_int(code), char)
  File "E:\BankerBay\Python\PDFEdit\pdf-redactor-master\pdf_redactor.py", line 560, in add_mapping
    code = bytes([code])
ValueError: bytes must be in range(0, 256)

Can you please help me fix this?
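
The final `ValueError` can be reproduced in isolation: `bytes([code])` only accepts integers from 0 to 255, so any character code of 256 or more fails. The sketch below shows one way to encode larger codes as multiple bytes. The helper `code_to_bytes` is hypothetical, it is not the fix that was later pushed to pdf-redactor, and it assumes a minimal-length big-endian encoding, which may not match the byte width a real CMap declares.

```python
def code_to_bytes(code):
    """Encode a non-negative integer code as big-endian bytes (assumed layout)."""
    # Use at least one byte so that code 0 becomes b"\x00" rather than b"".
    length = max(1, (code.bit_length() + 7) // 8)
    return code.to_bytes(length, "big")


def single_byte_or_none(code):
    """Mimic bytes([code]); return None where it would raise ValueError."""
    try:
        return bytes([code])
    except ValueError:
        return None


print(single_byte_or_none(0x41))    # fits in one byte
print(single_byte_or_none(0x0102))  # too large for bytes([code])
print(code_to_bytes(0x0102))        # encoded as two bytes instead
```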

Thanks

JoshData commented 6 years ago

Unfortunately I'm not going to have time to look into it for at least a few weeks, sorry.

JoshData commented 6 years ago

I've pushed a fix for using pdfrw 0.4. Let me know if it solves your problem!

Thanks for reporting the issue.