JoshData / pdf-redactor

A general purpose PDF text-layer redaction tool for Python 2/3.
Creative Commons Zero v1.0 Universal

numbers in particular costs with decimal #19

Open einnairo opened 5 years ago

einnairo commented 5 years ago

I am having trouble blanking out costs in formats such as 12.00, 12345.98, or 123.76. The problem is that it also blanks out whole numbers in the PDFs, although not all of them, which I find really strange.

I suspect PDFs may "encode" whole numbers with decimals too, meaning that something displayed as 12 is actually stored as 12.00. Is that the case? Below is my code, adapted from example.py, which I run from the console.

red.py:

#;encoding=utf-8

from pdf_redactor import redactor, RedactorOptions
import re

# set options.

redactor_options = RedactorOptions()

redactor_options.content_filters = [
    (re.compile(u"Cost Price"), lambda m : ""),
    (re.compile(u"Cost"), lambda m : ""),
    (re.compile(u"[0-9]*(.)[0-9]{2}"), lambda m : ""),  # this is my regex for costs with 2 decimals
    (re.compile(u"Value Price"), lambda m : ""),
]
redactor_options.content_replacement_glyphs = ['#', '*', '/', '-']
redactor(redactor_options)

python red.py < a.pdf > anew.pdf

Running the same command with python3 (python3 red.py < a.pdf > anew.pdf) does not work for me.

I would appreciate it if anyone could help.

JoshData commented 5 years ago

I think it's just a problem in your regular expression. In a regular expression, . is a special character that matches any character, so it matches digits too. That means the whole pattern matches 4 digits in a row, in addition to 1 digit + . + 2 digits. To match only the decimal point, escape the dot: \\. (in a regular string you need two backslashes, because you want one actual backslash and backslashes are special in Python strings).
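The difference can be checked outside pdf-redactor with plain strings. The following is only a minimal sketch: the two patterns mirror the shape described above (one digit, a dot, two digits), and the sample strings are made up for illustration. Raw strings (r"...") are used so that a single backslash is enough.

```python
import re

# Unescaped dot: "." matches any character, including a digit.
loose = re.compile(r"[0-9](.)[0-9]{2}")
# Escaped dot: "\." matches only a literal decimal point.
strict = re.compile(r"[0-9]\.[0-9]{2}")

# Hypothetical sample strings, not taken from any real PDF.
samples = ["1234", "2.50", "12"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['1234', '2.50']
print(strict_hits)  # ['2.50']

# A doubled backslash in a regular string and a single backslash
# in a raw string spell the same pattern.
same_pattern = "[0-9]\\.[0-9]{2}" == r"[0-9]\.[0-9]{2}"
print(same_pattern)  # True
```

Using a raw string avoids having to count backslashes, which is why it is the usual way to write regular expressions in Python.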

einnairo commented 5 years ago

Thanks for your reply.

I tried both:

(re.compile(u"[0-9]*(\\.)[0-9]{2}"), lambda m : ""),
(re.compile(u"[0-9]*(\.)[0-9]{2}"), lambda m : ""),

Neither solved the problem: whole numbers are still removed.

Could I send you a couple of the PDF examples I am using privately? I do not want to share POs openly.
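One way to narrow this down is to test the escaped pattern on plain Python strings, outside pdf-redactor. This is a minimal sketch with made-up sample values. If whole numbers are not matched here, the regular expression alone is probably not what removes them, and the cause may lie in how the text is stored in the PDF or how the filter is applied to it (an assumption, not something confirmed in this thread).

```python
import re

# Raw string so the backslash reaches the regex engine unchanged.
cost = re.compile(r"[0-9]*\.[0-9]{2}")

# Hypothetical sample strings, not taken from any real PDF.
whole_numbers = ["12", "12345", "7"]
costs = ["12.00", "12345.98", "123.76"]

# Whole numbers contain no literal ".", so the pattern cannot match them.
whole_matched = [s for s in whole_numbers if cost.search(s)]
# Each cost string should be matched in full.
cost_matched = [s for s in costs if cost.fullmatch(s)]

print(whole_matched)  # []
print(cost_matched)   # ['12.00', '12345.98', '123.76']
```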

JoshData commented 5 years ago

Sorry, I don't really have time to debug it with you.