KolbyFlipper / pdfRedact

Takes in an input directory and a redaction PDF, redacts all PDFs in input directory

pdftotext #1

Open fletchy95 opened 4 years ago

fletchy95 commented 4 years ago

I sometimes run into the issue of PyPDF2 not working with certain PDFs that pdftotext handles fine. Are there any plans to have this library use pdftotext instead, or to offer it as an option?

KolbyFlipper commented 4 years ago

If I had to guess, it's because your PDFs are in a different format from the ones I was using this program on. An earlier iteration of this code on my machine used pdftotext, but it didn't work for my PDFs, so I assumed it would not work for any PDFs.

I will look into it this weekend and see if I can implement a way to toggle between it and PyPDF2.
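A toggle like that could look roughly like the sketch below. It is only an illustration, not code from this repo: `extract_pages`, `BACKENDS`, and the `backend`/`fallback`/`backends` parameters are hypothetical names, and it assumes PyPDF2 2.0 or newer (for `PdfReader`) and the `pdftotext` Python package rather than the command-line tool.

```python
from typing import Callable, Dict, List, Optional


def _extract_with_pypdf2(path: str) -> List[str]:
    # Lazy import so the module loads even if PyPDF2 is not installed.
    # Assumes PyPDF2 >= 2.0, where PdfReader and extract_text() exist.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def _extract_with_pdftotext(path: str) -> List[str]:
    # Lazy import; assumes the `pdftotext` Python package (a Poppler binding).
    import pdftotext

    with open(path, "rb") as fh:
        return list(pdftotext.PDF(fh))


# Hypothetical registry mapping a backend name to its extractor.
BACKENDS: Dict[str, Callable[[str], List[str]]] = {
    "pypdf2": _extract_with_pypdf2,
    "pdftotext": _extract_with_pdftotext,
}


def extract_pages(
    path: str,
    backend: str = "pypdf2",
    fallback: bool = True,
    backends: Optional[Dict[str, Callable[[str], List[str]]]] = None,
) -> List[str]:
    """Return one text string per page, using the chosen backend.

    If `fallback` is true and the chosen backend raises, the remaining
    backends are tried in registry order.
    """
    registry = BACKENDS if backends is None else backends
    if backend not in registry:
        raise ValueError(f"unknown backend: {backend!r}")

    order = [backend]
    if fallback:
        order += [name for name in registry if name != backend]

    errors = []
    for name in order:
        try:
            return registry[name](path)
        except Exception as exc:  # a missing library or a parse failure
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Calling `extract_pages("some.pdf", backend="pdftotext", fallback=False)` would force pdftotext only, while the default tries PyPDF2 first and falls back. Note that this only covers reading text out of a PDF; it does not change how the redaction itself is applied.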

KolbyFlipper commented 4 years ago

The way it currently works, the program superimposes a redaction onto an existing PDF and then flattens the underlying data. To use pdftotext, you'd have to know exactly which text strings you're redacting from the existing PDF, and then build a new PDF identical to the original except with those strings removed.
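For context, the superimposing step can be sketched with PyPDF2 roughly as follows. This is a simplified illustration under assumptions, not the repo's actual code: `plan_outputs`, `overlay_redaction`, and `redact_directory` are hypothetical names, the redaction PDF is assumed to be a single page, and PyPDF2 2.0 or newer is assumed for `merge_page` and `add_page`. The flattening step is left out because the details are not covered here; as far as I know, merging an overlay on its own leaves the original text in the page's content stream.

```python
from pathlib import Path
from typing import List, Tuple


def plan_outputs(input_dir: str, output_dir: str) -> List[Tuple[Path, Path]]:
    """Pair every .pdf file in input_dir with a same-named path in output_dir."""
    src = Path(input_dir)
    dst = Path(output_dir)
    return [(pdf, dst / pdf.name) for pdf in sorted(src.glob("*.pdf"))]


def overlay_redaction(source_pdf: Path, redaction_pdf: Path, output_pdf: Path) -> None:
    """Stamp the first page of redaction_pdf on top of every page of source_pdf."""
    # Lazy import; assumes PyPDF2 >= 2.0.
    from PyPDF2 import PdfReader, PdfWriter

    overlay = PdfReader(str(redaction_pdf)).pages[0]
    writer = PdfWriter()
    for page in PdfReader(str(source_pdf)).pages:
        page.merge_page(overlay)  # draws the overlay over the existing content
        writer.add_page(page)
    with open(output_pdf, "wb") as fh:
        writer.write(fh)


def redact_directory(input_dir: str, redaction_pdf: str, output_dir: str) -> None:
    """Apply the overlay to every PDF in input_dir, writing into output_dir."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for source, target in plan_outputs(input_dir, output_dir):
        overlay_redaction(source, Path(redaction_pdf), target)
```

The point of the sketch is that nothing here needs to know what text is on the page, which is why a text-extraction tool like pdftotext does not slot in directly.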

Can you explain a bit more about your use case and how you're using pdftotext?