Bladieblah / xpdf-python

Python wrapper around the pdftotext functionality of xpdf
GNU General Public License v3.0

Feature Request: add password option and page control options #4

Closed ReMiOS closed 1 year ago

ReMiOS commented 1 year ago

Some PDF files are password-protected, so an option to supply a password would be useful.

By default this wrapper library uses textOutTableLayout, which works best in most cases. However, I sometimes get better results with textOutLinePrinter or textOutSimpleLayout. Would it be possible to expose the layout mode as a config option?

from: pdftotext.cc

ownerPassword
userPassword

textOutControl.mode = textOutTableLayout;
textOutControl.mode = textOutPhysLayout;
textOutControl.mode = textOutSimpleLayout;
textOutControl.mode = textOutSimple2Layout;
textOutControl.mode = textOutLinePrinter;

# Update:

I've managed to make some changes, and the page layout is now selectable through a mode option (defaults to table). Since I have little experience with C++, there is probably a nicer way to achieve this.

I don't see how to add an option for ownerPassword and userPassword; any help is appreciated :)

PdfLoader.h

class PdfLoader {
...
private:
  // String literals are const in C++, so the pointers must be const char *.
  const char *table;
  const char *simple;
  const char *lineprinter;
  const char *physical;

PdfLoader.cc

  table = "table";
  simple = "simple";
  lineprinter = "lineprinter";
  physical = "physical";

  if ( strcmp( config.mode, table ) == 0 ) {
     textOutControl.mode = textOutTableLayout;
  } else if ( strcmp( config.mode, simple ) == 0 ) {
    textOutControl.mode = textOutSimpleLayout;
  } else if ( strcmp( config.mode, lineprinter ) == 0 ) {
    textOutControl.mode = textOutLinePrinter;
  } else if ( strcmp( config.mode, physical ) == 0 ) {
    textOutControl.mode = textOutPhysLayout;
  }

PdfLoaderWrapper.cc

PyObject *construct(PyObject *self, PyObject *args) {
...
    PyArg_ParseTuple(args, "Osppppp", &pobj0,
        &(config.mode),
...
    );

pdf_loader.pxi

class PdfLoader:
    def __init__(
        self,
        filename: str,
        mode: str,

pdf_loader.py

class PdfLoader:
    filename: str
    capsule = None

    def __init__(
        self,
        filename: str,
        mode: str = "table",

Bladieblah commented 1 year ago

I've added support for the 4 modes you mentioned, password stuff is also in progress!

ReMiOS commented 1 year ago

Great, I've tested the mode settings and they work nicely.

I also like the workaround of passing an integer instead of converting a Python string to a C++ char *.
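
For readers who want to see what an integer-based mode option can look like on the Python side, here is a minimal sketch. It is not the code that was merged into xpdf-python: the names LayoutMode and mode_to_code are hypothetical, and the integer values are illustrative assumptions rather than the values of xpdf's own enum. The idea is only that the Python layer validates the mode name and hands a plain integer to the C++ layer, so no string has to cross the boundary.

```python
from enum import IntEnum


class LayoutMode(IntEnum):
    # Hypothetical integer codes; they are NOT taken from xpdf's own enum.
    TABLE = 0
    SIMPLE = 1
    LINE_PRINTER = 2
    PHYSICAL = 3


# Mode names as they are used in this thread.
_MODE_NAMES = {
    "table": LayoutMode.TABLE,
    "simple": LayoutMode.SIMPLE,
    "lineprinter": LayoutMode.LINE_PRINTER,
    "physical": LayoutMode.PHYSICAL,
}


def mode_to_code(mode: str = "table") -> int:
    """Translate a mode name into the integer handed to the C++ layer."""
    try:
        return int(_MODE_NAMES[mode])
    except KeyError:
        valid = ", ".join(sorted(_MODE_NAMES))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {valid}") from None
```

Rejecting unknown names in Python means a typo surfaces as a ValueError instead of silently falling through a strcmp chain on the C++ side.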

ReMiOS commented 1 year ago

Added a test PDF with user password = password and owner password = password2:

PWD_Test.pdf

ReMiOS commented 1 year ago

I tried the "add-password-support" branch, but it looks like the password option does not pass any value to PDFDoc.cc.

Both ownerPassword and userPassword in PDFDoc.cc are NULL when values are passed via _loader = PdfLoader('PWDtest.pdf',"password2","password" ) in a Python test script:

type:    class GString * __ptr64
value:   0000000000000000

When I hard-code the values in PdfLoader.cc, the PDF is decoded correctly:

  GString *ownerPasswordGS = new GString("password2");
  GString *userPassword = new GString("password");
  doc = new PDFDoc(fileName, ownerPasswordGS , userPassword);

Since the test PDF has both an owner and a user password set, supplying only one of them also works:

  doc = new PDFDoc(fileName, ownerPasswordGS , NULL);
  doc = new PDFDoc(fileName, NULL , userPassword);

ReMiOS commented 1 year ago

Thanks for your help! PDF files with a password now work as well :)
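
As a companion to the password discussion above, here is a small sketch of how optional passwords could be normalized on the Python side before they reach the extension. It is an assumption about one possible design, not the fix that was actually applied: normalize_passwords is a hypothetical helper. It treats a missing or empty password as None, mirroring the NULL arguments in the PDFDoc calls quoted earlier. On the C side, the z format unit of PyArg_ParseTuple accepts either a str or None and yields a NULL pointer for None; whether xpdf-python uses it is not stated in this thread.

```python
from typing import Optional, Tuple


def normalize_passwords(
    owner_password: Optional[str] = None,
    user_password: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Map missing or empty passwords to None (NULL on the C++ side)."""
    # An empty string is treated like "no password supplied".
    owner = owner_password if owner_password else None
    user = user_password if user_password else None
    return owner, user
```

With the test file from this thread, normalize_passwords("password2", "password") keeps both values, while normalize_passwords(user_password="password") leaves the owner password as None.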