Integrate changes from 'feature_pdfcat' and 'feature_tests' (and lastly, 'master')

acsor commented 5 years ago

This PR includes two primary streams of work. In essence it includes a bigger test suite and the "Table of Contents" feature to the pdfcat utility (whose PYTHONPATH handling has been adjusted accordingly).

It adds several stylistics changes: an almost complete move to the lower camelCase style for functions and methods, which proved to be the prevalent one before the modification, thus almost putting to end issue #12; correct spacing and line lengths that no longer exceed 79 characters, as recommended by PEP 8; added documentation and comments to either the public interface of the library and its private components (addressed for example by #11).

Some algorithms in filters.py have been adjusted more faithfully to their specification, and the interface of the *Decode classes unified: those that were previously called this way have been refactored into *Codec and added an encode() and decode() static method (either implemented or raising a NotImplementedError) if previously missing. There are now even more test cases for these classes too.

A test case has been introduced for decodeStreamData() of filters.py as well, and probably the whole file is now covered (something which could be verified by a coverage tool such as coverage).

Probably one of the persisting problems here is the one from #7. I have noticed that the implementation of LZWCodec coherently encodes/decodes byte streams of up to 10K bytes approximately; after that, a misalignment provokes data corruption. Some work should be performed to solve the little bug (probably just an off-by-one error or a seemingly small inaccuracy).

This series of revisions shall not mark a whole new version of the library, which can be incremented nevertheless, but put it in a cleaner and more stable state for future and more serious enhancements. More improvements and features are yet to come.

Edit: as time went on I had the change to add, well... much more to the initial PR contents. The new commits that I've pushed may not reflect the initial description, but are clearly documented by revisions' comments.

acsor commented 5 years ago

Hey @claird, have a look at Refactor project structure before merging. As I said there, I can incorporate the proposed changes into this very PR if you like it.

acsor commented 5 years ago

The latest two revisions have brought about a major file system refactory and PEP 8-related adjustments, respectively.

acsor commented 5 years ago

Shouldn't we perhaps drop the 4 from the pypdf4 package name? In the distant future PyPDF may undergo several changes, but if we manage to maintain some coherence about the way it is utilized by users, by having import statements as

from pypdf import PdfFileReader, ...

we could simplify upgrades. Take as an example PyPDF5: that would demand an from pypdf5 import ... statement. Without this trailing version number, upgrades from the users could be done seamlessly and without need for code changes.

acsor commented 5 years ago

Hey @claird, I just updated the scripts in samplecode/ (previously Sample_Code/) that you seemed to care about in the Wiki Milestones page:

write "PyPDF4 'Hello, World'" page--the samples and such are in worse shape than I realized

acsor commented 5 years ago

Shouldn't we perhaps drop the 4 from the pypdf4 package name?

Since we seemed a bit idle, I did the change regardless of claird's approval, although I see it very well motivated. In case it is not wanted it will be easy to revert back to the unconvenient pypdf4 name (for no reason choose PyPDF4 though!).

xilopaint commented 5 years ago

There are plenty of good changes here. I hope this PR can be merged soon.

Btw, is there any plans to address some issues created in the PyPDF2 repo?

acsor commented 5 years ago

Btw, is there any plans to address some issues created in the PyPDF2 repo?

Which ones, particularly? I scrolled through it a while ago and tried to scan for relevant issues and unmerged pull requests.

There are plenty of good changes here

Thanks for it :+1:. Actually, I still have a little more that I hope too push in a few days. Many are foundational to the library and would benefit from a peer review. Lacking them, I have developed many more unit tests.

2018-09-23-161958_1365x237_scrot

xilopaint commented 5 years ago

Btw, is there any plans to address some issues created in the PyPDF2 repo?

Which ones, particularly? I scrolled through it a while ago and tried to scan for relevant issues and unmerged pull requests.

I can bring here some examples. One I've been facing is https://github.com/mstamy2/PyPDF2/issues/449.

acsor commented 5 years ago

Hey @DeliciousHair, with reference to #11 I have implemented a new method in PdfFileReader, namely objects(), that lists all the indexed objects definitions in a PDF file. Since it looks like you needed it, would you mind trying it out in some of your code? I have already done extensive unit testing with it and am comfortable to say that it should work reliably.

xilopaint commented 5 years ago

I'm looking forward to seeing this PR merged. Any chance it can be reviewed soon @claird?

acsor commented 5 years ago

The very last commit targets Issue #21. Hopefully this has been solved with minimal effort.

acsor commented 5 years ago

I'm looking forward to seeing this PR merged. Any chance it can be reviewed soon @claird?

Since I'm going to be more busy due to my university duties in a couple of weeks or so, my stream of contributions will not continue indefinitely. I'd like to leave some enduring legacy, so I have worked on establishing those that I deem "proper contribution guidelines". Here is an initial draft and when (and if) I'll be added to the contributors list I'll propose the change to the official repository. Wiki pages don't behave like normal repositories and there isn't such a thing as a pull request.

Would some of you mind to share their own thoughts and suggestions over these few paragraphs? The document is purposefully succinct, otherwise no one would bother to check it out.

claird commented 5 years ago

Shall we spell the word 'TO-DO' or 'TODO'?

I'll probably add a paragraph to the effect that, while we're rigorous with these requirements, we're also liberal with our help. If someone has a line that he believes must be 85 characters, or doesn't know how to retreat to 2.7, someone else with the project almost certainly will know a work-around.

Cameron Laird, vice president We make computers work for people.

On Thu, Sep 27, 2018 at 1:17 PM Oscar notifications@github.com wrote:

I'm looking forward to seeing this PR merged. Any chance it can be reviewed soon @claird https://github.com/claird?

Since I'm going to be more busy due to my university duties in a couple of weeks or so, my stream of contributions will not continue indefinitely. I'd like to leave some enduring legacy, so I have worked on establishing those that I deem "proper contribution guidelines". Here is an initial draft https://github.com/newnone/PyPDF4/wiki/Contributing-Rules and when (and if) I'll be added to the contributors list I'll propose the change to the official repository. Wiki pages don't behave like normal repositories and there isn't such a thing as a pull request.

Would some of you mind to share their own thoughts and suggestions over these few paragraphs? The document is purposefully succinct, otherwise no one would bother to check it out.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/claird/PyPDF4/pull/14#issuecomment-425192861, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbN9OfwvDyDgo__xcQf39mMbrtqFj6_ks5ufRYagaJpZM4WjXVE .

acsor commented 5 years ago

Shall we spell the word 'TO-DO' or 'TODO'?

I advise TO-DO, because that seems to be the default in several IDEs.

claird commented 5 years ago

Oh, my. You're right, of course, and I hadn't noticed. My difficulty is that pylint recognizes only TODO. Oh, well ...

We certainly have more important matters before us.

Cameron Laird, vice president We make computers work for people.

On Sun, Sep 30, 2018 at 4:56 PM Cameron Laird claird@phaseit.net wrote:

Shall we spell the word 'TO-DO' or 'TODO'?

I'll probably add a paragraph to the effect that, while we're rigorous with these requirements, we're also liberal with our help. If someone has a line that he believes must be 85 characters, or doesn't know how to retreat to 2.7, someone else with the project almost certainly will know a work-around.

Cameron Laird, vice president We make computers work for people.

On Thu, Sep 27, 2018 at 1:17 PM Oscar notifications@github.com wrote:

I'm looking forward to seeing this PR merged. Any chance it can be reviewed soon @claird https://github.com/claird?

Since I'm going to be more busy due to my university duties in a couple of weeks or so, my stream of contributions will not continue indefinitely. I'd like to leave some enduring legacy, so I have worked on establishing those that I deem "proper contribution guidelines". Here is an initial draft https://github.com/newnone/PyPDF4/wiki/Contributing-Rules and when (and if) I'll be added to the contributors list I'll propose the change to the official repository. Wiki pages don't behave like normal repositories and there isn't such a thing as a pull request.

Would some of you mind to share their own thoughts and suggestions over these few paragraphs? The document is purposefully succinct, otherwise no one would bother to check it out.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/claird/PyPDF4/pull/14#issuecomment-425192861, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbN9OfwvDyDgo__xcQf39mMbrtqFj6_ks5ufRYagaJpZM4WjXVE .

xilopaint commented 5 years ago

@newnone could you summarize the most important changes of the merged PR? I just replaced PyPDF2 with PyPDF4 on my project and it seems to work fine in my first tests. I would like to know what I can expect in terms of performance, bug fixes and new features.

acsor commented 5 years ago

This series of revisions shall not mark a whole new version of the library, which can be incremented nevertheless, but put it in a cleaner and more stable state for future and more serious enhancements. More improvements and features are yet to come.

As I explained earlier before initiating it, this PR wasn't meant to add new functionalities. At the outset I just wanted to bring it to a cleaner state for future improvements, but then I managed to create something new. To the external user, the new goodies probably are:

PdfReader.objects(), that allows to fetch all the indirect objects from a PDF file.
scripts/codecs.py that gathers the "filters" (LZW, ASCIIHex, ASCII85, ...) through a command-line interface.
Special methods for PdfFileReader and PdfFileWriter (__repr__(), __del__(), __enter__(), __exit__())
Renovated and easier module name: we are in line with PEP 8 and each time you wish to import PyPDF4, you do import pypdf .... from pypdf import * is also more careful with what stuff to include in the namespace.
Other little stuff, such as PdfFileReader.filepath, PdfFileReader.isObjectFree(), PdfFileReader.isClosed(), etc.

Internally, I have brought about changes like:

Fortified unit tests suite
Full compliance with Cross-Reference Stream Dictionaries and Object Streams (at least, this is the point we seem to have reached)
An ObjectStream class definition in generics.py, plus other changes to the same module
Improved sample scripts in samplecode/
Yet other PEP 8 adjustments

Probably there has been more than just this, but I lack the time to list them all even succintly.

xilopaint commented 5 years ago

Btw, is there any plans to address some issues created in the PyPDF2 repo?

Which ones, particularly? I scrolled through it a while ago and tried to scan for relevant issues and unmerged pull requests.

I can bring here some examples. One I've been facing is mstamy2#449.

@newnone, have you already taken some time to look into this issue?

claird / PyPDF4

Integrate changes from 'feature_pdfcat' and 'feature_tests' (and lastly, 'master') #14