Skip comments (%… til end of line)

huppmann commented 1 year ago

This attempts to skip comments.

I only attempted to locate an appropriate place to embed the functionality in "core"; I have no idea if there's similar work that could stand to be done in "document", but, with the modification made just to "core", a document that caused a parsing exception (

Corrupted "readObjectAtOffset" ["Failed reading: empty"]

) when I ran the example "unpack-and-decrypt-pdf-file" (that's not encrypted) was successfully dumped.

Which is about as much QA as I can give about how well the code will work generally. (It's a short document, so, in particular, performance might well suffer.)

I also have not the foggiest idea if this flies in the face of the standards. (There are so many, none of which I've read, so I only skimmed the wikipedia article and squinted at what python pdfminer does and found a place where it seems like pdf-toolbox might do similarly). The most I can say is that "unpack-and-decrypt-pdf-file" still emitted the "%PDF-1.7" at the beginning of output.

With those provisos aside, recognizing that I may have already made this "unpullable", to restate what's in the commit message:

I've seen documents that contain content like

<<\r/Name /C0L00AON24\r/Subtype /Type3 \r/Type /Font\r% Font C0L00AON\r/CharProcs 24 0 R\r/FontBBox [0.000 -4.000 29.000 30.000]\r/FontMatrix [ 1 0 0 1 0 0 ]\r/FirstChar 64\r/LastChar 249\r/Widths 25 0 R\r/Encoding 26 0 R\r/ToUnicode  27 0 R\r>>

where the "% Font C0L00AON" line causes parsing to fail.

So this commit augments skipSpace by skipping not only space but also anything that looks like a comment, til end of line, recursively.

At any rate, whether or not you care for this PR, I'm a very happy user of your package. Thanks a lot!

Yuras commented 1 year ago

At any rate, whether or not you care for this PR, I'm a very happy user of your package. Thanks a lot!

I do care! Thank you for the PR. And, if possible, I'd be happy to hear more details on your use case.

Here is the list of things I need to address before merging this. It's mostly for myself, so that I won't forget.

[x] Fix CI. The last change in the repo was few years ago, and CI doesn't work anymore. Would be nice to add new ghc versions, but it's not a blocker. Done: https://github.com/Yuras/pdf-toolbox/pull/80
[x] Read PDF spec and figure out whether comments are allowed here and how exactly we should handle them. Done: according to section 7.2.2 any occurrence of % outside of a string or a stream starts a comment, which ends at the end of line, but doesn't includes the EOF itself. Comments should be handled as a single white space character.
[x] Add test case for the change. Probably just add a unit test similar to this but with a comment inside. Done: https://github.com/Yuras/pdf-toolbox/pull/81
[x] I think we need to skip space after the comment as well. Though it might be already handled. Done: yes, it's already handled
[x] Also ensure we handle \r\n, which is a valid EOL in PDF (it might be already OK though) Done: yes,it's already handled

huppmann commented 1 year ago

I do care! Thank you for the PR. And, if possible, I'd be happy to hear more details on your use case.

Glad you are looking to do something about this; can see it being useful for more than just myself. For my part, I can work with my fork. As for my use case, it's a banking document, so I'm afraid I'm not at liberty to share.

Here is the list of things I need to address before merging this. It's mostly for myself, so that I won't forget.

Yes, I'm afraid I'm not going to be helpful with that punch list: I'm just a dabbler and not a software engineer, which is the standard you obviously hold your code to.

* Read PDF spec and figure out whether comments are allowed here and how exactly we should handle them.

As mentioned, I'm not even up to digesting the specs. I can offer one more example of working code that seems to treat comments similarly, though:

https://cgit.freedesktop.org/poppler/poppler/tree/poppler/Lexer.cc?id=cc5ac1665aa3056d1f90a12e12d24a02536647e0#n169

Of course, I recognize, to riff on an old saw, that the plural of implementation is not "standard", but … in case it's useful.

* I think we need to skip space after the comment as well. Though it might be already handled.

The code I submitted recursively calls itself to gobble up more spaces and comments (and, I hope, that's all), if present.

* Also ensure we handle `\r\n`, which is a valid EOL in PDF (it might be already OK though)

The code breaks off at the appearance of either a '\r' or '\n' and lets endOfLine consume whatever line-ending ('\r', '\n', or "\r\n") follows.

Again, thanks

Yuras commented 1 year ago

Could you please take a look at #81

There are two significant differences:

test case was added, and
now we handle comments in xref as well

Also there is a minor difference: there is no need to recurse since many already does recursion. I don't think it affects anything at all, but I find it a bit more clear.

Now we have two options:

either you incorporate these changes into your PR, or
I merge #81, but then your authorship won't be recorded in git history. Please let me know what you prefer.

I'm afraid I'm not at liberty to share

I see. Anyway, I'm glad someone found this package useful.

huppmann commented 1 year ago

* either you incorporate these changes into your PR, or

* I merge [Handle comments as spaces #81](https://github.com/Yuras/pdf-toolbox/pull/81), but then your authorship won't be recorded in git history.
  Please let me know what you prefer.

That's generous, but my drive-by PR bears almost no resemblance to the more polished code in yours, and you also did all the additional work to figure out the correctness and the additional applicability to xrefs.

I'm closing this and looking forward to your merging #81 so i can pull that into my little hobby project (parsing my bank statements to catch up on about a pandemic's worth of overdue personal accounting)

Thanks again for adding this feature, and so quickly.

Yuras / pdf-toolbox

Skip comments (%… til end of line) #78