Yuras / pdf-toolbox

A collection of tools for processing PDF files in Haskell
181 stars 25 forks source link

Skip comments (%… til end of line) #78

Closed huppmann closed 1 year ago

huppmann commented 1 year ago

This attempts to skip comments.

I only attempted to locate an appropriate place to embed the functionality in "core"; I have no idea if there's similar work that could stand to be done in "document", but, with the modification made just to "core", a document that caused a parsing exception (

Corrupted "readObjectAtOffset" ["Failed reading: empty"]

) when I ran the example "unpack-and-decrypt-pdf-file" (that's not encrypted) was successfully dumped.

Which is about as much QA as I can give about how well the code will work generally. (It's a short document, so, in particular, performance might well suffer.)

I also have not the foggiest idea if this flies in the face of the standards. (There are so many, none of which I've read, so I only skimmed the wikipedia article and squinted at what python pdfminer does and found a place where it seems like pdf-toolbox might do similarly). The most I can say is that "unpack-and-decrypt-pdf-file" still emitted the "%PDF-1.7" at the beginning of output.

With those provisos aside, recognizing that I may have already made this "unpullable", to restate what's in the commit message:

I've seen documents that contain content like

<<\r/Name /C0L00AON24\r/Subtype /Type3 \r/Type /Font\r% Font C0L00AON\r/CharProcs 24 0 R\r/FontBBox [0.000 -4.000 29.000 30.000]\r/FontMatrix [ 1 0 0 1 0 0 ]\r/FirstChar 64\r/LastChar 249\r/Widths 25 0 R\r/Encoding 26 0 R\r/ToUnicode  27 0 R\r>> 

where the "% Font C0L00AON" line causes parsing to fail.

So this commit augments skipSpace by skipping not only space but also anything that looks like a comment, til end of line, recursively.

At any rate, whether or not you care for this PR, I'm a very happy user of your package. Thanks a lot!

Yuras commented 1 year ago

At any rate, whether or not you care for this PR, I'm a very happy user of your package. Thanks a lot!

I do care! Thank you for the PR. And, if possible, I'd be happy to hear more details on your use case.

Here is the list of things I need to address before merging this. It's mostly for myself, so that I won't forget.

huppmann commented 1 year ago

I do care! Thank you for the PR. And, if possible, I'd be happy to hear more details on your use case.

Glad you are looking to do something about this; can see it being useful for more than just myself. For my part, I can work with my fork. As for my use case, it's a banking document, so I'm afraid I'm not at liberty to share.

Here is the list of things I need to address before merging this. It's mostly for myself, so that I won't forget.

Yes, I'm afraid I'm not going to be helpful with that punch list: I'm just a dabbler and not a software engineer, which is the standard you obviously hold your code to.

* Read PDF spec and figure out whether comments are allowed here and how exactly we should handle them.

As mentioned, I'm not even up to digesting the specs. I can offer one more example of working code that seems to treat comments similarly, though:

https://cgit.freedesktop.org/poppler/poppler/tree/poppler/Lexer.cc?id=cc5ac1665aa3056d1f90a12e12d24a02536647e0#n169

Of course, I recognize, to riff on an old saw, that the plural of implementation is not "standard", but … in case it's useful.

* I think we need to skip space after the comment as well. Though it might be already handled.

The code I submitted recursively calls itself to gobble up more spaces and comments (and, I hope, that's all), if present.

* Also ensure we handle `\r\n`, which is a valid EOL in PDF (it might be already OK though)

The code breaks off at the appearance of either a '\r' or '\n' and lets endOfLine consume whatever line-ending ('\r', '\n', or "\r\n") follows.

Again, thanks

Yuras commented 1 year ago

Could you please take a look at #81

There are two significant differences:

Also there is a minor difference: there is no need to recurse since many already does recursion. I don't think it affects anything at all, but I find it a bit more clear.

Now we have two options:

I'm afraid I'm not at liberty to share

I see. Anyway, I'm glad someone found this package useful.

huppmann commented 1 year ago
* either you incorporate these changes into your PR, or

* I merge [Handle comments as spaces #81](https://github.com/Yuras/pdf-toolbox/pull/81), but then your authorship won't be recorded in git history.
  Please let me know what you prefer.

That's generous, but my drive-by PR bears almost no resemblance to the more polished code in yours, and you also did all the additional work to figure out the correctness and the additional applicability to xrefs.

I'm closing this and looking forward to your merging #81 so i can pull that into my little hobby project (parsing my bank statements to catch up on about a pandemic's worth of overdue personal accounting)

Thanks again for adding this feature, and so quickly.