Closed huppmann closed 1 year ago
At any rate, whether or not you care for this PR, I'm a very happy user of your package. Thanks a lot!
I do care! Thank you for the PR. And, if possible, I'd be happy to hear more details on your use case.
Here is the list of things I need to address before merging this. It's mostly for myself, so that I won't forget.
%
outside of a string or a stream starts a comment, which ends at the end of line, but doesn't includes the EOF itself. Comments should be handled as a single white space character.\r\n
, which is a valid EOL in PDF (it might be already OK though)
Done: yes,it's already handledI do care! Thank you for the PR. And, if possible, I'd be happy to hear more details on your use case.
Glad you are looking to do something about this; can see it being useful for more than just myself. For my part, I can work with my fork. As for my use case, it's a banking document, so I'm afraid I'm not at liberty to share.
Here is the list of things I need to address before merging this. It's mostly for myself, so that I won't forget.
Yes, I'm afraid I'm not going to be helpful with that punch list: I'm just a dabbler and not a software engineer, which is the standard you obviously hold your code to.
* Read PDF spec and figure out whether comments are allowed here and how exactly we should handle them.
As mentioned, I'm not even up to digesting the specs. I can offer one more example of working code that seems to treat comments similarly, though:
Of course, I recognize, to riff on an old saw, that the plural of implementation is not "standard", but … in case it's useful.
* I think we need to skip space after the comment as well. Though it might be already handled.
The code I submitted recursively calls itself to gobble up more spaces and comments (and, I hope, that's all), if present.
* Also ensure we handle `\r\n`, which is a valid EOL in PDF (it might be already OK though)
The code breaks off at the appearance of either a '\r' or '\n' and lets
endOfLine
consume whatever line-ending ('\r', '\n', or "\r\n") follows.
Again, thanks
Could you please take a look at #81
There are two significant differences:
Also there is a minor difference: there is no need to recurse since many
already does recursion. I don't think it affects anything at all, but I find it a bit more clear.
Now we have two options:
I'm afraid I'm not at liberty to share
I see. Anyway, I'm glad someone found this package useful.
* either you incorporate these changes into your PR, or * I merge [Handle comments as spaces #81](https://github.com/Yuras/pdf-toolbox/pull/81), but then your authorship won't be recorded in git history. Please let me know what you prefer.
That's generous, but my drive-by PR bears almost no resemblance to the more polished code in yours, and you also did all the additional work to figure out the correctness and the additional applicability to xrefs.
I'm closing this and looking forward to your merging #81 so i can pull that into my little hobby project (parsing my bank statements to catch up on about a pandemic's worth of overdue personal accounting)
Thanks again for adding this feature, and so quickly.
This attempts to skip comments.
I only attempted to locate an appropriate place to embed the functionality in "core"; I have no idea if there's similar work that could stand to be done in "document", but, with the modification made just to "core", a document that caused a parsing exception (
) when I ran the example "unpack-and-decrypt-pdf-file" (that's not encrypted) was successfully dumped.
Which is about as much QA as I can give about how well the code will work generally. (It's a short document, so, in particular, performance might well suffer.)
I also have not the foggiest idea if this flies in the face of the standards. (There are so many, none of which I've read, so I only skimmed the wikipedia article and squinted at what python pdfminer does and found a place where it seems like pdf-toolbox might do similarly). The most I can say is that "unpack-and-decrypt-pdf-file" still emitted the "%PDF-1.7" at the beginning of output.
With those provisos aside, recognizing that I may have already made this "unpullable", to restate what's in the commit message:
I've seen documents that contain content like
where the "% Font C0L00AON" line causes parsing to fail.
So this commit augments
skipSpace
by skipping not only space but also anything that looks like a comment, til end of line, recursively.At any rate, whether or not you care for this PR, I'm a very happy user of your package. Thanks a lot!