Extract annotations - Githubissues

PhilterPaper commented 3 years ago

( moved over from #142, asked by @carygravel )

[15 February 2021] by carygravel

I don't have any tools for extracting the annotations in a PDF. If I open a PDF in Builder, how can I iterate over the annotations in a page?

[15 February 2021] by PhilterPaper

You want to find all the annotations on a page? I haven't tried it, but you might be able to start at a page's $self->{'Annots'} and work your way through it from there. Are you trying to remove existing annotations from a page, modify them, extract their text, or something else? Googling PDF extract annotations, I see https://unix.stackexchange.com/questions/31521/how-to-extract-annotations-from-pdf-files -- that might be a start. Apparently there are tools out there to do such a thing.
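
A minimal, untested sketch of that suggestion, assuming the PDF is opened with `PDF::Builder->open()` and that each page's `{'Annots'}` entry is an array of annotation dictionaries, possibly stored as indirect references. It reaches into Builder's internal object tree rather than a documented API, so the hash keys and the `realise()`/`val()` calls are assumptions that may not hold in every release:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use PDF::Builder;

# Untested sketch: list the annotations on each page of an existing PDF.
# It relies on PDF::Builder internals (the 'Annots', 'Subtype' and
# 'Contents' hash keys), which are not a documented public interface.

my $file = shift @ARGV or die "usage: $0 file.pdf\n";
my $pdf  = PDF::Builder->open($file);

for my $n (1 .. $pdf->page_count()) {
    my $page   = $pdf->open_page($n);
    my $annots = $page->{'Annots'} or next;    # page has no annotations

    # /Annots may be an indirect reference; realise() loads the real array.
    $annots->realise();

    for my $annot (@{ $annots->val() }) {
        $annot->realise();    # each entry may be an indirect reference too

        # /Subtype is a name (Text, Link, ...); /Contents is optional.
        my $subtype  = $annot->{'Subtype'}  ? $annot->{'Subtype'}->val()  : '(none)';
        my $contents = $annot->{'Contents'} ? $annot->{'Contents'}->val() : '';

        print "page $n: $subtype: $contents\n";
    }
}
```

If this works at all, the `/Contents` strings come back as raw PDF text strings, which may be UTF-16 encoded with a byte-order mark, so they might need decoding before being shown in a GUI.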

[16 February 2021] by carygravel

> You want to find all the annotations on a page? I haven't tried it, but you might be able to start at a page's $self->{'Annots'} and work your way through it from there. Are you trying to remove existing annotations from a page, modify them, extract their text, or something else? Googling PDF extract annotations, I see https://unix.stackexchange.com/questions/31521/how-to-extract-annotations-from-pdf-files -- that might be a start. Apparently there are tools out there to do such a thing.

Whilst I understand that it is unreasonable to expect Builder to read any PDF, I think it is reasonable to expect it to roundtrip any PDF that it originally created. That is what I am trying to achieve: to extract the annotations from a PDF that Builder previously created, in order to display them to the user in a GUI.

Your Stack Exchange link uses Poppler, a PDF library aimed mainly at Linux. I could probably get it to work for me, but it is not (easily) available for Windows.

I've had a quick look at $self->{'Annots'}, but can't see my original annotations there, except in the form of the stringified PDF.

It would be great if you could expose some API to iterate through them.

PhilterPaper commented 3 years ago

Just to make sure I understand your request, are you talking about annotations in general, such as those added by readers and saved, or just annotations in the original source of the PDF (as generated by PDF::Builder)? The latter would not make much sense to me, as you would already have all the annotations at hand.

carygravel commented 3 years ago

Ideally, iterating over all annotations would be a nice goal, but an initial realistic goal should be to be able to parse a PDF that Builder created in a previous session.

Typically, my users scan documents and can add a hidden text layer. Thanks to you, they now have the possibility to add annotations before saving the PDF.

Often, they want to start with an existing PDF and add to it. In order to do so, I would like to be able to read the annotations created previously.

PhilterPaper commented 3 years ago

OK, if I understand you, your users are using a Builder-based program to update (and even add annotations to) a PDF that may have been created by Builder or by some other facility (e.g., from a scanner driver). When you want to extract these annotations, this is what you refer to as a "round trip"? In that case, the annotation would not necessarily be fixed content, so it would indeed be useful to be able to extract it. Then, could they use the annotation facility in Adobe Reader (highlighter icon) or another PDF Reader to (still further) add or update annotations? That would also be useful to extract.

There shouldn't be any real difference in annotations added by Builder (whether during original PDF creation, or reading in and updating an existing PDF) and those added by a Reader (e.g., Adobe) and saved, which is why I haven't been distinguishing between the two. I'm curious as to why you are drawing a line between the two. If Builder can roundtrip annotations added by Builder, it should be able to just as easily read in annotations added by Adobe Reader, etc. Unless the documentation is wrong again... :-(

carygravel commented 3 years ago

> OK, if I understand you, your users are using a Builder-based program to update (and even add annotations to) a PDF that may have been created by Builder or by some other facility (e.g., from a scanner driver). When you want to extract these annotations, this is what you refer to as a "round trip"?

Exactly that. We would be round-tripping between PDF and Builder's internal format:

https://en.wikipedia.org/wiki/Round-trip_format_conversion

> There shouldn't be any real difference in annotations added by Builder (whether during original PDF creation, or reading in and updating an existing PDF) and those added by a Reader (e.g., Adobe) and saved, which is why I haven't been distinguishing between the two. I'm curious as to why you are drawing a line between the two. If Builder can roundtrip annotations added by Builder, it should be able to just as easily read in annotations added by Adobe Reader, etc. Unless the documentation is wrong again... :-(

You're right, but only if Builder supports all possible options. Hence the first goal should be to read that which it wrote itself.

PhilterPaper commented 3 years ago

Note that there are dozens and dozens of different annotation types, of which PDF::Builder supports only a handful. Would you be happy with only being able to extract the types of annotations that Builder can put in (in the first place)? At least, that would be a reasonable start. I can't see doing other types unless I also add the ability for Builder to create them in the first place.

carygravel commented 3 years ago

Exactly that. We should initially support enough so that Builder can read what it has written. Once that is working, we can expand both as necessary.

PhilterPaper / Perl-PDF-Builder

Extract annotations #147