How to Permanently Mutate PDFs via Annotation?

fonol / anki-search-inside-add-card

An add-on providing full-text-search and PDF reading functionality to Anki's Add card dialog

https://ankiweb.net/shared/info/1781298089

GNU Affero General Public License v3.0


How to Permanently Mutate PDFs via Annotation? #94

Closed TheCedarPrince closed 4 years ago

TheCedarPrince commented 4 years ago

Hi @fonol ,

This is a fantastic extension, and I strongly believe it has a lot of potential to change workflows for a great many people, especially by providing a usable tool for incremental reading. Quite excited!

A few questions I had regarding the extension:

1. How can I permanently embed annotations (comments/highlights) into a given PDF?
2. Where are annotations currently stored?
3. If the functionality does not exist, how difficult would it be to permanently mutate PDFs from within Anki?

There are a few reasons I want this functionality: I generally share the PDFs I annotate with others, I keep my PDFs synced to a cloud service so I can annotate on any platform, and I use Zotero's zotfile extension to pull all my annotations out of the PDF for extended write-ups later on.

From what I can gather, you point the extension to a given PDF on your machine, and the PDF is then read into the extension along with any previously made annotations. I tested this by importing a file and annotating it with FoxitReader, a program that permanently adds annotations to the PDF itself, and my highlights from that software showed up in the Anki extension. However, when I annotate a PDF inside Anki, the annotations are not saved to the file. Based on your description, it sounds like all annotations associated with a file go to a DB somewhere on one's machine.

I would greatly appreciate this functionality in the extension, perhaps as a setting that toggles whether annotations are stored in the PDF.

Does that all make sense? Thanks for the great work!

fonol commented 4 years ago

Hi, I am happy you like it. And with your described workflow, I totally understand your need for that feature.

1. You can't really. PDF viewers based on PDF.js are basically read-only.
2. There is an SQLite DB, by default in your add-on's user_files folder, called siac-notes.db. It contains your add-on's notes, as well as the pages marked as read, annotations and note priorities.
3. Well, I guess one could use the coordinates of the annotations as they exist now, and use some Python library that allows adding annotations to PDFs, e.g. a quick Google search shows me https://github.com/plangrid/pdf-annotate . There would then be the problem of how to remove or edit them, as pdf.js will display them but, to my knowledge, not allow any interaction. So to be able to edit/delete them, we would probably have to convert the embedded annotations to the add-on's annotations on opening the PDF, and convert back on closing. It sounds like quite some work, but I will happily accept pull requests, if you dare to work through my spaghetti code :).
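
For anyone who wants to look inside that database before writing code against it, here is a minimal sketch that lists the tables of an SQLite file using only Python's standard library. It deliberately assumes nothing about the add-on's schema; the function name `list_tables` is made up for this example, and the path you pass in depends on where your add-on's user_files folder lives.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    # Open read-only so that inspecting the file cannot modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables` on your copy of siac-notes.db should show which tables exist; the column layout of each can then be checked with `PRAGMA table_info(<table>)`.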

TheCedarPrince commented 4 years ago

Hey @fonol ,

I would be willing to do some work on this! I just have a few additional questions:

After looking around your code base, it would appear that the notes.py file is where annotations are made and sent to the DB. I had trouble identifying the function that allows one to retrieve annotations for a particular PDF. How does this function work in your code?
I checked out pdf-annotate. Seems very promising for what we are both thinking. So just to be clear on what you are thinking, the flow would go as follows:
1. One opens a file into the extension without any annotations
2. The file is then annotated in the extension
3. Annotations are made and saved to the DB separately
4. If user desires, a setting could be selected to write these annotations upon extension close to the file
5. Upon close of extension or perhaps a button someone pushes, annotations are either saved to the file or only to the database.
6. Upon opening of the annotated file, a user could click a button to convert annotations to editable annotations inside of the extension.

Is that what you have in mind?

If I can prototype this general flow and necessary functionality, could I have assistance integrating it into your code?

Thanks @fonol ! Take care and I look forward to your response!

fonol commented 4 years ago

You are right, the function is in notes.py, it is called _gethighlights. If you do the pdf annotation part, I can "wire it up" with the rest of the add-on, which can be a bit hard to get into (due to general messy structure).

What your added pdf annotation module should be able to do is basically:

given a page, get the embedded annotations (I don't really care in what form; the important parts are the top-left x and y coordinates and the bottom-right x and y coordinates of the rectangle, which is how they are stored currently, and of course the text content).
given a page and a list of annotations (coords + text), delete all existing embedded annotations on that page, and insert the given ones. I guess it is easier that way, to simply overwrite all existing embedded annotations when something is changed in the add-on, than to try to do some mapping of the add-on's ones to the embedded ones.
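
The two requirements above can be sketched as a small interface. This is only an illustration under stated assumptions, not the add-on's code: `Annotation`, `to_pdf_rect` and `InMemoryPdf` are hypothetical names, the in-memory class stands in for a real PDF library, and the add-on's coordinates are assumed to be measured from the page's top-left corner in the same units as the PDF page. What is certain is that PDF default user space has its origin at the bottom-left, so the y values need flipping before embedding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """Rectangle plus text; corners measured from the page's top-left (assumed)."""
    x0: float  # top-left x
    y0: float  # top-left y
    x1: float  # bottom-right x
    y1: float  # bottom-right y
    text: str


def to_pdf_rect(ann, page_height):
    """Convert to a PDF-style (llx, lly, urx, ury) rectangle.

    PDF default user space has its origin at the bottom-left, so y is flipped.
    """
    return (ann.x0, page_height - ann.y1, ann.x1, page_height - ann.y0)


class InMemoryPdf:
    """Hypothetical stand-in for a real PDF backend; keeps annotations per page."""

    def __init__(self):
        self._pages: Dict[int, List[Annotation]] = {}

    def get_annotations(self, page):
        # Requirement 1: return the annotations embedded on the given page.
        return list(self._pages.get(page, []))

    def replace_annotations(self, page, annotations):
        # Requirement 2 (overwrite strategy): drop everything on the page,
        # then insert the given list, with no per-annotation mapping.
        self._pages[page] = list(annotations)
```

With this shape, editing one annotation in the add-on means calling `replace_annotations` with everything currently displayed on that page, which avoids having to match add-on annotations to embedded ones.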

A quick glance at the above-mentioned lib didn't show me whether it supports deleting annotations, though.

TheCedarPrince commented 4 years ago

Hey @fonol ,

Ah thank you so much for helping me with wiring it up! I'll see about creating a few additional methods that can utilize your get_highlights method to create permanent annotations.

A few more additional questions:

Could you explain more on what you mean by "delete all existing embedded annotations"? Does that mean if an arbitrary file is imported with already made annotations, they are erased and thrown out? Or should I be storing them as well? At face value, it seems to not help with portability of the file with embedded annotation.
I am confused by what you are saying here: "I guess it is easier that way, to simply overwrite all existing embedded annotations when something is changed in the add-on, than to try to do some mapping of the add-on's ones to the embedded ones." I am not sure how you are suggesting annotations should be handled. Do you want the set-up to handle only annotations that come from the extension, or do you mean something else?

So, you are correct. Upon a quick investigation of that particular Python library, one cannot write to a PDF. Furthermore, I discovered that almost all Python-based PDF manipulation libraries lack this feature. All of them, however, can identify where a PDF annotation is and permanently embed annotations into the PDF.

However, I did successfully find this library that has both needed features: https://github.com/mstamy2/PyPDF2

If I cannot remove annotations, what would you suggest we do? I could try looking at other JS libraries as well if that could help.

Thanks!

fonol commented 4 years ago

Hi, I meant the following: Suppose you are in the add-on, on a certain page, and the add-on has "converted" the embedded annotations and displays them in the way they are currently shown. You now edit the text of one of the annotations on the page. The way I imagined it, if you have some config option checked, it should now directly update the embedded annotation. But I imagine it might be cumbersome to target only that one specific embedded annotation on that page (i.e. delete and then recreate it with the updated text). So it might be easier to just delete all on the current page, take all displayed on the current page in the add-on, and embed them. But I haven't looked at the Python libs at all, and maybe it is easy to target a specific annotation on a page and delete/update it. Then by all means please just completely ignore what I wrote!

If there is no lib that can remove/update annotations, I honestly don't really know what to do. I suspect you won't find an open source javascript library more advanced than pdf.js, so I guess your best bet would be Python.

TheCedarPrince commented 4 years ago

Hey @fonol

I apologize but unfortunately, after much review of current PDF interaction libraries via Python, I cannot find the features we would need. :disappointed:

I am going to close this issue now. Although I think this is a great feature to add, I don't think the capabilities are there for me to easily plug into your code. Apologies for having taken your time, but I wish you the best on future development.

~ TheCedarPrince