Incremental reading of PDF's

nanjigen commented 4 years ago

Incrementally reading PDFs would be a huge boon to many academics using Emacs (of which I believe there are many), and would bring us one step closer to an integrated research environment.

I believe that we can create a workable PDF IR system by mostly stitching together existing tools. In org-noter, we have an extraction mechanism for highlights via its org-noter-create-skeleton function; the relevant section can be found here: https://github.com/weirdNox/org-noter/blob/9ead81d42dd4dd5074782d239b2efddf9b8b7b3d/org-noter.el#L1580

My thinking, which I describe here, is that we have an incremental extraction of highlighted elements, or even more simply an advice-add or function that triggers an extraction via org-noter's code for every highlight made, as it's made. Additionally, we can structure the extracted content for org-drill by adding the :drill: tag, making sure the header has a faux subheader so that org-drill recognizes the entry as a card, etc.
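A minimal sketch of the two pieces mentioned above, under stated assumptions. The names my/ir-format-drill-entry, my/ir-highlight-functions and my/ir-after-highlight are hypothetical and belong to no existing package; the "Answer" subheading is only a placeholder for the faux subheader. The advice assumes pdf-tools provides pdf-annot-add-highlight-markup-annotation (the command named later in this thread), and the actual extraction step is left to whatever function is put on the hook.

```elisp
(require 'subr-x)

(defun my/ir-format-drill-entry (title text &optional level)
  "Return an Org entry for TEXT, tagged :drill:, as a string.
TITLE is the heading text and LEVEL the heading depth (default 2).
A placeholder subheading is appended so the entry has a child."
  (let* ((level (or level 2))
         (stars (make-string level ?*))
         (sub (make-string (1+ level) ?*)))
    (format "%s %s :drill:\n%s\n%s Answer\n"
            stars title (string-trim text) sub)))

(defvar my/ir-highlight-functions nil
  "Hypothetical hook run, with no arguments, after each new highlight.")

(defun my/ir-after-highlight (&rest _args)
  "Run `my/ir-highlight-functions' after a highlight is created."
  (run-hooks 'my/ir-highlight-functions))

;; Only install the advice once pdf-tools' annotation code is loaded.
(with-eval-after-load 'pdf-annot
  (advice-add 'pdf-annot-add-highlight-markup-annotation
              :after #'my/ir-after-highlight))
```

An extraction function of your own could then be added with (add-hook 'my/ir-highlight-functions #'your-function), and the formatted string inserted into the notes file.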

I use your org-pdftools package, and with org-noter-pdftools facilitating the extraction of a highlight, we could have entries in our org-noter file linked back to the highlight's position, which can then be drilled via org-drill and whittled down to cleaner entries as described by SuperMemo's Piotr Wozniak. I've tested this manually with a pipeline ending in "clean" cards eventually being sent to Anki, and I can attest that it is extremely effective.

My elisp isn't good enough to write such a function, but perhaps someone here could point me in the right direction. I've tried scrutinizing the relevant code in org-noter for clues, but can't really wrap my head around parts of it.

fuxialexander commented 4 years ago

For a new PDF, if you read and annotate it in org-noter with org-noter-pdftools, I believe all annotations will be added incrementally in the proper order (page order), so it seems you would not have the problem you mentioned?
Do you think the solution @UndeadKernel mentioned here https://github.com/weirdNox/org-noter/issues/94#issuecomment-608548791 solves your issue? If what you want is a function that syncs all existing annotations with the Org file every time it is called, I think what he mentioned will work and is feasible to achieve. We already have a variable, org-noter-pdftools-use-unique-org-id, that lets org-noter-pdftools store the annotation ID from PDF-tools, which is unique and will not change across usage.
I'm not very sure about the connection between org-drill and an incremental skeleton creation function, could you elaborate more?

UndeadKernel commented 4 years ago

Hey @fuxialexander, I was wondering about doing what you mention in your first point above. When I create a new annotation in org-noter, using pdf-annot-add-highlight-markup-annotation for example, am I supposed to see a new headline added with what was highlighted?

Syncing annotation comments with those also defined in org-noter would also be a rather nice thing to have. This way, you can more easily share the commented paper with others.

If you want support programming anything like this, give me a few pointers and I might be able to implement it myself.

fuxialexander commented 4 years ago

@UndeadKernel You need to select the text and call org-noter-insert-note rather than use pdf-annot-add-highlight-markup-annotation. Then the text will be highlighted (you can tweak the color using custom variables) and a heading will be inserted.

fuxialexander commented 4 years ago

@UndeadKernel For your second question, you might want to look into pdf-info-editannot and org-narrow-to-subtree.

(pdf-info-editannot 'annot-1-18 '((contents . "Some org text")))

This will set the contents of the annotation (e.g. a highlight) with the ID "annot-1-18" to "Some org text". (IDs are saved in the org-pdftools link and thus also in the Org heading property in org-noter files.) You just need to call org-narrow-to-subtree (or use some other way to get the text you want) inside save-excursion, grab the text, and call pdf-info-editannot to insert it.
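The steps above could be sketched roughly as follows. This is an untested outline, not the package's own code: my/org-subtree-body and my/push-subtree-to-annotation are hypothetical names, and the caller is assumed to already know the annotation ID and the PDF file. Only pdf-info-editannot (from pdf-tools) and standard Org functions are used.

```elisp
(require 'org)
(require 'subr-x)

;; Provided by pdf-tools; declared so this file loads without it.
(declare-function pdf-info-editannot "pdf-info"
                  (id modifications &optional file-or-buffer))

(defun my/org-subtree-body ()
  "Return the text of the Org subtree at point.
The heading line, planning line and property drawer are skipped."
  (save-excursion
    (save-restriction
      (org-back-to-heading t)
      (org-narrow-to-subtree)
      (goto-char (point-min))
      ;; Move past the heading and its metadata (properties, drawers).
      (org-end-of-meta-data t)
      (string-trim
       (buffer-substring-no-properties (point) (point-max))))))

(defun my/push-subtree-to-annotation (annot-id pdf-file)
  "Set the contents of annotation ANNOT-ID in PDF-FILE to the subtree text.
ANNOT-ID is a symbol such as \\='annot-1-18."
  (pdf-info-editannot annot-id
                      `((contents . ,(my/org-subtree-body)))
                      pdf-file))
```

With point inside a note heading, (my/push-subtree-to-annotation 'annot-1-18 "/path/to/paper.pdf") would copy that note into the annotation; the path here is only a placeholder.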

alessivs commented 4 years ago

@suiokami

I believe that we can create a workable PDF IR system by mostly stitching together existing tools.

Additionally, we can structure the extracted content for org-drill by adding the :drill: tag, making sure the header has a faux subheader so that org-drill recognizes the entry as a card etc.

I believe org-drill integration should only exist as a separate package. IR requires more than stitching existing tools together.

Firstly, org-drill is a terribly designed package (barring a proof-of-concept consideration), and mistaken in many sensitive aspects, which makes it a bad choice for coupling. It claims to implement algorithms (SM-5 and SM-8) where only an outline of the algorithms exists; late or mid-interval repetitions are not adjusted; sensitive algorithm-related features are implemented with absolutely no validation (neither theoretical nor user-guided); it measures flipped cards together (not separately); cards with randomized answer sides are also computed together; one bug sat for many months where its implementation of SM-2 was not even doing what it was meant to do, with absolutely no notice to the users; etc.

As an alternative to tight integration, I suggest developing an independent package, with an actual focus on IR, that:

- Records headline IDs.
- Keeps track of which IDs are active (subject to incremental review), and which ones are done (i.e. dismissed, to be skipped).
- Lists and visits active headings according to a priority function.
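As a rough illustration of the three points above, here is a minimal sketch. Every my/ir-* name is hypothetical, the default numeric priority (lower is reviewed first) is an assumption, and nothing is persisted between sessions. It shows only the bookkeeping, not an incremental reading scheduler. Visiting relies on org-id-goto from Org's org-id library.

```elisp
(require 'org-id)

(defvar my/ir-items (make-hash-table :test #'equal)
  "Map from Org headline ID to a plist (:status STATUS :priority NUMBER).")

(defun my/ir-record (id &optional priority)
  "Record headline ID as active, with numeric PRIORITY (default 50)."
  (puthash id (list :status 'active :priority (or priority 50))
           my/ir-items))

(defun my/ir-dismiss (id)
  "Mark headline ID as done so it is skipped in future reviews."
  (let ((item (gethash id my/ir-items)))
    (when item
      (puthash id (plist-put item :status 'done) my/ir-items))))

(defun my/ir-active-ids (&optional priority-fn)
  "Return active IDs sorted by PRIORITY-FN, lowest value first.
PRIORITY-FN is called with an ID and its plist and returns a number."
  (let ((fn (or priority-fn
                (lambda (_id item) (plist-get item :priority))))
        pairs)
    (maphash (lambda (id item)
               (when (eq (plist-get item :status) 'active)
                 (push (cons (funcall fn id item) id) pairs)))
             my/ir-items)
    (mapcar #'cdr (sort pairs (lambda (a b) (< (car a) (car b)))))))

(defun my/ir-visit-next ()
  "Jump to the active heading that sorts first, if there is one."
  (interactive)
  (let ((id (car (my/ir-active-ids))))
    (if id (org-id-goto id) (message "No active headings"))))
```

Passing a different function to my/ir-active-ids is where a reading list or a fuller priority queue could be plugged in later.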

org-drill is not enough to implement incremental reading, be it PDFs or any other document format. See: Minimum definition of incremental reading (spoiler: The incremental reading review function is part of a separate algorithm–not one of the SM-ones; this is important).

If you do not wish to implement mechanisms for review of Org headings such as a global priority queue, priority protection, overload management, and so on, it is still possible to implement incremental reading (in the SuperMemo 2000-2004 sense) by using reading lists (which are still prioritized lists). In any case, it is far more than org-drill provides.

Lastly, we are painting ourselves into a corner by limiting the proposed Incremental Reading process to the structure provided by highlights from PDFs. (The assumption here is that by focusing on highlights we can go back to them from our learning material.) A valuable portion of IR is incremental elaboration; this process potentially mutates a neat structure inherited from a PDF document into a more personal "cognitive structure" (if you will), where banking on back-referencing highlights from internalized knowledge material may not be practical (unless, maybe, you're memorizing poetry or lyrics). This new cognitive structure would be the Org reproduction/annotation/summary/elaboration of the PDF, and it is to be the new source of truth, from which active-recall cards will be derived. The PDF will only exist as passive reference, or for book-keeping purposes, or may simply disappear into oblivion.

EDIT: expanded below

My thinking, which I describe here, is that we have an incremental extraction of highlighted elements, or even more simply an advice-add or function that triggers an extraction via org-noter's code for every highlight made, as its made.

The mechanism for this process already exists thanks to a few interactive functions.

org-noter-create-skeleton reads the bookmarks of the outline section of the PDF, and recreates them in the synced org file with precise locations. In many cases it is a good enough structure to kick off a semantic (branch) review process (i.e. based on, or structured by, a hierarchy of topics marked by document sections)
org-noter-insert-note with a text selection, inserts a precise note and intelligently locates the appropriate Org heading candidates (usu. the corresponding document section of the created skeleton) to place it into; without a text selection, inserts a note anchored to the page.
org-noter-insert-precise-note is a godsend. With it, you can do by hand what org-noter-create-skeleton does when dealing with bookmark-less, image scan-only, or otherwise problematic PDFs (see: What's so hard about PDF text extraction?). Since it doesn't depend on PDF capabilities (except the ability to point to a coordinate of the page), and only deals with Org, you can insert a precise note pointing to anywhere. Run this function interactively whenever you see a structural element you would like to insert into the synced Org file; you remain in control of the outline structure at all times. If the PDF doesn't allow text selection, you fill this heading with your own text, and still have a precise location to go back to for further processing. You cannot do this with highlights.

The difference with your proposal is:

It doesn't bank on the ability of PDF text to be highlighted (works for any document).
Because it doesn't deal with PDF annotations, one is not constrained by the comparatively low number of possible annotation formats permitted by the PDF spec (however neatly the current org-pdftools package helps with this limitation)
You remain in control of the semantic structure; not the author or producer of the PDF.
The PDF is not the source of truth anymore. It is your elaboration of it and ultimately the active recall material that becomes part of long-term memory; it need not relate to a single PDF.

In short, the obvious: annotating Org is far better than annotating PDF.

The (somewhat) bad news is that it is still up to you how to schedule review of your structured notes. I export to SuperMemo itself (wrapping ox-clip for now) and use it to go back to one of the headings of the PDF-linked Org file when it tells me to (until, in the end, everything to be remembered is managed by SM). There's no reason one cannot do an implementation in pure elisp; I just want to point out that basing it off PDF annotations is an inferior approach, that a few conceptions about incremental reading embedded in the proposal may be slightly mistaken, and that the proposed feature perhaps should not be part of one of the existing noter packages.

fuxialexander / org-pdftools

Incremental reading of PDF's #22