fuxialexander / org-pdftools

A custom org link type for pdf-tools
GNU General Public License v3.0

Incremental reading of PDFs #22

Open nanjigen opened 4 years ago

nanjigen commented 4 years ago

Incrementally reading PDFs would be a huge boon to many academics using Emacs (of which I believe there are many), and would bring us one step closer to an integrated research environment.

I believe that we can create a workable PDF IR system by mostly stitching together existing tools. In org-noter, we have an extraction mechanism for highlights via its org-noter-create-skeleton function; the relevant section can be found here: https://github.com/weirdNox/org-noter/blob/9ead81d42dd4dd5074782d239b2efddf9b8b7b3d/org-noter.el#L1580

My thinking, which I describe here, is that we have an incremental extraction of highlighted elements or, even more simply, an advice-add or function that triggers an extraction via org-noter's code for every highlight, as it's made. Additionally, we can structure the extracted content for org-drill by adding the :drill: tag and making sure the header has a faux subheader, so that org-drill recognizes the entry as a card.
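The card-structuring half of this can be sketched with plain Org functions. This is only a rough illustration: `my/org-heading-to-drill-card` is a hypothetical name, not part of org-noter or org-drill, and whether a bare child heading is enough for org-drill to accept the entry as a card is the assumption stated above, not something verified here.

```elisp
(require 'org)

(defun my/org-heading-to-drill-card (&optional answer-title)
  "Tag the heading at point with :drill: and append a child heading.
ANSWER-TITLE names the child heading and defaults to \"Answer\".
Hypothetical helper; not provided by org-noter or org-drill."
  (interactive)
  (save-excursion
    (org-back-to-heading t)
    ;; Turn the tag on (a no-op if the heading already has it).
    (org-toggle-tag "drill" 'on)
    (let ((level (org-current-level)))
      ;; Move to the end of the subtree's last non-blank line.
      (org-end-of-subtree t)
      ;; Append the faux subheading one level deeper than the card.
      (insert "\n" (make-string (1+ level) ?*) " "
              (or answer-title "Answer")))))
```

Called with point inside an extracted highlight entry, it leaves the entry's text untouched and only adds the tag and the subheading.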

I use your org-pdftools package, and with org-noter-pdftools facilitating the extraction of a highlight, we could have linked entries in our org-noter file back to the highlight's position, which can then be drilled via org-drill and whittled down to cleaner entries as described by SuperMemo's Piotr Wozniak. I've tested this manually, with a pipeline ending in "clean" cards eventually being sent to Anki, and I can attest that it is extremely effective.

My elisp isn't good enough to write such a function, but perhaps someone here could point me in the right direction. I've tried scrutinizing the relevant code in org-noter for clues, but can't really wrap my head around parts of it.

fuxialexander commented 4 years ago
  1. For a new PDF, if you read and annotate it in org-noter with org-noter-pdftools, I believe all annotations will be added incrementally in the proper order (page order), which would seem to avoid the problem you mentioned?

  2. Do you think the solution @UndeadKernel mentioned here https://github.com/weirdNox/org-noter/issues/94#issuecomment-608548791 solves your issue? If what you want is a function that syncs all existing annotations with the org file every time it is called, I think what he mentioned will work and is feasible to achieve. We already have a variable, org-noter-pdftools-use-unique-org-id, that lets org-noter-pdftools store the PDF-tools annotation ID, which is unique and will not change across uses.

  3. I'm not very sure about the connection between org-drill and an incremental skeleton creation function, could you elaborate more?

UndeadKernel commented 4 years ago

Hey @fuxialexander, I was wondering about doing what you mention in your first point above. When I create a new annotation in org-noter, using pdf-annot-add-highlight-markup-annotation for example, am I supposed to see a new headline added with what was highlighted?

Syncing annotation comments with those also defined in org-noter would also be a rather nice thing to have. This way, you can more easily share the commented paper with others.

If you want support programming anything like this, give me a few pointers and I might be able to implement it myself.

fuxialexander commented 4 years ago

@UndeadKernel You need to select the text and call org-noter-insert-note rather than use pdf-annot-add-highlight-markup-annotation. Then the text will be highlighted (you can tweak the color using custom variables) and a heading will be inserted.

fuxialexander commented 4 years ago

@UndeadKernel For your second question, you might want to look into pdf-info-editannot and org-narrow-to-subtree.

(pdf-info-editannot 'annot-1-18 '((contents . "Some org text")))

This will set the contents of the annotation (e.g. a highlight) with the ID "annot-1-18" to "Some org text". IDs are saved in the org-pdftools link, and thus also in the org heading property in org-noter files. You then just need to get the text you want, for example with org-narrow-to-subtree inside save-excursion, and call pdf-info-editannot to insert it.
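Put together, the steps above might look roughly like the following. This is a sketch under assumptions, not tested against a live pdf-tools server: the `my/` function names are hypothetical, the caller is assumed to already know the annotation ID and the PDF path, and saving the modified PDF afterwards is left out.

```elisp
(require 'org)
(require 'subr-x)              ; string-trim on older Emacs versions
;; Loaded optionally so the Org-only helper still works without pdf-tools.
(require 'pdf-info nil t)
(declare-function pdf-info-editannot "pdf-info")

(defun my/org-subtree-body-text ()
  "Return the body of the Org subtree at point as a trimmed string.
The heading line, planning line and property drawer are skipped.
Hypothetical helper name."
  (save-excursion
    (save-restriction
      (org-narrow-to-subtree)
      ;; Moves to the first line after the heading's meta data.
      (org-end-of-meta-data)
      (string-trim
       (buffer-substring-no-properties (point) (point-max))))))

(defun my/org-push-subtree-to-annotation (annot-id pdf-file)
  "Copy the subtree body at point into annotation ANNOT-ID of PDF-FILE.
ANNOT-ID is a symbol such as annot-1-18.  Hypothetical helper name."
  (pdf-info-editannot annot-id
                      `((contents . ,(my/org-subtree-body-text)))
                      pdf-file))
```

For example, with point in the relevant heading of the notes file, `(my/org-push-subtree-to-annotation 'annot-1-18 "/path/to/paper.pdf")` would send that heading's text to the annotation; the path here is a placeholder.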

alessivs commented 4 years ago

@suiokami

I believe that we can create a workable PDF IR system by mostly stitching together existing tools.

Additionally, we can structure the extracted content for org-drill by adding the :drill: tag, making sure the header has a faux subheader so that org-drill recognizes the entry as a card etc.

I believe org-drill integration should only exist as a separate package. IR requires more than stitching existing tools together.

Firstly, org-drill is a terribly designed package (unless considered as a proof of concept), and mistaken in many significant respects, which makes it a bad choice for coupling. It claims to implement algorithms (SM-5 and SM-8) where only an outline of the algorithms exists; late or mid-interval repetitions are not adjusted; sensitive algorithm-related features are implemented with absolutely no validation (neither theoretical nor user-guided); it measures flipped cards together (not separately); cards with randomized answer sides are also computed together; one bug sat for many months where its implementation of SM-2 was not even doing what it was meant to do, with absolutely no notice to the users; etc.

As an alternative to tight integration, I suggest developing an independent package, with an actual focus on IR, that:

org-drill is not enough to implement incremental reading, be it PDFs or any other document format. See: Minimum definition of incremental reading (spoiler: The incremental reading review function is part of a separate algorithm–not one of the SM-ones; this is important).

If you do not wish to implement mechanisms for review of Org headings such as a global priority queue, priority protection, overload management, and so on, it is still possible to implement incremental reading (in the SuperMemo 2000-2004 sense) by using reading lists (which are still prioritized lists). In any case, it is far more than org-drill provides.
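To make "reading list" concrete, here is a minimal sketch of a prioritized list in elisp. Everything in it is an assumption made for illustration: the `my/ir-` names, the use of Org IDs as entries, and the convention that a lower number means higher priority. It deliberately leaves out scheduling, overload management and priority protection.

```elisp
(require 'cl-lib)

(defvar my/ir-reading-list nil
  "Hypothetical reading list: a list of (PRIORITY . ORG-ID) pairs.
Kept sorted so that the lowest PRIORITY number comes first.")

(defun my/ir-add (org-id priority)
  "Add ORG-ID with numeric PRIORITY, replacing any earlier entry for it."
  (setq my/ir-reading-list
        (cl-sort (cons (cons priority org-id)
                       ;; Drop a previous entry for the same ID, if any.
                       (cl-remove org-id my/ir-reading-list
                                  :key #'cdr :test #'equal))
                 #'< :key #'car)))

(defun my/ir-pop-next ()
  "Remove and return the ORG-ID to read next, or nil if the list is empty."
  (cdr (pop my/ir-reading-list)))
```

Re-adding an ID with a new number is how an entry would be reprioritized in this sketch.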

Lastly, we are painting ourselves into a corner by limiting the proposed Incremental Reading process to the structure provided by highlights from PDFs. (The assumption here is that by focusing on highlights we can go back to them from our learning material.) A valuable portion of IR is incremental elaboration; this process potentially mutates a neat structure inherited from a PDF document into a more personal "cognitive structure" (if you will), where banking on back-referencing highlights from internalized knowledge material may not be practical (unless, maybe, you're memorizing poetry or lyrics). This new cognitive structure would be the Org reproduction/annotation/summary/elaboration of the PDF, and it is to be the new source of truth, from which active-recall cards will be derived. The PDF will only exist as a passive reference, or for book-keeping purposes, or may simply disappear into oblivion.

EDIT: expanded below

My thinking, which I describe here, is that we have an incremental extraction of highlighted elements, or even more simply an advice-add or function that triggers an extraction via org-noter's code for every highlight made, as its made.

The mechanism for this process already exists thanks to a few interactive functions.

The difference with your proposal is:

In short, the obvious: annotating Org is far better than annotating PDF.

The (somewhat) bad news is that it is still up to you how to schedule review of your structured notes. I export to SuperMemo itself (wrapping ox-clip for now) and use it to go back to one of the headings of the PDF-linked Org file when it tells me to (until, in the end, everything to be remembered is managed by SM). There's no reason one cannot do an implementation in pure elisp; I just want to point out that basing it off PDF annotations is an inferior approach, that a few conceptions about incremental reading embedded in the proposal may be slightly mistaken, and that the proposed feature perhaps should not be part of one of the existing noter packages.