[Feature] Better support for annotated text extraction

RyotaUshio commented 2 weeks ago

Currently, Obsidian's native PDF viewer can extract annotated text (that is, text marked up by a text markup annotation, such as a highlight, written inside the PDF file) only when textContentItem.char is present and non-empty. (char is a property of TextContentItem that exists only in Obsidian's version of PDF.js.) This is due to how the Obsidian team implemented PDFViewerChild.getTextByRect.

The two PDFs in #223 are good examples: in the first one, the text content items have a non-empty char, whereas in the second one char is empty.

Since PDF++ uses the getTextByRect method as-is, this restriction also applies to PDF++.

[!note] In short, this is a problem in Obsidian itself, not in PDF++. PDF++ just inherits the issue from the Obsidian core.

However, it's probably possible to fix this on PDF++'s end by measuring the bounding box of text divs using Range.getBoundingClientRect and monkey-patching the method.

This would benefit users whose PDFs have text content items with an empty or missing char.
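
A rough sketch of the measuring idea, not the actual PDF++ code: measure one box per character with a Range, then keep the characters whose box falls inside the annotation rectangle. The names measureCharBoxes and textInRect are hypothetical, and the sketch assumes the annotation rectangle has already been converted to the same viewport coordinates that getBoundingClientRect returns. The Range factory is passed in as a parameter so the geometry can be exercised without a browser.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings;
// real DOMRect, Range and Text objects satisfy them.
interface RectLike { left: number; top: number; right: number; bottom: number }
interface RangeLike {
    setStart(node: unknown, offset: number): void;
    setEnd(node: unknown, offset: number): void;
    getBoundingClientRect(): RectLike;
}
interface CharBox { char: string; rect: RectLike }

/** Measure one box per UTF-16 code unit of a text node (hypothetical helper). */
function measureCharBoxes(
    textNode: { textContent: string | null },
    createRange: () => RangeLike, // in the browser: () => document.createRange()
): CharBox[] {
    const text = textNode.textContent ?? '';
    const range = createRange();
    const boxes: CharBox[] = [];
    for (let i = 0; i < text.length; i++) {
        // Select exactly one code unit and ask the layout engine where it is.
        range.setStart(textNode, i);
        range.setEnd(textNode, i + 1);
        const { left, top, right, bottom } = range.getBoundingClientRect();
        boxes.push({ char: text.charAt(i), rect: { left, top, right, bottom } });
    }
    return boxes;
}

/** Keep the characters whose box center lies inside the target rectangle. */
function textInRect(boxes: CharBox[], target: RectLike): string {
    return boxes
        .filter(({ rect }) => {
            const cx = (rect.left + rect.right) / 2;
            const cy = (rect.top + rect.bottom) / 2;
            return target.left <= cx && cx <= target.right
                && target.top <= cy && cy <= target.bottom;
        })
        .map(({ char }) => char)
        .join('');
}
```

Testing the box center rather than requiring full containment is a design choice of this sketch: it tolerates annotation rectangles that clip a glyph slightly at the edges.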

RyotaUshio commented 2 weeks ago

Basically we can just reuse this snippet: https://github.com/RyotaUshio/obsidian-pdf-plus/blob/d9f9d20084d794c928343ae6860322dcc9f7442a/src/vim/text-structure-parser.ts#L201-L225

However, the snippet assumes that textDiv.childNodes[0] is a text node, which can be false when search matches are rendered, so we need to fix that first.
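
One way to drop the childNodes[0] assumption, sketched under the guess that rendered search matches wrap part of the text in child elements: collect every descendant text node, then map an offset into the div's full text onto a (text node, local offset) pair that Range.setStart and Range.setEnd can use. collectTextNodes and locateOffset are hypothetical names, not what the linked snippet or the released fix uses.

```typescript
// Structural stand-in for a DOM Node; real Node objects satisfy it.
interface NodeLike {
    nodeType: number;
    textContent: string | null;
    childNodes: ArrayLike<NodeLike>;
}

const TEXT_NODE = 3; // same value as Node.TEXT_NODE

/** Collect every descendant text node in document order. */
function collectTextNodes(root: NodeLike): NodeLike[] {
    if (root.nodeType === TEXT_NODE) return [root];
    const result: NodeLike[] = [];
    for (let i = 0; i < root.childNodes.length; i++) {
        const child = root.childNodes[i];
        if (child) result.push(...collectTextNodes(child));
    }
    return result;
}

/**
 * Map an offset into the div's full text to a text node and a local offset.
 * An offset on a node boundary resolves to the end of the earlier node.
 */
function locateOffset(
    root: NodeLike,
    offset: number,
): { node: NodeLike; offset: number } | null {
    if (offset < 0) return null;
    let remaining = offset;
    for (const node of collectTextNodes(root)) {
        const length = (node.textContent ?? '').length;
        if (remaining <= length) return { node, offset: remaining };
        remaining -= length;
    }
    return null; // offset lies beyond the end of the text
}
```

In the browser, document.createTreeWalker(textDiv, NodeFilter.SHOW_TEXT) would do the same traversal; the recursive version is used here only so the logic can be checked with plain objects.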

RyotaUshio commented 2 weeks ago

Released in 0.40.8.

See https://github.com/RyotaUshio/obsidian-pdf-plus/blob/72869449b006ce43b704b3aca3e06fadc6867996/src/patchers/pdf-internals.ts#L855

RyotaUshio / obsidian-pdf-plus

[Feature] Better support for annotated text extraction #224