Closed RyotaUshio closed 2 weeks ago
Basically we can just reuse this snippet: https://github.com/RyotaUshio/obsidian-pdf-plus/blob/d9f9d20084d794c928343ae6860322dcc9f7442a/src/vim/text-structure-parser.ts#L201-L225
But we need to fix the issue that it assumes `textDiv.childNodes[0]` is a text node, which can be wrong when search matches are rendered.
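One way to drop the `childNodes[0]` assumption is to walk the text div and collect every descendant text node, so that text wrapped in extra elements (as happens when search matches are rendered) is still found. The following is only a rough sketch, not the code from the linked snippet: `NodeLike`, `collectTextNodes`, and `locateOffset` are hypothetical names, and `NodeLike` is a minimal structural type that a real DOM `Node` should satisfy.

```typescript
// Minimal structural type (an assumption for this sketch) so the code also
// runs outside a browser; a real DOM Node has all of these members.
interface NodeLike {
    nodeType: number;
    textContent: string | null;
    childNodes: ArrayLike<NodeLike>;
}

const TEXT_NODE = 3; // same value as Node.TEXT_NODE

/** Collect all descendant text nodes of a text div in document order. */
function collectTextNodes(root: NodeLike): NodeLike[] {
    const result: NodeLike[] = [];
    const visit = (node: NodeLike): void => {
        if (node.nodeType === TEXT_NODE) {
            result.push(node);
            return;
        }
        for (let i = 0; i < node.childNodes.length; i++) {
            visit(node.childNodes[i]);
        }
    };
    visit(root);
    return result;
}

/** Map a character offset within the whole text div to a (text node, offset) pair. */
function locateOffset(
    textDiv: NodeLike,
    offset: number,
): { node: NodeLike; offset: number } | null {
    if (offset < 0) return null;
    let remaining = offset;
    for (const node of collectTextNodes(textDiv)) {
        const length = (node.textContent ?? '').length;
        if (remaining < length) return { node, offset: remaining };
        remaining -= length;
    }
    return null; // offset is past the end of the text
}
```

With something like this, code that needs a `Range` at the n-th character of a text div can call `locateOffset(textDiv, n)` instead of indexing into `textDiv.childNodes[0]` directly.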
Currently, Obsidian's native PDF viewer can extract annotated text (i.e., text marked up by a text markup annotation, such as a highlight, written inside the PDF file) only when `textContentItem.char` is present and nonempty (`char` is a property of `TextContentItem` present only in Obsidian's version of PDF.js). This is due to how the Obsidian team implements `PDFViewerChild.getTextByRect`.

The two PDFs appearing in #223 are good examples. The first one has text content items with nonempty `char`, whereas in the second it is empty.

Since PDF++ uses the `getTextByRect` method as-is, this restriction also applies to PDF++.

However, it is probably possible to fix this on PDF++'s end by measuring the bounding boxes of text divs using `Range.getBoundingClientRect` and monkey-patching the method. This would benefit some users.
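A rough sketch of what such a fallback might look like, under several assumptions: `textInRect`, `centerInside`, and `patchWithFallback` are hypothetical helpers, the target rectangle is assumed to already be in the same (viewport) coordinate space as the measured boxes, and the actual signature of `getTextByRect` is not assumed here. `RangeLike` only mirrors the parts of the DOM `Range` API used below, so in a browser one could pass `() => document.createRange()`.

```typescript
interface RectLike { left: number; top: number; right: number; bottom: number; }

// Mirrors the subset of the DOM Range API used in this sketch.
interface RangeLike<N> {
    setStart(node: N, offset: number): void;
    setEnd(node: N, offset: number): void;
    getBoundingClientRect(): RectLike;
}

/** True if the center of `inner` lies within `outer`. */
function centerInside(inner: RectLike, outer: RectLike): boolean {
    const x = (inner.left + inner.right) / 2;
    const y = (inner.top + inner.bottom) / 2;
    return outer.left <= x && x <= outer.right && outer.top <= y && y <= outer.bottom;
}

/** Collect the characters whose measured box is centered inside `target`. */
function textInRect<N extends { textContent: string | null }>(
    textNodes: N[],
    target: RectLike,
    createRange: () => RangeLike<N>,
): string {
    let text = '';
    for (const node of textNodes) {
        const content = node.textContent ?? '';
        for (let i = 0; i < content.length; i++) {
            // Measure one character at a time.
            const range = createRange();
            range.setStart(node, i);
            range.setEnd(node, i + 1);
            if (centerInside(range.getBoundingClientRect(), target)) text += content[i];
        }
    }
    return text;
}

/**
 * Replace `obj[name]` with a wrapper that calls `fallback` only when the
 * original method returns an empty string. Returns a function that undoes the patch.
 */
function patchWithFallback<A extends unknown[]>(
    obj: Record<string, unknown>,
    name: string,
    fallback: (...args: A) => string,
): () => void {
    const original = obj[name] as (...args: A) => string;
    obj[name] = function (this: unknown, ...args: A): string {
        return original.apply(this, args) || fallback.apply(this, args);
    };
    return () => { obj[name] = original; };
}
```

Keeping the original method as the first choice means PDFs whose text content items already have a nonempty `char` would behave as before, and the measured fallback would only kick in when the native extraction comes back empty.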