Closed sailist closed 3 years ago
I guess mupdf has access to the text -- a list of text objects anchored to the pages.
Thanks, I have some ideas now. It seems you don't provide an API to get the text and its boundary information?
I'm not the maintainer of this repo -- just watching this because I find this interesting, and may start building some kind of viewer based on this and Avalonia. :)
Hi! Yes, that is correct, MuPDF has a way of returning the text in the page, but there is currently no managed api in MuPDFCore to access that information.
TL;DR: implementing this requires a good amount of work; I will try to look at it when I have time, but I have no idea when that will be.
Implementing such an API would most likely not be exactly trivial:
- On the native side, you would first need to obtain an fz_stext_page.
- Then, you would get the fz_stext_block items from the fz_stext_page.
- Each fz_stext_block has a bounding box and contains a single image or a list of fz_stext_line items.
- Each fz_stext_line has a bounding box, a direction (which is useful e.g. if the text has been rotated and is not horizontal), and contains a list of fz_stext_char items.
- Each fz_stext_char has a Unicode code point, a colour, an origin (i.e. the point at which the start of the glyph's baseline is located), a quad (which is like a bounding box, except that its sides are not necessarily parallel to the x/y axes), a size and a font.

On the C# side, you would have to define a "Block" interface/abstract class, with two implementations ("TextBlock" for blocks containing lists of lines, and "ImageBlock" for blocks containing images). The "Block" interface would define a bounding box property.
A "TextBlock" would contain a list of "TextLine"s, whose properties would be a bounding box, a direction, a string representing the content of the line and possibly a few arrays of attributes of the glyphs (e.g. a list of colours, origins etc). The font structure would probably be too complex to pass to C# in a useful way without additional libraries. Converting between a Unicode code point and a C# char will probably be fun.
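To illustrate why the code point conversion is "fun": a C# char is a 16-bit UTF-16 code unit, so any code point above U+FFFF needs two of them (a surrogate pair). On the .NET side, char.ConvertFromUtf32 should take care of this. The Python sketch below only shows the arithmetic involved; the function name is made up for illustration and is not part of MuPDF or MuPDFCore.

```python
from typing import List


def code_point_to_utf16_units(code_point: int) -> List[int]:
    """Split one Unicode code point into the UTF-16 code units a C# string would hold."""
    # Surrogate values and anything above U+10FFFF are not valid scalar values.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {code_point:#x}")
    if code_point < 0x10000:
        return [code_point]  # fits in a single 16-bit unit
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]
```

The practical consequence is that a per-glyph attribute array (colours, origins etc.) and the line's string can have different lengths, so glyph index and string index cannot be assumed to match.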
An "ImageBlock" would have the transform matrix as an additional property, and could contain a raw binary representation of the image data. If this is the case, somewhere in the extraction process there must be a flag to avoid collecting this data if not necessary, to avoid the associated waste in memory and time.
Once all the relevant stuff has been passed to managed objects and the raw pointers are not needed, the unmanaged code should free all resources that it has allocated (e.g. the device, text page etc.), which means that some pointers will need to be passed back and forth from managed to unmanaged code to keep track of the references.
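The object model described above could look roughly like the following. This is a language-neutral sketch written in Python rather than C#, and every name in it (Block, TextBlock, ImageBlock, TextLine, make_image_block, include_image_data) is an assumption for illustration, not an existing MuPDFCore type. The image loader is passed as a callable so that the pixel data is only read when the flag asks for it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Matrix = Tuple[float, float, float, float, float, float]


@dataclass
class Block:
    bbox: Rect  # every block exposes a bounding box


@dataclass
class TextLine:
    bbox: Rect
    direction: Tuple[float, float]  # writing direction of the line
    text: str
    origins: List[Tuple[float, float]] = field(default_factory=list)


@dataclass
class TextBlock(Block):
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class ImageBlock(Block):
    transform: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    data: Optional[bytes] = None  # stays None unless explicitly requested


def make_image_block(bbox: Rect, transform: Matrix,
                     load_pixels: Callable[[], bytes],
                     include_image_data: bool = False) -> ImageBlock:
    """Build an ImageBlock, calling load_pixels only when the flag is set."""
    data = load_pixels() if include_image_data else None
    return ImageBlock(bbox, transform, data)
```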
All in all this is an interesting problem, and it would probably not be impossible, but it does require a fair amount of work (also to make sure that exceptions are handled correctly, there are no memory leaks etc). I will see if I can have a look when I have time, but I don't know when that will happen 😅
However, extracting the text and bounding boxes from the PDF is only half the work: once you have those, you need to figure out what the user selected, based e.g. on the point where they clicked at the beginning and the current position of the mouse (if they are dragging the selection).
A helper method to figure out which glyph (of which line of which block) a certain point corresponds to should be easy to write, but once you get the "start glyph" and "end glyph", you need to decide which glyphs are "between" those two... That is easy if the start and end are both on the same line, but it gets tricky if they are on different lines or different blocks (especially if you have text that flows in multiple directions like left-to-right, right-to-left, vertical, rotated by 45 degrees etc).
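A minimal sketch of the easy part, assuming axis-aligned glyph boxes and assuming that "between" simply means "between in the order the blocks, lines and glyphs are stored". The page layout (nested lists) and the function names are invented for the example; the multi-direction cases mentioned above are exactly what this does not handle.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]   # (x0, y0, x1, y1)
Glyph = Tuple[str, Rect]                   # (character, bounding box)
Page = List[List[List[Glyph]]]             # blocks -> lines -> glyphs
Address = Tuple[int, int, int]             # (block, line, glyph) indices


def hit_test(page: Page, x: float, y: float) -> Optional[Address]:
    """Return the address of the first glyph whose box contains (x, y)."""
    for b, block in enumerate(page):
        for l, line in enumerate(block):
            for g, (_, (x0, y0, x1, y1)) in enumerate(line):
                if x0 <= x <= x1 and y0 <= y <= y1:
                    return (b, l, g)
    return None


def glyphs_between(page: Page, start: Address, end: Address) -> List[Address]:
    """List every glyph address from start to end (inclusive) in storage order."""
    # Tuples compare lexicographically, which matches block/line/glyph order.
    first, last = sorted((start, end))
    return [(b, l, g)
            for b, block in enumerate(page)
            for l, line in enumerate(block)
            for g in range(len(line))
            if first <= (b, l, g) <= last]
```

Sorting the two endpoints first means that dragging the selection backwards (end before start) gives the same result as dragging forwards.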
Then, you need to figure out how to show the selection: you could highlight the text by painting a semi-transparent rectangle in front of it (like SumatraPDF does), but you need to decide the correct shape of the rectangle, as different glyphs have different sizes... You could start by drawing a separate rectangle for each glyph, but that would be ugly (and probably slow for large amounts of text); otherwise, you could draw the smallest rectangle that contains all the glyphs in one line, but you need some non-trivial maths to take care of lines with arbitrarily rotated text. Then, you also need a way to "join" overlapping rectangles to avoid the overlap being painted twice - and this is also annoying because the union of two rectangles is not necessarily a rectangle.
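One possible way to handle the rotated-line case, sketched under the assumption that each glyph comes with a quad (four corner points) and each line with a direction vector: project all the corners onto the line direction and onto its normal, take the extremes, and map the four extreme combinations back to page coordinates. The function name is hypothetical, and this does not address joining overlapping shapes.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def line_selection_quad(quads: Sequence[Sequence[Point]],
                        direction: Point) -> List[Point]:
    """Smallest rectangle aligned with `direction` that contains every quad corner."""
    length = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / length, direction[1] / length  # unit vector along the line
    nx, ny = -uy, ux                                       # unit normal to the line
    corners = [p for quad in quads for p in quad]
    along = [x * ux + y * uy for x, y in corners]          # coordinate along the line
    across = [x * nx + y * ny for x, y in corners]         # coordinate across the line
    a0, a1 = min(along), max(along)
    c0, c1 = min(across), max(across)
    # Convert the four (along, across) extremes back to page coordinates.
    return [(a * ux + c * nx, a * uy + c * ny)
            for a, c in ((a0, c0), (a1, c0), (a1, c1), (a0, c1))]
```

For a horizontal line (direction (1, 0)) this reduces to the ordinary bounding box of the glyphs.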
For example, look at this screenshot from SumatraPDF:
The word "Acidobacteria" is actually split over six lines (note the number of rectangles that make up the selection shape) and if you try to copy and paste it you get:
Ac
id
ob
ac
te
ria
Quite interestingly, Adobe Reader actually manages to get the copy-paste right, although the way it highlights every glyph seems weird:
All of this boils down to the fact that the PDF format does not have any notion of a "body of text", because "text" in a PDF is nothing more than a series of individually positioned and painted glyphs. MuPDF (like any other PDF library) uses some heuristics to try and get this right, and these work acceptably well in the most "vanilla" cases, but you cannot rely on them too much in general. Also, I have never been exposed to documents written in anything other than Latin script, but I imagine these issues would be even worse if you are dealing with Middle-Eastern and Asian languages that do not use a simple left-to-right, top-to-bottom layout...
Got it! Thanks for your detailed reply!
> Then, you need to figure out how to show the selection: you could highlight the text by painting a semi-transparent rectangle in front of it (like SumatraPDF does), but you need to decide the correct shape of the rectangle, as different glyphs have different sizes... You could start by drawing a separate rectangle for each glyph, but that would be ugly (and probably slow for large amounts of text); otherwise, you could draw the smallest rectangle that contains all the glyphs in one line, but you need some non-trivial maths to take care of lines with arbitrarily rotated text. Then, you also need a way to "join" overlapping rectangles to avoid the overlap being painted twice - and this is also annoying because the union of two rectangles is not necessarily a rectangle.
I guess for PDF and HTML alike there's one straightforward way of implementing selection -- just rely on the order of the text block in the serialized representation. Consider a DOM:
<html><body>
<div> <p> Paragraph 1 </p> <div> <p> Paragraph 2 </p> <div> <p> Paragraph 3 </p> </div> </div> </div>
<div> <p> Paragraph 4 </p> </div>
</body></html>
If the hit test says the range spans from paragraph 2 to paragraph 4, then paragraph 3 is included. The textual selection would be the concatenation of these blocks, and the visual representation would be the union of the bounding boxes (maybe relaxed a little bit to allow easier merging).
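A small sketch of that idea, assuming the blocks are already available as a flat list in serialization order, each with its text and an axis-aligned bounding box. The names are made up for the example, and the "union" here is the relaxed version: the single rectangle enclosing all selected boxes, not the exact union shape.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def enclosing_rect(a: Rect, b: Rect) -> Rect:
    """Smallest axis-aligned rectangle that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def select_blocks(blocks: List[Tuple[str, Rect]],
                  start: int, end: int) -> Tuple[str, Rect]:
    """Select every block between the two hit-tested indices, in stored order."""
    first, last = sorted((start, end))
    chosen = blocks[first:last + 1]
    text = "\n".join(block_text for block_text, _ in chosen)
    box = chosen[0][1]
    for _, rect in chosen[1:]:
        box = enclosing_rect(box, rect)
    return text, box


paragraphs = [
    ("Paragraph 1", (0, 0, 100, 10)),
    ("Paragraph 2", (10, 12, 100, 22)),
    ("Paragraph 3", (20, 24, 100, 34)),
    ("Paragraph 4", (0, 40, 100, 50)),
]
# Hit test landed on paragraph 2 and paragraph 4: paragraph 3 comes along.
text, box = select_blocks(paragraphs, 1, 3)
```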
The good thing is that this implementation is very straightforward. The bad thing is, this is unfortunately why sometimes text selection doesn't work that well, and selects something far away with no other apparent reasons 😅
Yes, that would probably be a sensible way to deal with it.
The problem is always that the text in a PDF does not necessarily have to appear in the "source" in the same order as it appears in the finished document (and actually the same is also true of an HTML page: for example, you could use CSS styles to move paragraph 4 above paragraph 3 or hide paragraph 2).
However, MuPDF maintains that the blocks it returns should be in "natural reading order", hence I expect that it should still be possible to obtain reasonable results with this approach - at least in simple cases.
Ok, I think I have managed to get a reasonably decent implementation. v1.2.0 now supports generating a MuPDFStructuredTextPage containing structured text information (with support for hit-testing, searching, and delimiting text regions). The MuPDFRenderer now also does text selection and searching. Let me know what you think!
The order of the text is the same as what MuPDF returns, which is apparently (according to a comment in the source code) "the order in which text appears in the file, so may not be accurate". At least, it appears to be the same as SumatraPDF, the Chrome PDF viewer, and Adobe Reader.
I have seen people somewhere suggesting to sort blocks/lines/words/glyphs top-to-bottom and left-to-right (e.g. https://github.com/pymupdf/PyMuPDF/wiki/How-to-extract-text-in-natural-reading-order-(up2down,-left2right) or https://www.tallcomponents.com/pdfkit4/extract-glyphs-from-pdf-and-sort), but this clearly does not work when there are multiple columns or, worse, a single page has both a full-width section as well as a section with columns (e.g. the first page in many scientific papers).
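A toy example of why the top-to-bottom, left-to-right sort breaks with columns. The block tuples and the function name are invented for the illustration; real pages would of course have less tidy coordinates.

```python
from typing import List, Tuple

Block = Tuple[str, float, float]  # (text, x of left edge, y of top edge)


def naive_reading_order(blocks: List[Block]) -> List[str]:
    """Sort blocks top-to-bottom, then left-to-right, and return their text."""
    return [text for text, _, _ in sorted(blocks, key=lambda b: (b[2], b[1]))]


# Two columns of two paragraphs each; a reader expects L1, L2, R1, R2.
two_columns = [
    ("L1", 0, 0), ("L2", 0, 20),
    ("R1", 200, 0), ("R2", 200, 20),
]
print(naive_reading_order(two_columns))  # ['L1', 'R1', 'L2', 'R2']
```

The sort interleaves the two columns because paragraphs at the same height are treated as belonging to the same row.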
I think it would be an interesting problem to get an AI involved with, but I assume that if neither the developers of MuPDF, nor those at Google, nor those at Adobe have managed to get around this issue, it is certainly way out of my league 😅
COOL!
like SumatraPDF, the text part can be selected and copied. Any advice? :)