ArtifexSoftware / mupdf.js

JavaScript bindings for MuPDF
https://mupdfjs.readthedocs.io
GNU Affero General Public License v3.0

Implement "Match Case" and "Whole Words" search #110

Open xybei opened 2 months ago

xybei commented 2 months ago

The search function cannot match line-wrapped words (as pdf.js does in the figure below). It also does not support "Match Case" or "Whole Words". I hope this can be improved, thanks!

test.pdf

image

jamie-lemon commented 2 months ago

Agreed, the search has no case-sensitivity option at present. However, I do believe MuPDF.js detects line-wrapped words. If I use your file with: let results = page.search("Hello world")

It delivers 3 results as follows:

[
    [
        [
            72,
            75.22499084472656,
            149.67677307128906,
            75.22499084472656,
            72,
            91.24498748779297,
            149.67677307128906,
            91.24498748779297
        ]
    ],
    [
        [
            287.1199951171875,
            75.22499084472656,
            360.5367431640625,
            75.22499084472656,
            287.1199951171875,
            91.24498748779297,
            360.5367431640625,
            91.24498748779297
        ]
    ],
    [
        [
            505.17999267578125,
            75.22499084472656,
            540.9913940429688,
            75.22499084472656,
            505.17999267578125,
            91.24498748779297,
            540.9913940429688,
            91.24498748779297,
            72,
            96.28498840332031,
            109.5767593383789,
            96.28498840332031,
            72,
            112.30498504638672,
            109.5767593383789,
            112.30498504638672
        ]
    ]
]

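These are QuadPoints, which represent the areas where the text was found (see: https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/document/index.html#searching-a-document). Each quad is eight numbers (four x, y corner pairs). The third result lists sixteen numbers, i.e. two quads, because that match starts at the end of one line and continues at the start of the next.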
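To make the output above easier to read, here is a minimal sketch that turns each hit into plain bounding rectangles. It assumes only the shape shown in this thread (arrays of numbers in groups of eight); the helper names splitQuads, quadToRect and hitsToRects are hypothetical and are not part of the MuPDF.js API.

```javascript
// Hypothetical helpers (not part of MuPDF.js): convert search hits made of
// quad coordinates into axis-aligned rectangles.

// Split a flat array of numbers into quads of 8 numbers each
// (four x, y corner pairs).
function splitQuads(flat) {
  if (flat.length % 8 !== 0) {
    throw new Error("Expected a multiple of 8 numbers, got " + flat.length);
  }
  const quads = [];
  for (let i = 0; i < flat.length; i += 8) {
    quads.push(flat.slice(i, i + 8));
  }
  return quads;
}

// Compute the bounding rectangle of one quad. Using min/max means the
// result does not depend on the order in which the corners are listed.
function quadToRect(quad) {
  const xs = [quad[0], quad[2], quad[4], quad[6]];
  const ys = [quad[1], quad[3], quad[5], quad[7]];
  return {
    x0: Math.min(...xs),
    y0: Math.min(...ys),
    x1: Math.max(...xs),
    y1: Math.max(...ys),
  };
}

// Map every hit to a list of rectangles, one rectangle per quad.
// Works whether a hit entry holds 8 numbers or several quads concatenated.
function hitsToRects(results) {
  return results.map((hit) =>
    hit.flatMap((entry) => splitQuads(entry).map(quadToRect))
  );
}

// The third hit from the output above: a match that wraps across two lines.
const wrappedHit = [
  [
    505.17999267578125, 75.22499084472656, 540.9913940429688, 75.22499084472656,
    505.17999267578125, 91.24498748779297, 540.9913940429688, 91.24498748779297,
    72, 96.28498840332031, 109.5767593383789, 96.28498840332031,
    72, 112.30498504638672, 109.5767593383789, 112.30498504638672,
  ],
];

// Yields two rectangles for this single hit, one per line of text.
console.log(hitsToRects([wrappedHit]));
```

A hit with more than one rectangle is a match that spans more than one line, which is how the line-wrapped "Hello world" shows up in the results.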

I also tested here: https://casper.mupdf.com/wasm/demo/. I uploaded your test.pdf file and searched for "hello world", and the UI highlighted these areas on the document:

Screenshot 2024-08-28 at 13 09 34

So I think the search method does find all three of the instances that you expect, with the correct bounding box data in the QuadPoints!
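On the "Match Case" and "Whole Words" part of the request, the following is only a sketch of what the two options mean, written against a plain string. It is not MuPDF.js code: findMatches is a hypothetical helper, it assumes the page text has already been obtained as a string by some other means, and it does not map matches back to QuadPoints.

```javascript
// Hypothetical helper (not part of MuPDF.js): find occurrences of `needle`
// in a plain string, with optional "Match Case" and "Whole Words" filtering.
function findMatches(text, needle, { matchCase = false, wholeWords = false } = {}) {
  if (needle.length === 0) return [];

  // For case-insensitive search, compare lowercased copies.
  // Simplification: assumes lowercasing does not change string length,
  // which holds for most text but not for every Unicode character.
  const haystack = matchCase ? text : text.toLowerCase();
  const target = matchCase ? needle : needle.toLowerCase();

  // A "word character" here is any Unicode letter, digit, or underscore.
  const isWordChar = (ch) => ch !== undefined && /[\p{L}\p{N}_]/u.test(ch);

  const matches = [];
  let from = 0;
  while (true) {
    const index = haystack.indexOf(target, from);
    if (index === -1) break;
    const end = index + target.length;
    // A whole-word match must not touch a word character on either side.
    const bounded = !isWordChar(text[index - 1]) && !isWordChar(text[end]);
    if (!wholeWords || bounded) {
      matches.push({ index, text: text.slice(index, end) });
    }
    from = index + 1;
  }
  return matches;
}

const sample = "Hello world, hello worlds";
console.log(findMatches(sample, "hello world"));
console.log(findMatches(sample, "hello world", { matchCase: true }));
console.log(findMatches(sample, "hello world", { wholeWords: true }));
```

With the sample string, the default search finds both occurrences, "Match Case" keeps only the lowercase one, and "Whole Words" keeps only the first, since the second is followed by "s".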

xybei commented 2 months ago

Sorry, I made a mistake. MuPDF.js does support searching for line-wrapped words.