CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Improve indexing and searching for separated words #446

Open mrjbq7 opened 11 months ago

mrjbq7 commented 11 months ago

I have a web page that contains:

fixnum-log2

It would be nice to find this by searching log2

Same for

uniform-random-float

It would be nice to find with "random-float"

bglw commented 11 months ago

Hey @mrjbq7 👋

Are you on a 1.0.X release of Pagefind? 1.0 added exactly this as a feature, so it should already be working!

As an example, search the pagefind.app documentation for “ignore” and you’ll see results appear for “data-pagefind-ignore”

Let me know if you are on latest and not seeing this, ideally with a sample HTML file, and I can dig into it

mrjbq7 commented 11 months ago

I don't want to ignore it, I want to make sure search can access the index. It's not available in the search index for those two examples.

I'm on Pagefind 1.0.3

bglw commented 11 months ago

Sorry, poor choice of example on my part. I didn't mean anything about the ignore setting itself — that was just an example of getting results containing the hyphenated data-pagefind-ignore when searching for ignore.

Do you have a link to a site this is occurring on? Or a larger file you can share? Testing with just those two words I'm seeing the correct behavior.

File:

<html>
<body>

<h1>Sample Page</h1>

<p>fixnum-log2</p>
<p>uniform-random-float</p>

<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search", showSubResults: true });
    });
</script>
</body>

</html>

Command:

npx pagefind@latest --site "." --serve

Results:

[Two screenshots of the search results, taken 2023-09-20]

mrjbq7 commented 11 months ago

You can try those same queries here:

https://re.factorcode.org/

If you search for the full text examples you get the pages correctly but in particular log2 doesn’t find what fixnum-log2 does.

mrjbq7 commented 11 months ago

I copied just the page in question:

https://re.factorcode.org/2013/02/faster-shuffle.html

I then ran pagefind --site "." --serve using Pagefind 1.0.3, and I can't reproduce the issue that I'm having when that page is part of the full website index.

Is there a way of seeing what indexes are accessed locally when I try and search log2 to see why it can't find fixnum-log2?

mrjbq7 commented 11 months ago

Okay, I can reproduce it.

$ mkdir tmp
$ cd tmp
$ wget https://re.factorcode.org/2013/02/faster-shuffle.html
$ wget https://re.factorcode.org/2011/09/really-big-numbers.html
$ pagefind --verbose --site "." --serve

Search for log2 and it's broken when the second html file is part of the index.

mrjbq7 commented 11 months ago

Here's a smaller test case:

Make an index.html:

<html>
<body>

<h1>Sample Page</h1>

<p>fixnum-log2</p>
<p>uniform-random-float</p>

<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search", showSubResults: true });
    });
</script>
</body>

</html>

Then make a page2.html:

<html>
<body>

<h1>Sample Page 2</h1>

<p>log(2)</p>

<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search", showSubResults: true });
    });
</script>
</body>

</html>

Then run pagefind --verbose --site "." --serve

Search for log2 and it only finds the page2.html example.

mrjbq7 commented 11 months ago

Single page reproduction:

<html>
<body>

<h1>Sample Page</h1>

<p>fixnum-log2</p>
<p>log(2)</p>

<link href="/pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', (event) => {
        new PagefindUI({ element: "#search", showSubResults: true });
    });
</script>
</body>

</html>

Searching for log2 only highlights the second <p>...

bglw commented 11 months ago

Thanks for the small reproduction case 🙏 very very appreciated.

Looking at fixnum-log2 specifically, the issue is that Pagefind is actually splitting beyond what you're after here: it's being split as ["fixnum", "log", "2"], so log2 as a search query isn't directly finding it.

I'll ruminate on this over the weekend and take a look next week when I have a bit more time. Potentially we need to index some permutations of words in some cases, since it isn't clear which strategy is going to be net better.

i.e. in this case, we would want fixnum and log2 — but if we were indexing something like http404 we would want http and 404.
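
To make the split concrete, here is a minimal Python sketch that reproduces the behavior described above. It is not Pagefind's code (that is Rust, built on the convert_case crate). The split_compound name and the ASCII-only regex are assumptions made for illustration, and camelCase handling is left out.

```python
import re

def split_compound(word: str) -> list[str]:
    """Split on punctuation and on every letter/digit boundary (ASCII only)."""
    return re.findall(r"[a-z]+|[0-9]+", word.lower())

print(split_compound("fixnum-log2"))  # ['fixnum', 'log', '2']
print(split_compound("http404"))      # ['http', '404']
```

With only these parts in the index, no stored word starts with log2, which is why that query cannot reach fixnum-log2 directly.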

The other improvement here would be to match some of the word splitting on the search query side, such that searching for random-float also searches for random and float in some capacity. That side of the equation hasn't been touched yet, so it will currently be searching for randomfloat (and finding the random substring at least).

Good things to think about and improve!

mrjbq7 commented 11 months ago

Hi, I also noticed another related example

searching for avx2 seems to return a lot of highlighted “a” results. I would guess that’s not a good result, in addition to being a common stop word.

mrjbq7 commented 11 months ago

See, for example this blog index:

https://blogs.factorcode.org/slava/

When you first search avx2 it says No results for avx2 then when you backspace to avx then retype avx2 it says 466 results for avx2 and highlights lots of a words.

mrjbq7 commented 9 months ago

Hi @bglw have you had any further ideas on this? Would it help if I offered some time to contribute a fix? Maybe you can point me in the direction of where a fix would go.

bglw commented 9 months ago

No new thoughts yet, but if you're happy to have a tinker it would definitely be appreciated! ❤️

Getting off the ground

Pagefind's CONTRIBUTING.md hasn't been thoroughly vetted, but has been updated recently, so it should hopefully contain all steps that are required to get Pagefind compiling.

Tests

Pagefind is largely integration tested, and this is the main place I spec out new implementations before moving on to the code. The compound word tests currently live in characters.feature, but probably make sense to move to something like a compound_words.feature file. (See "Pagefind can search for a hyphenated phrase" and "Punctuated compound words are indexed per word" in the linked file).

A great first step would be writing up some test cases for your ideal behavior in some of the above searches.

To run these tests, from within the pagefind folder:

# Run one test
./test.sh "Pagefind can search for a hyphenated phrase"
# Run all tests
./test.sh

The test script will rebuild the pagefind crate, but not the dependencies from other folders.

Pagefind does also have unit tests in applicable areas, e.g. for splitting words: https://github.com/CloudCannon/pagefind/blob/971186cdd062995c14d2cc2bc711b09c850d086f/pagefind/src/fossick/splitting.rs#L57-L64

These are just a standard cargo test.

Relevant locations when indexing

Most word-related logic lives in the fossick module, which is on the refactor list — thankfully, the word splitting itself has been carved out, and can be found in splitting.rs, which itself uses the convert_case crate which nicely handles the logic for camelCase words: https://github.com/CloudCannon/pagefind/blob/971186cdd062995c14d2cc2bc711b09c850d086f/pagefind/src/fossick/splitting.rs#L11-L17

(Not attached to that crate per se, it was a quicker way to get an implementation out at the time, but something more custom is likely required to meet our goals).

This function is used in the root of the fossick module: https://github.com/CloudCannon/pagefind/blob/971186cdd062995c14d2cc2bc711b09c850d086f/pagefind/src/fossick/mod.rs#L363-L388

NB: get_discrete_words isn't the most optimized. Ideally it would return a vector or iterator of word parts, rather than a string that gets re-split. Didn't wind up being super impactful in real performance, so it has yet to be looked at.

These two files would be the ideal locations to make changes like indexing fixnum-log2 as fixnum, log, 2, log2.
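
One possible shape for that change, sketched in Python rather than Rust: index each punctuation-delimited segment as well as its letter/digit parts. The index_terms name and the regexes are hypothetical and ASCII-only. This is one candidate strategy under those assumptions, not the approach the project settled on.

```python
import re

def index_terms(word: str) -> list[str]:
    """Return the terms to index for one source word, in first-seen order."""
    terms: list[str] = []
    # First split on punctuation only: "fixnum-log2" -> "fixnum", "log2".
    for segment in re.findall(r"[a-z0-9]+", word.lower()):
        # Then split each segment on letter/digit boundaries: "log2" -> "log", "2".
        parts = re.findall(r"[a-z]+|[0-9]+", segment)
        # Keep the fine-grained parts and the unsplit segment, without duplicates.
        for term in [*parts, segment]:
            if term not in terms:
                terms.append(term)
    return terms

print(index_terms("fixnum-log2"))  # ['fixnum', 'log', '2', 'log2']
print(index_terms("http404"))      # ['http', '404', 'http404']
```

Indexing both granularities covers the log2 case and the http404 case at the cost of a larger index, which is the trade-off that would need measuring.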

Relevant locations when searching

I haven't yet thought of the best location to implement any changes here (if required [?]) — so I'll just walk through the meaningful landmarks that a search query touches. Hopefully that helps with direction!

The first place that search normalization happens is in coupled_search.ts: https://github.com/CloudCannon/pagefind/blob/971186cdd062995c14d2cc2bc711b09c850d086f/pagefind_web_js/lib/coupled_search.ts#L405-L408

As such, the WASM query logic doesn't currently see the punctuation — i.e. fixnum-log2 will be given to Pagefind's engine as fixnumlog2. This will need to be addressed (as the TODO implies 😅)
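
The difference between deleting punctuation and treating it as a word boundary can be sketched as follows. This is illustrative Python, not the TypeScript in coupled_search.ts. Both function names and the regex are assumptions.

```python
import re

def strip_punctuation(query: str) -> str:
    """Drop everything that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", query.lower())

def split_on_punctuation(query: str) -> list[str]:
    """Treat punctuation as a word boundary instead of deleting it."""
    return re.sub(r"[^\w\s]", " ", query.lower()).split()

print(strip_punctuation("fixnum-log2"))      # fixnumlog2
print(split_on_punctuation("random-float"))  # ['random', 'float']
```

The first function mirrors the behavior described above, where the hyphen vanishes before the engine sees the query. The second keeps the boundary so each part can be looked up separately.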

The next place query handling happens is in this function: https://github.com/CloudCannon/pagefind/blob/971186cdd062995c14d2cc2bc711b09c850d086f/pagefind_web/src/search.rs#L377-L390

This is what gives us each word we're querying. The stemming here can be largely ignored, since the same logic happens when indexing. (So running will be stored in the index as run, this function ensures that searching for running actually searches for run).

From there, we take each word that spits out and find a relevant set of pages to intersect our query with: https://github.com/CloudCannon/pagefind/blob/971186cdd062995c14d2cc2bc711b09c850d086f/pagefind_web/src/search.rs#L188-L192

And possibly the most relevant function for you to see is the find_word_extensions function that "finds the relevant set of pages": https://github.com/CloudCannon/pagefind/blob/971186cdd062995c14d2cc2bc711b09c850d086f/pagefind_web/src/search.rs#L347-L374

For a given search term fix, this will return every word index we have that starts with fix. So you'll get back the index for fix, fixnum, fixture, et cetera. (Later on, Pagefind's ranking helps to retain the priority of the more exact fix match).

This function also (intentionally) causes the other behavior you were seeing. The word avx2 enters this function, and finds no indexes that match avx*. Instead, the longest matching prefix is returned, which in your case was just the index for the word a. (The general concept is intentional, at least).
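
A rough Python model of that lookup, under stated assumptions: the index is a plain dict from word to a set of page ids, and the fallback picks the longest indexed word that is itself a prefix of the search term. The real function is Rust in pagefind_web/src/search.rs and may differ in detail. Only the name find_word_extensions comes from the source.

```python
def find_word_extensions(term: str, index: dict[str, set[int]]) -> dict[str, set[int]]:
    """Return index entries that extend `term`, else the longest-prefix fallback."""
    # Normal case: every indexed word that starts with the search term.
    matches = {word: pages for word, pages in index.items() if word.startswith(term)}
    if matches:
        return matches
    # Fallback: the longest indexed word that the term itself starts with.
    prefixes = [word for word in index if term.startswith(word)]
    if not prefixes:
        return {}
    longest = max(prefixes, key=len)
    return {longest: index[longest]}

# Hypothetical toy index: word -> ids of the pages containing it.
index = {"a": {1, 2}, "fix": {1}, "fixnum": {2}, "fixture": {3}}
print(sorted(find_word_extensions("fix", index)))  # ['fix', 'fixnum', 'fixture']
print(find_word_extensions("avx2", index))         # {'a': {1, 2}}
```

In this toy index nothing starts with avx2, so the lookup falls back to the entry for a, matching the highlighted "a" results reported earlier in the thread.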

Final thoughts

Firstly, no pressure on digging into this! Even just writing this out is a great refresher for myself if I pick this up later 🙂

If you do, feel free to fire through any other questions here. I'm more than happy to help out or pick up any partial solutions.

(NB: This message is getting long, so I didn't address "When you first search avx2 it says No results for avx2 then when you backspace to avx then retype avx2 it says..." — this code is the root of the explanation, but I don't think it's as important to this issue)

mrjbq7 commented 8 months ago

Great information, thanks. I've got the code building locally and added a test case to show the failure, and I'm thinking about some of the various ways to address it. The holidays are a smidge busy but I'll poke at this issue. Not offended if you end up doing it before me, but I hope to also get a patch to you.