CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Search performance on very large sites #49

Open apkd opened 2 years ago

apkd commented 2 years ago

Hey guys, I've been checking out Pagefind and it works pretty great! I integrated it with my tiny website and it works pretty much flawlessly. The installation was super easy and the UI is slick, fast and lightweight. Lots of fun.

I went ahead and did some testing with a bigger site. I grabbed the largest static website that came to mind, which is the entirety of the Unity manual. It's actually an excellent comparison case because the docs (both the online and offline version) also use client-side search (that needs to grab an enormous ~10MB index).

The bundle is generated very quickly given how many pages there are to index.

$ C:\src\UnityDocumentation\node_modules\.bin\pagefind --source en
Running Pagefind v0.6.1
Running from: "C:\\src\\UnityDocumentation"
Source:       "en"
Bundle Directory:  "_pagefind"
Walking source directory...
Building search indexes...
Did not find a data-pagefind-body element on the site.
↳ Indexing all <body> elements on the site.
Indexed 29977 pages
Indexed 99142 words
Indexed 0 filters
Created 225 index chunks
Finished in 41.730 seconds
Done in 42.01s.

I deployed a Pagefind version of the docs search here on GitHub Pages and played around with it a little. The amount of data that Pagefind needs to transfer is much smaller, so the core concept definitely works great. However, there are a couple of common cases where the search engine freezes up quite heavily.

For example, here's a Firefox profile of a search for the letter "a". The search engine seems to freeze up for a couple of seconds (on a pretty fast desktop machine). Obviously, a lot of search queries will start with that, so it's a bit of an issue. Other degenerate queries include an, the, c, t, time, etc.: anything short enough to generate lots of hits. (Again, none of this impacts me currently at all, but it could impact someone eventually...)

Same thing if the user types a special character like space, dot or comma - the engine seems to generate tens of thousands of hits for that. I don't think it's very useful that a query like this generates any hits at all, although maybe this is more of a pagefind-ui issue, not sure.

Finally, I wanted to link the index back to the original documentation website at docs.unity3d.com. Sadly, I was unable to set this up. As I understand it, this could eventually be solved by #17, although that seems like a solution to a much bigger problem than mine. If I understand correctly, all I really need is to be able to separate the bundlePath from the URL appended to the hit result link. This could probably be achieved by modifying pagefind-ui, but it would be nice to have a built-in way to handle this.

Take a look at the test page if you want to play with it: https://apkd.github.io/pagefind-benchmark

Thanks!

bglw commented 2 years ago

Huzzah! This is amazing, thank you, having a giant test case on hand is a valuable resource.

The search engine seems to freeze up for a couple of seconds

There's some low-hanging fruit here that I'm aware of, and I'll dig further into a profile on a debug build to identify any other quick wins here. I also have loose plans to move some of the logic into a web worker so that the main thread doesn't lock up in any case.

Same thing if the user types a special character like space, dot or comma - the engine seems to generate tens of thousands of hits for that

That's a wee bug that slipped through release: it's actually normalizing that character out and returning every page that it currently has loaded. I'll make sure to fix that ASAP; it should return no results.
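
The intended behavior described here (a query that normalizes to nothing should return no results) could be sketched roughly as below. This is only an illustration and not Pagefind's actual code: `normalizeQuery`, `searchPages`, and the in-memory list of page texts are hypothetical names made up for the example, and the normalization is deliberately ASCII-only to keep it short.

```typescript
// Hypothetical sketch: not Pagefind's real normalization or search code.

// Lowercase the query and drop everything except ASCII letters, digits, and spaces.
function normalizeQuery(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Return the pages whose text contains every normalized term.
// A query that normalizes to the empty string yields no results,
// instead of matching every loaded page.
function searchPages(pages: string[], raw: string): string[] {
  const query = normalizeQuery(raw);
  if (query === "") {
    return [];
  }
  const terms = query.split(" ");
  return pages.filter((page) => {
    const text = page.toLowerCase();
    return terms.every((term) => text.includes(term));
  });
}
```

With this guard in place, queries such as a lone space, dot, or comma short-circuit before any matching work is done.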

Finally, I wanted to link the index back to the original documentation website at docs.unity3d.com

If I'm following you correctly, I think that setting the baseURL option is what you're looking for? A baseURL: "https://docs.unity3d.com/" should do what you need.


Thanks again, I'll let you know as I make progress on the performance improvements 🙂

apkd commented 2 years ago

Nice!

If I'm following you correctly, I think that setting the baseURL option is what you're looking for? A baseURL: "https://docs.unity3d.com/" should do what you need.

I think I mixed up baseURL for bundlePath there. I did try a couple of settings but never got it to work. For example:

new PagefindUI({element: "#search-unity-docs", bundlePath: "/assets/pagefind-unity-docs/", baseUrl: "https://docs.unity3d.com/", showImages: false});

Causes links to look like this:

<a class="pagefind-ui__result-link svelte-j9e30" href="/https:/docs.unity3d.com/ScriptReference\AssetDatabase.FindAssets.html">AssetDatabase.FindAssets</a>

Which the browser then interprets as a link relative to the site's address: http://127.0.0.1:4000/https://docs.unity3d.com/...
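
For illustration, the join that would produce the expected link can be sketched as below. This is an assumption about the desired behavior, not the code in pagefind-ui: `joinResultUrl` is a hypothetical helper. It keeps an absolute base URL intact rather than treating it as a path segment, and it converts the backslashes that a Windows-built index may contain.

```typescript
// Hypothetical helper: not the actual pagefind-ui implementation.
function joinResultUrl(baseUrl: string, pagePath: string): string {
  // Paths indexed on Windows may contain backslashes; URLs need forward slashes.
  const path = pagePath.replace(/\\/g, "/").replace(/^\/+/, "");
  const trimmed = baseUrl.replace(/\/+$/, "");

  // A base with a scheme (e.g. "https://") is absolute and must not get a leading "/".
  const isAbsolute = /^[a-z][a-z0-9+.-]*:\/\//i.test(baseUrl);
  if (isAbsolute) {
    return `${trimmed}/${path}`;
  }

  // Otherwise treat the base as a site-relative prefix such as "/docs/".
  const prefix = trimmed.replace(/^\/+/, "");
  return prefix === "" ? `/${path}` : `/${prefix}/${path}`;
}
```

Called with the configuration above and the path from the broken link, `joinResultUrl("https://docs.unity3d.com/", "ScriptReference\\AssetDatabase.FindAssets.html")` yields a link that points at docs.unity3d.com rather than at the local site.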

bglw commented 2 years ago

Oh yep, that looks like a bug. I'll fix that in the next release. That baseUrl configuration will be correct.

bglw commented 2 years ago

Hi @apkd, a few updates from v0.8.0:

Improving the giant site performance is still on my list. I did make a small change to the WebAssembly in this release that theoretically improved search performance, but I don't believe to a significant degree (I haven't tested that aspect yet). Larger speed boosts still to come 🙂

raffomania commented 1 year ago

Hey, I've deployed a large site (currently about 180k pages indexed) at archive.observer. I can confirm that searching for small words like 'a' or 'the' makes the whole page freeze for a while.

Another thing with a site of this size is the time it takes to generate the index. Do you see any low hanging fruit for optimizing that? I see that pagefind is maxing out only one core during parsing, maybe a sprinkle of rayon could help? :)

Thanks for your amazing work, pagefind is a really well-polished package and a delight to work with.

bglw commented 1 year ago

Hey @raffomania 👋

Glad to hear that it works well enough! That's a lot of content 😅 I can't seem to load that URL, I'm getting a 404. Is that just a me problem?

I think the lowest hanging fruit to improve this would be to use a web worker to reduce the blocking behavior of the search. I can look at this in the near future 🤔

Improving indexing speed might be a little harder: it does (should) multithread in the crawling + parsing stage, but then collapses down to one thread to assemble the final index. I'm sure there are many smart ways this can be improved but I don't think they're low hanging 😔

raffomania commented 1 year ago

Whoops, my provider had an outage this morning that caused some havoc, it should load now :)

A web worker would certainly be good to have in any case. Maybe there's another way to handle short search strings like "a" that yield a disproportionate number of results, I'm not sure.

Regarding the indexing, I think it's already really fast :)

bglw commented 1 year ago

Maybe there's another way to handle short search strings like "a" that yield a disproportionate number of results

There certainly is! I should find a good way to benchmark the web crate. Off the top of my head, it currently finds all results and also calculates the densest excerpt of matched words up front for all results. I'm impressed-slash-surprised it does so in any reasonable time when returning "168,931 results for "a"" 😰

Depending on where the bulk of that blocking time lies, offloading that excerpt calculation to either the JS or a subsequent WASM call might help the situation.
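
As a rough sketch of that idea, the excerpt work can be deferred until a result is actually displayed. Everything here is hypothetical and not taken from the Pagefind crate: the `PageMatch` shape, `densestWindowStart`, and `toLazyResults` are made-up names. The sketch assumes each match carries a sorted list of word positions, and uses a sliding window to find the densest excerpt.

```typescript
// Hypothetical data shape: word positions at which query terms matched on a page.
interface PageMatch {
  url: string;
  positions: number[]; // sorted ascending
}

// Return the word position where the densest window of `windowLen` words starts.
// Sliding window over the sorted positions: O(n) per page.
function densestWindowStart(positions: number[], windowLen: number): number {
  let bestStart = 0;
  let bestCount = 0;
  let left = 0;
  for (let right = 0; right < positions.length; right++) {
    // Advance the left edge until the window spans fewer than `windowLen` words.
    while (positions[right] - positions[left] >= windowLen) {
      left++;
    }
    const count = right - left + 1;
    if (count > bestCount) {
      bestCount = count;
      bestStart = positions[left];
    }
  }
  return bestStart;
}

interface LazyResult {
  url: string;
  excerptStart: () => number; // computed on first call, then cached
}

// Wrap matches so that no excerpt work happens until a result is requested.
function toLazyResults(matches: PageMatch[], windowLen: number): LazyResult[] {
  return matches.map((match) => {
    let cached: number | undefined;
    return {
      url: match.url,
      excerptStart: () => {
        if (cached === undefined) {
          cached = densestWindowStart(match.positions, windowLen);
        }
        return cached;
      },
    };
  });
}
```

Under this arrangement, a query with 168,931 hits only pays the excerpt cost for the handful of results that are rendered.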

There are possibly some other improvements to be discovered, as well. I'll try to set aside some time and dig into it šŸ™‚

Regarding the indexing, I think it's already really fast :)

Nice! Out of interest, how long is that taking?

raffomania commented 1 year ago

That really is impressive! Sound engineering, I assume :)

Here's pagefind running for archive.observer on a Ryzen 7 pro 4750U:

time pagefind

Running Pagefind v0.12.0
Running from: "/home/rafael/workspace/web/aharc"
Source:       "output"
Bundle Directory:  "pagefind"

[Walking source directory]
Found 181360 files matching posts/*.html

[Parsing files]
Did not find a data-pagefind-body element on the site.
↳ Indexing all <body> elements on the site.

[Reading languages]
Discovered 1 language: en

[Building search indexes]
Total:
  Indexed 1 language
  Indexed 181360 pages
  Indexed 795634 words
  Indexed 0 filters
  Indexed 0 sorts

Finished in 470.682 seconds
pagefind  570.72s user 268.72s system 178% cpu 7:51.37 total

bglw commented 1 year ago

Hi all 👋

Some things changed recently that should greatly improve performance here, at least I hope!

A good demo can be found here: https://mdn.pagefind.app/

Testing on the above, a search for "a" feels performant 🙌

I'll leave this issue open for now until I get around to testing even larger sites. @apkd, if you still have that Unity manual sitting around I'd love it if you could give 1.0 a try and see whether it's noticeably faster ❤️

apkd commented 1 year ago

Hey, I updated the Unity manual demo here. It still feels a bit stuttery (e.g. on iOS, where even the keyboard haptics actually stop responding for a second). This usually happens when typing queries starting with letters that generate particularly many hits.

haydenflinner commented 1 year ago

Is it plausible to instead interrupt the work if it's taken longer than a second or so, show what's found so far with a little disclaimer that there are a ton of results? Doesn't seem likely to get too many good results from "a" etc. anyway.

bglw commented 1 year ago

@apkd thanks for updating that! 🙏 Yes, still room to improve. I don't recall how it was behaving but hopefully it's somewhat better now 😅

Is it plausible to instead interrupt the work if it's taken longer than a second or so, show what's found so far

Interesting thought! Something in that direction might be possible, though the JS API would need to reflect these partial results somehow. I'm hopeful that we can get the total performance good enough to not need that for the sites in this issue, which is probably around the ceiling of sites that I would expect Pagefind to be useful for.
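
To make the partial-results idea concrete, here is one hypothetical shape such an API could take. None of this exists in Pagefind's JS API: `searchWithBudget`, the `partial` flag, and the injectable `now` clock are assumptions made for the sketch. The loop stops once a time budget is exhausted and reports whether the result set is complete.

```typescript
// Hypothetical result shape: `partial` tells the caller that the search was cut short.
interface PartialResults<T> {
  results: T[];
  partial: boolean;
}

// Scan candidates until done or until `budgetMs` has elapsed.
// The clock is injectable so the behavior can be tested deterministically.
function searchWithBudget<T>(
  candidates: T[],
  matches: (candidate: T) => boolean,
  budgetMs: number,
  now: () => number = Date.now,
): PartialResults<T> {
  const deadline = now() + budgetMs;
  const results: T[] = [];
  for (let i = 0; i < candidates.length; i++) {
    // Reading the clock on every item would be wasteful, so check every 256 items.
    if (i % 256 === 0 && now() >= deadline) {
      return { results, partial: true };
    }
    const candidate = candidates[i];
    if (matches(candidate)) {
      results.push(candidate);
    }
  }
  return { results, partial: false };
}
```

A UI could then render `results` as usual and, when `partial` is true, add a short note that more matches exist for a more specific query.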