CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License
3.22k stars, 97 forks

Deduplicate results #566

Open nhoizey opened 4 months ago

nhoizey commented 4 months ago

It would be great if we could deduplicate results, for sites where the same content can be present on different pages.

This situation already requires a canonical URL for SEO (which can be passed to Pagefind with data-pagefind-meta="url[href]"), so maybe a boolean option to use the result URL as a deduplication key would be enough.

nhoizey commented 4 months ago

You can for example search for “animal” on https://nicolas-hoizey.photo/search/ and see multiple identical results.

For example, the photo “A storm is coming” is available in 3 different galleries:

They all have the same canonical URL: https://nicolas-hoizey.photo/photos/a-storm-is-coming/ (which I configured for Pagefind, but maybe I shouldn't until it's possible to deduplicate results).

bglw commented 3 months ago

Interesting! I think it would be fine for Pagefind to deduplicate these by default based on their url.

What would you expect regarding the content for these? If you tag three pages with the same url, but they have different content, what should be shown in titles and excerpts? (and what should be indexed for search?) 🤔

nhoizey commented 3 months ago

@bglw in my specific case, title and content are the same anyway, which feels right because they share the canonical URL.

The only differences are:

So the first item with the URL can be used.

But there might be other use cases where the choice would be different, so maybe this could be a set of values:

There might be other values in the future, which an enumeration easily allows.
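
Until something like this exists in Pagefind itself, the "first item with the URL" behaviour described above can be approximated on the client. The following is a minimal sketch, not Pagefind's implementation: `dedupeByUrl` is a hypothetical helper, and it only assumes that each loaded result object carries a `url` field.

```typescript
// Hypothetical helper, not part of Pagefind: keep only the first result
// seen for each URL, preserving the original ranking order.
function dedupeByUrl<T extends { url: string }>(results: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const result of results) {
    // A later result with an already-seen URL is treated as a duplicate.
    if (seen.has(result.url)) continue;
    seen.add(result.url);
    unique.push(result);
  }
  return unique;
}
```

With Pagefind's JavaScript API, this would presumably be applied after loading the result data, for example on the output of `await Promise.all(search.results.map(r => r.data()))`. The trade-off is that duplicates are still indexed and fetched, so this hides the symptom without reducing bandwidth.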

brokenalarms commented 6 hours ago

Came here searching for this. Unless I'm missing something, the search doesn't seem usable without deduplication. Even if it's possible for the same article to show up once for each tag, it then also shows up a further 3 times under 'tags'? 🤔

tbh I would just expect a list of de-duped matching articles with the tags added as labels on them.
