CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Idea: should skip `[hidden]` by default to reflect crawlers' behavior #580

Closed by JulianCataldo 3 months ago

JulianCataldo commented 3 months ago

Hello,

Pagefind has built-in elements that are not indexed. These are organizational elements such as <nav> and <footer>, or more programmatic elements such as <script> and <form>. These elements will be skipped over automatically.

I know I could just use --exclude-selectors, but I think Google and other search engine crawlers already ignore hidden content ([hidden] and display: none) in their search results.

With Pagefind, [hidden] should be equivalent to [data-pagefind-ignore] IMHO.

What do you guys think?

Thanks again for maintaining this great tool!

bglw commented 3 months ago

Hey @JulianCataldo 👋

I've just done a test of this, and Google does not respect the [hidden] attribute, so Pagefind currently has parity with Google's indexing.

Here's a test site: https://testing-url.com/

On the page you'll see "Pagefind test content that is not hidden", and below it you won't see "Pagefind test content that is marked as hidden":

<p class="subtext">Pagefind test content that is not hidden</p>

<p class="subtext" hidden>Pagefind test content that is marked as hidden</p>

Indexing this with Google, I can search for the hidden text, and the hidden text is also used in Google's description:

[Screenshot: Google search result, 2024-03-22]

You should be able to reproduce this by searching for "Pagefind test content that is marked as hidden" in quotes on Google. (Other engines haven't indexed this page yet).

Given this, I'm happy with how Pagefind is treating this content by default. But thanks for the issue! It was good to walk through and validate 🙂
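To make the difference between the current default and the proposed one concrete, here is a toy sketch in Python. It is not how Pagefind itself extracts text; the `VisibleTextExtractor` class and `indexed_text` helper are hypothetical names made up for illustration, and the sketch assumes well-formed HTML with explicit closing tags. It only models one rule: skip any subtree whose root element carries one of the listed attributes.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Toy model (hypothetical, not Pagefind's code): collect text,
    skipping any subtree whose root has one of `skip_attrs`."""

    def __init__(self, skip_attrs):
        super().__init__()
        self.skip_attrs = set(skip_attrs)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_names = {name for name, _ in attrs}
        # Either we are already inside a skipped subtree, or this element starts one.
        if self.skip_depth or attr_names & self.skip_attrs:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def indexed_text(html, skip_attrs):
    """Return the text chunks that would be kept under the given skip rule."""
    parser = VisibleTextExtractor(skip_attrs)
    parser.feed(html)
    parser.close()
    return parser.chunks


# The two paragraphs from the test page above.
SAMPLE = """
<p class="subtext">Pagefind test content that is not hidden</p>
<p class="subtext" hidden>Pagefind test content that is marked as hidden</p>
"""

# Current default in this model: only data-pagefind-ignore is skipped.
current = indexed_text(SAMPLE, {"data-pagefind-ignore"})
# Proposed default: [hidden] is treated like data-pagefind-ignore.
proposed = indexed_text(SAMPLE, {"data-pagefind-ignore", "hidden"})

print(current)
print(proposed)
```

In this model the current default keeps both paragraphs, which matches what Google indexed on the test page, while the proposed default keeps only the first. Sites that want the second behaviour can still opt in with --exclude-selectors, as mentioned at the top of the issue.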

JulianCataldo commented 3 months ago

Very interesting walk-through. That testing-url.com site will come in handy, thanks. I wasn't 100% sure that Google was ignoring [hidden] content.

Indeed, it looks like hidden content (not necessarily marked with the "hidden" attribute), such as content inside an accordion or tooltip, is "de-prioritized" by Google rather than ignored entirely. So it's good to know anyway.

Hiding content within tabs, accordions, or other elements that rely on JavaScript to reveal it to users is likely to be treated differently by Google and assigned far less importance. Website owners must take a considered approach and use this method only to hide content that is of secondary importance to the primary topic of the page, or that covers related topics.

source