Algolia search hits contain unwanted #docusaurus_skipToContent_fallback anchor

jnikhila commented 1 year ago

Have you read the Contributing Guidelines on issues?

[X] I have read the Contributing Guidelines on issues.

Description

The purpose of #docusaurus_skipToContent_fallback in Docusaurus search results is not currently documented on the Docusaurus website, which can lead to confusion for some users. While it is assumed that the behavior is intended to improve accessibility for users who rely on assistive technology, there is no clear explanation of this feature's functionality.

To address this issue, I recommend adding a mention of #docusaurus_skipToContent_fallback to the Docusaurus documentation.

Self-service

[ ] I'd be willing to address this documentation request myself.

Josh-Cena commented 1 year ago

👋 Anything not documented is considered implementation detail, which is the case here. Why would you specifically want documentation for this?

jnikhila commented 1 year ago

@Josh-Cena - Thank you for your prompt response. While I understand that undocumented features are considered implementation details, having a non-functional anchor in the documentation can be confusing for users, especially when they use the search bar to find relevant information.

Our internal teams also rely on the search functionality to share relevant documentation with end users, and having inaccurate or non-functional anchors can hinder this process. We also have a request raised by our internal team to remove the anchor. However, I wanted to ensure that this would not have any negative repercussions before proceeding. As I'm unable to find any relevant information on Docusaurus documentation, I thought it would be best to consult with you. I would greatly appreciate your help in this matter.

slorber commented 1 year ago

@jnikhila I don't understand what you mean by

having inaccurate or non-functional anchors can hinder this process.

What's the behavior that you experience exactly? Show me a very concrete example, record a video or something demonstrating the problem.

We also have a request raised by our internal team to remove the anchor. However, I wanted to ensure that this would not have any negative repercussions before proceeding.

You can do whatever you want, including implementing your own theme from scratch, but you are on your own then. You become responsible from your customizations that diverge from our official code, and we cannot support you doing so.

As I'm unable to find any relevant information on Docusaurus documentation, I thought it would be best to consult with you.

We tried to make it work automatically out of the box for official plugins but also third-party ones.

The skip to content button will navigate in priority to the first <main> tag, and otherwise will fallback to an element (the Docusaurus layout content wrapper).

/**
 * The id of the element that should become focused on a page
 * that does not have a <main> html tag.
 * Focusing the Docusaurus Layout children is a reasonable fallback.
 */
export const SkipToContentFallbackId = 'docusaurus_skipToContent_fallback';

/**
 * Returns the skip to content element to focus when the link is clicked.
 */
function getSkipToContentTarget(): HTMLElement | null {
  return (
    // Try to focus the <main> in priority
    // Note: this will only work if JS is enabled
    // See https://github.com/facebook/docusaurus/issues/6411#issuecomment-1284136069
    document.querySelector('main:first-of-type') ??
    // Then try to focus the fallback element (usually the Layout children)
    document.getElementById(SkipToContentFallbackId)
  );
}

Going to close because it's more a question about implementation details rather than something we should fix or document.

If you want to know more about implementation details and customize behaviors, you should read the source code.

jnikhila commented 1 year ago

@slorber - Thank you for your response. I want to clarify that we have not made any customizations to the default behavior of Docusaurus. We are using v 2.4.0, using Algolia DocSearch for search capabilities. Despite this, we have noticed that an anchor tag (#docusaurus_skipToContent_fallback) gets appended to the URL whenever a user selects a search result and navigates to that page.

sample URL - https://sample-docs.com/docs/getting-started/installation**#docusaurus_skipToContent_fallback**

I searched Docusaurus documentation and have not found any information about this behavior. That's when I raised a query request to understand if this was missing from the documentation and if there were any possible ways to remove the anchor tag getting appended to the URL.

Just to clarify, I understand that the ticket was closed because it's a question. However, I appreciate your response and the information provided. I do not want to reopen the ticket and would like to keep it closed. Thank you for your help.

slorber commented 1 year ago

@jnikhila sample-docs.com is not a real domain so I can't double check the behavior, and can't reproduce either on our own website at https://docusaurus.io/

Give me a real production URL, record a video of the problem, a repro, or anything else that is "concrete". Not just text, something that I can run.

Otherwise, it's not actionable from my side.

jnikhila commented 1 year ago

@slorber - Thank you for your response.

The issue can be seen on our website.

Not sure if this background detail could help you: Before upgrading to Docusaurus v2.4.0, we did not encounter the issue. We're on version 2.1.0 and upgraded to use the queryString attribute for Tabs and that's when we started to see this happening for search results.

slorber commented 1 year ago

@jnikhila

I want to clarify that we have not made any customizations to the default behavior of Docusaurus.

Your site clearly shows customizations. Maybe revert these customizations until you find the reason. Docusaurus 2.4 does not have this bug by default as far as I know.

If you have swizzled components, this is likely the cause.

slorber commented 1 year ago

Thanks, as your site is open-source (https://github.com/appsmithorg/appsmith-docs/blob/main/website) I was able to inspect and wasn't able to find customizations that could cause such a problem.

It turns out that it is indeed a bug, that I can reproduce on our own website in some edge cases:

CleanShot 2023-04-14 at 15 14 48@2x

The Algola search hit already contains the id:

I am still unable to understand why this happens to certain search hits and not all of them. Among results not linking to a specific heading, we can find some hits with/without that unwanted anchor.

In our case, this seems to happen on Markdown pages, like this one: https://docusaurus.io/examples/markdownPageExample#docusaurus_skipToContent_fallback

We probably need to upgrade our Algolia index configuration to ensure the anchor is never added to such search hits.

@shortcuts any idea what is happening here?

slorber commented 1 year ago

@jnikhila it's worth posting your Algolia crawler config here too.

Maybe it's not up to date with the one we recommend? https://docusaurus.io/docs/search#algolia-index-configuration

I feel like the behavior is different on your website compared to the Docusaurus website.

jnikhila commented 1 year ago

@slorber - We are not using the Algolia Crawler for updating our index.

We are still using the DocSearch Scrapper for updating it. We created the configuration file using the configuration available at the Algolia Github repo for DocSearch. Here's a link to the configuration file on our side.

We had earlier amended the config to add additional info to be indexed for searches like error messages. I realized we had removed the "url_without_anchor" attribute from the "attributesToRetrieve" array. I did a quick fix and verified it. Unfortunately, that has not fixed the issue.

As you pointed out, the issue only occurs for some of the Urls related to markdown files at your end. It makes sense that it is happening for all our doc Urls, as we use markdown files for every doc.

shortcuts commented 1 year ago

Hey @slorber, @jnikhila!

The anchor attached to the hit is not related to the configuration but how we extract the data with the crawler (and scrapper) directly. Just to give you a bit of context, we try to get the closest anchor by finding a parent (or a sibling if no parent are found) with an anchor so we can leverage it in the search modal.

Here my guess is that we don't find any relevant anchor, and end up on the skipToContent one.

Should the crawler (and scraper) have found a more relevant anchor? If so, then it's a bug on our side, otherwise we can just tell our crawler to ignore them for Docusaurus websites if you want.

shortcuts commented 1 year ago

@slorber - We are not using the Algolia Crawler for updating our index.

as you may know, the DocSearch scraper have been archived, so it's normal that it differs from the Docusaurus results as they don't leverage the same crawler.

you can however stil create a fork and update the anchor guessing strategy, which is available here: https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/strategies/anchor.py#L16

slorber commented 1 year ago

Thanks @shortcuts !

Should the crawler (and scraper) have found a more relevant anchor? If so, then it's a bug on our side, otherwise we can just tell our crawler to ignore them for Docusaurus websites if you want.

What I don't understand in our own website case is how the crawler decides or not to use an id.

No id (good):

Search: https://docusaurus.io/search?q=Happy-First-Birthday-Slash
First result (good): https://docusaurus.io/blog/2018/12/14/Happy-First-Birthday-Slash

CleanShot 2023-04-19 at 17 13 04@2x

Bad id

This is the only case I was able to find in our own website:

Search: https://docusaurus.io/search?q=do
First result (bad): https://docusaurus.io/examples/markdownPageExample#docusaurus_skipToContent_fallback

CleanShot 2023-04-19 at 17 32 22@2x

It's not clear to me why the last case Algolia uses an id and not in all the other cases, any idea?

Both pages have quite similar structures:

skip to content link
skip to content fallback anchor
<main> section, more or less nested

slorber commented 1 year ago

Does it make sense that we prefix this fallback anchor with __ ?

Looks like a convention you are already using in the older crawler, wonder where it comes from.

shortcuts commented 1 year ago

It's not clear to me why the last case Algolia uses an id and not in all the other cases, any idea?

Both pages have quite similar structures:

skip to content link skip to content fallback anchor

section, more or less nested

We don't go all the way up in the tree, we stop at some point (I need to deep dive in it again to give you more details), but IIRC we stop searching after something 5 levels around the current element (arbitrary number of course) because otherwise we might end-up with non-relevant anchors

Does it make sense that we prefix this fallback anchor with __ ?

Looks like a convention you are already using in the older crawler, wonder where it comes from.

If you think there might be other anchors we don't want to index later on, it's prefix it indeed, otherwise we can just ignore it internally.

slorber commented 1 year ago

If you think there might be other anchors we don't want to index later on, it's prefix it indeed, otherwise we can just ignore it internally.

There are some other anchors but only this one is IMHO likely to cause troubles.

Will add the __prefix to all of them anyway just in case.

As far as I understand you probably choose to ignore anchors with that prefix for search because many React apps are using an id at the root (including us) such as __docusaurus, __next, ___gatsby already.

Let's keep that convention then. Other search crawlers can also respect it if it's already in use.

shortcuts commented 1 year ago

Just pushed a fix on our side, will deploy it tomorrow in prod!

slorber commented 1 year ago

Thanks, also added the __ prefix, it's deploying to prod now. If you run the crawler in less than 2 hours it should already fix prod.

We'll never know if your fix of tomorrow also works 😅

jnikhila commented 1 year ago

as you may know, the DocSearch scraper have been archived, so it's normal that it differs from the Docusaurus results as they don't leverage the same crawler. you can however stil create a fork and update the anchor guessing strategy, which is available here: https://github.com/algolia/docsearch-scraper/blob/master/scraper/src/strategies/anchor.py#L16

@slorber and @shortcuts - Do we need to make changes to the scrapper at our end to fix this?

shortcuts commented 1 year ago

@slorber deployed to prod, it should now properly ignore anchors starting with __

@jnikhila once the next version of Docusaurus is released it should be good on your side

jnikhila commented 1 year ago

Thank you, @shortcuts and @slorber helping here.

slorber commented 1 year ago

Thanks @shortcuts

I confirm after running the crawler on our site that the problem is fixed:

CleanShot 2023-04-21 at 11 26 19@2x

facebook / docusaurus