Markdown escape characters break indexing (with underscores)

Zamiell commented 1 year ago

Have you read the Contributing Guidelines on issues?

[X] I have read the Contributing Guidelines on issues.

Prerequisites

[X] I'm using the latest version of Docusaurus.
[X] I have tried the npm run clear or yarn clear command.
[X] I have tried rm -rf node_modules yarn.lock package-lock.json and re-installing packages.
[X] I have tried creating a repro with https://new.docusaurus.io.
[X] I have read the console error message carefully (if applicable).

Description

I have some Markdown content like this:

### TEAR\_FALLING\_ACCELERATION

• **TEAR\_FALLING\_ACCELERATION** = ``5``

Corresponds to `CacheFlag.RANGE` (1 << 3) and `EntityPlayer.TearFallingAcceleration`.

For reference, this content was generated by TypeDoc, a popular documentation generation tool. TypeDoc puts escape characters before underscores, because underscores have semantic meaning in Markdown - they transform the text to be either bold or italic. So I would consider this escaping behavior to be "correct" from TypeDoc.

I feed this Markdown content to Docusaurus, and it creates a website for me. The resulting HTML looks like this:

This is strange, and appears to be a bug. I would naively expect the element to simply contain TEAR_FALLING_ACCELERATION.

Presumably, this behavior is an artifact of the escape characters. Visually, the webpage looks fine, because the end user cannot tell that the text is not actually contiguous. However, when the Algolia/Typesense scraper crawls the website, it chokes on this content and cannot index it properly. Thus, when a user searches for "TEAR_FALLING_ACCELERATION", there are no matches, because all the indexer saw was "TEAR", "FALLING", and "ACCELERATION".

As previously mentioned, underscores carry special semantic meaning in Markdown content, but they do not carry any special semantic meaning in HTML content. Thus, I suspect that Docusaurus is doing too much here. Instead of breaking up the content into multiple tokens, it should be able to simply see that there is an unnecessary escape before an underscore, and then remove it.
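To make the suggestion concrete, here is a minimal sketch of such an unescaping step, written as a standalone pre-processing function. The name `unescapeIntrawordUnderscores` is hypothetical and is not part of Docusaurus, MDX, or TypeDoc. It assumes that an underscore between two alphanumeric characters is never an emphasis delimiter in CommonMark, so the backslash in front of it is redundant.

```typescript
// Hypothetical helper, for illustration only.
// Removes the backslash from an escaped underscore that sits between two
// ASCII alphanumeric characters, e.g. "TEAR\_FALLING" -> "TEAR_FALLING".
// Escaped underscores at a word boundary (e.g. "\_foo\_") are left alone,
// because removing those escapes could turn the text into emphasis.
// Limitation: this sketch does not skip code spans or fenced code blocks,
// where a backslash is a literal character and should be preserved.
function unescapeIntrawordUnderscores(markdown: string): string {
  // Capture the preceding character and re-emit it; the lookahead keeps the
  // following character available for the next match ("a\_b\_c").
  return markdown.replace(/([A-Za-z0-9])\\_(?=[A-Za-z0-9])/g, "$1_");
}

const heading = "### TEAR\\_FALLING\\_ACCELERATION";
console.log(unescapeIntrawordUnderscores(heading));
// prints "### TEAR_FALLING_ACCELERATION"
```

Running a step like this over the generated Markdown before Docusaurus sees it would avoid the split, at the cost of maintaining one more transformation in the pipeline.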

Your environment

Public source code: https://github.com/IsaacScript/isaacscript
Public site URL: https://isaacscript.github.io/
Docusaurus version used: 2.3.1

Self-service

[ ] I'd be willing to fix this bug myself.

Josh-Cena commented 1 year ago

This is an MDX issue. You can try it in the MDX playground (https://mdx-git-renovate-babel-monorepo-mdx.vercel.app/playground/). It generates something like:

<h2>{`a`}{`_`}{`b`}{`_`}{`c`}</h2>

Which is multiple text nodes, and when you transform it to HTML, this also results in multiple text nodes.
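As a rough illustration of why the split matters for search, the sketch below models the five text nodes as plain strings. `joinedText` mirrors what a reader sees (and what the DOM `textContent` property returns), while `perNodeTokens` is a hypothetical naive indexer that treats each text node separately. How the Algolia/Typesense scraper actually tokenizes is not shown in this thread, so the second function is an assumption.

```typescript
// The five text nodes from the MDX output above, modelled as plain strings.
const textNodes: string[] = ["a", "_", "b", "_", "c"];

// Concatenate the nodes in document order, as the rendered page does.
function joinedText(nodes: string[]): string {
  return nodes.join("");
}

// Hypothetical naive indexer: handles each text node on its own and drops
// nodes that contain no letters or digits.
function perNodeTokens(nodes: string[]): string[] {
  return nodes.filter((node) => /[A-Za-z0-9]/.test(node));
}

console.log(joinedText(textNodes)); // prints "a_b_c"
console.log(perNodeTokens(textNodes)); // logs the array ["a", "b", "c"]
```

A crawler that reads the joined text of the heading would still see the full identifier, which is the basis for treating this as a crawler-side problem.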

We have a very similar issue here: https://github.com/facebook/docusaurus/issues/8617

I personally do not see a way we can fix this, and believe it should be a crawler bug. cc @shortcuts

Zamiell commented 1 year ago

I realized today that Prettier is smart enough to remove unnecessary escape characters when formatting Markdown files.

Thus, a solution for my use-case is to insert Prettier into my Docusaurus pipeline. (In other words, I ensure that output from TypeDoc is formatted with Prettier before feeding it to Docusaurus.)

Now, I no longer get the broken up nodes.

> I personally do not see a way we can fix this

I'll close the issue then for now, thanks Josh.

slorber commented 1 year ago

Note that MDX 2 no longer seems to create multiple text nodes, so Docusaurus v3 might fix this.
