facebook / docusaurus

Easy to maintain open source documentation websites.
https://docusaurus.io
MIT License
56.67k stars, 8.52k forks

Markdown escape characters break indexing (with underscores) #8645

Closed by Zamiell 1 year ago

Zamiell commented 1 year ago

Description

I have some Markdown content like this:

### TEAR\_FALLING\_ACCELERATION

• **TEAR\_FALLING\_ACCELERATION** = ``5``

Corresponds to `CacheFlag.RANGE` (1 << 3) and `EntityPlayer.TearFallingAcceleration`.

For reference, this content was generated by TypeDoc, a popular documentation generation tool. TypeDoc puts escape characters before underscores, because underscores have semantic meaning in Markdown - they transform the text to be either bold or italic. So I would consider this escaping behavior to be "correct" from TypeDoc.

I feed this Markdown content to Docusaurus, and it creates a website for me. The resulting HTML looks like this:

[Screenshot: screen_shot_2023-02-08_at_4 51 33_pm]

This is strange, and appears to be a bug. I would naively expect this element to simply contain the single string TEAR_FALLING_SPEED.

Presumably, this behavior is an artifact of having the escape characters. Visually, the webpage looks fine, as the end user is not able to tell that the text is not actually contiguous. However, when scraping the website with the Algolia/Typesense scraper, it chokes on this content and is not able to index it properly. Thus, when a user searches for "TEAR_FALLING_SPEED", there are no matches, because all the indexer saw was "TEAR", "FALLING", and "SPEED".

As previously mentioned, underscores carry special semantic meaning in Markdown content, but they do not carry any special semantic meaning in HTML content. Thus, I suspect that Docusaurus is doing too much here. Instead of breaking up the content into multiple tokens, it should be able to simply see that there is an unnecessary escape before an underscore, and then remove it.
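As an illustration of the suggestion above, here is a minimal sketch of a preprocessing step that removes a backslash before an underscore when the underscore sits between two alphanumeric characters. In that position CommonMark does not treat `_` as an emphasis delimiter, so the escape is redundant. The function name `unescapeIntrawordUnderscores` is hypothetical; it is not part of Docusaurus, MDX, or TypeDoc, and the sketch does not try to skip code spans or fenced code blocks.

```typescript
// Hypothetical helper, not an API of Docusaurus, MDX, or TypeDoc.
// Removes "\" before "_" only when the underscore is surrounded by
// alphanumeric characters on both sides (an intraword underscore).
// Assumption: the input has no code spans or fences that rely on a literal "\_".
function unescapeIntrawordUnderscores(markdown: string): string {
  // Group 1 keeps the character before the escape; the lookahead checks
  // the character after the underscore without consuming it, so
  // consecutive escapes such as A\_B\_C are all matched.
  return markdown.replace(/([A-Za-z0-9])\\_(?=[A-Za-z0-9])/g, "$1_");
}

const heading = "### TEAR\\_FALLING\\_ACCELERATION";
console.log(unescapeIntrawordUnderscores(heading));
// ### TEAR_FALLING_ACCELERATION
```

Escapes at the start or end of a word, where the underscore could open or close emphasis, are deliberately left untouched.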

Josh-Cena commented 1 year ago

This is an MDX issue. You can try it here: https://mdx-git-renovate-babel-monorepo-mdx.vercel.app/playground/. It generates something like:

<h2>{`a`}{`_`}{`b`}{`_`}{`c`}</h2>

Which is multiple text nodes, and when you transform it to HTML, this also results in multiple text nodes.
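To make the effect concrete, here is a toy model of that heading, with one string per text node. It is only a sketch: `perNodeTokens` and `joinedText` are hypothetical names, and this is not how the Algolia or Typesense scraper is actually implemented. It shows why reading node by node yields fragments, while concatenating the nodes (as the DOM `textContent` property does for an element) recovers the full identifier.

```typescript
// Toy model: the children of the <h2> above, one entry per text node.
const textNodes: string[] = ["a", "_", "b", "_", "c"];

// Hypothetical "per node" reader: keeps only nodes that contain a
// letter or digit, so the underscore-only nodes are dropped.
function perNodeTokens(nodes: string[]): string[] {
  return nodes.filter((node) => /[A-Za-z0-9]/.test(node));
}

// Hypothetical "whole element" reader: concatenates every text node,
// which is what textContent returns for the element.
function joinedText(nodes: string[]): string {
  return nodes.join("");
}

console.log(perNodeTokens(textNodes)); // [ 'a', 'b', 'c' ]
console.log(joinedText(textNodes)); // a_b_c
```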

We have a very similar issue here: https://github.com/facebook/docusaurus/issues/8617

I personally do not see a way we can fix this, and believe it should be a crawler bug. cc @shortcuts

Zamiell commented 1 year ago

I realized today that Prettier is smart enough to remove unnecessary escape characters when formatting Markdown files.

Thus, a solution for my use-case is to insert Prettier into my Docusaurus pipeline. (In other words, I ensure that output from TypeDoc is formatted with Prettier before feeding it to Docusaurus.)

Now, I no longer get the broken up nodes.

> I personally do not see a way we can fix this

I'll close the issue then for now, thanks Josh.

slorber commented 1 year ago

Note that MDX 2 no longer seems to create multiple text nodes, so Docusaurus v3 might fix this.