Comments can end up displaying inside description meta

jmynes commented 2 years ago

Have you read the Contributing Guidelines on issues?

[X] I have read the Contributing Guidelines on issues.

Prerequisites

[X] I'm using the latest version of Docusaurus.
[X] I have tried the npm run clear or yarn clear command.
[X] I have tried rm -rf node_modules yarn.lock package-lock.json and re-installing packages.
[ ] I have tried creating a repro with https://new.docusaurus.io.
[X] I have read the console error message carefully (if applicable).

Description

Issue: My .md or .mdx files are loading comments into the content of the meta property og:description <meta property="og:description" content="<!-- This comment WILL NOT be ignored:" data-react-helmet="true">

Steps to reproduce

Example code structure to replicate this issue:

---
slug: /link_here
title: Title text here
---

<!--
  This comment will be ignored:
-->
<head>
  <meta name="description" content="This description will work correctly" />
</head>

<!-- This comment WILL NOT be ignored:
 - Point 1
 - Point 2
 -  etc -->

Page text here, lorem ipsum etc etc etc

## h2, etc etc etc

Expected behavior

og:description should not be pulling comments in to fill its content

Actual behavior

og:description seems to pull the first text after even if it's a comment?

No log messages output.

Your environment

Docusaurus version used: 2.0.0-beta.13, & 2.0.0-beta.9
Environment name and version (e.g. Chrome 89, Node.js 16.4): Edge Version 96.0.1054.57 (Official build) (64-bit)
Operating system and version (e.g. Ubuntu 20.04.2 LTS): Windows 10 Pro - 21H2

I have already fixed this issue in prod on our site's affected page, https://docs.domination.finance/liquidity_providers, by manually overriding the meta tag for og:description

Reproducible demo

No response

Self-service

[ ] I'd be willing to fix this bug myself.

Josh-Cena commented 2 years ago

Can reproduce and confirmed it's not an MDX problem 👍 Will look into this

Josh-Cena commented 2 years ago

Haha, @jmynes I know what's wrong.

The default description generator just takes the first line from the text and tries to sanitize it into plain text. However, multiline comments that only contain the opening directive on this line aren't ignored (bug here!).
When you manually specify the description meta, that property gets overridden successfully. However, it's not synced with og:description (for good reason!).
Still, I'm unsure why you are getting content="<!-- This comment WILL NOT be ignored:". I'm getting content="<!--" because that's literally the first line in the text. Could you confirm that's the case and you just made a mistake in making the demo?

Josh-Cena commented 2 years ago

@slorber I'm quite tired of patching the Markdown parsing logic here and there. Could we just make a Remark plugin to do this? The same goes for #5659.

slorber commented 2 years ago

Hey,

I'm sorry but I don't really understand this issue, it seems a bit messy to me.

Please be aware that the meta description that we infer from your content is only a best-effort, and it's preferable to use an explicit description.

You can do that in 2 ways:

provide a description: my description frontmatter (sync to both description + og:description
provide your own metas to override default ones, but you'll have to declare all of them, not just description:

<head>
  <meta name="description" content="This description will work correctly" />
  <meta property="og:description" content="This description will work correctly" />
</head>

@Josh-Cena this isn't really worth it to invest fixing the default, we should mostly recommend to always use a description frontmatter for better SEO control. We try to provide good defaults but ultimately it will never be perfect.

Josh-Cena commented 2 years ago

No, the problem is:

We have a very complicated Markdown parsing logic for the file path -> URL linkification + generating description already.
This is ultimately very ad-hoc and never addresses all edge cases because we just use simple regex tests.
We already have run into some troubles that led to us essentially needing to re-implement remark (#5670).

My proposal is that we get rid of the makeshift markdownParser entirely and build everything as part of the MDX loading process. e.g. we can simply export the plain description string as another MDX export and render it client-side.

Josh-Cena commented 2 years ago

Patching the description generation is easy: just try to ignore multi-line comments like we do for fenced code blocks, like a state machine. The problem is, where do we end?

Josh-Cena commented 2 years ago

But yes, @jmynes I failed to mention that you can set a front matter instead of using the <head> metadata tag for the description, which will be applied for both description and og:description. However, the issue that this bug report surfaces is legitimate.

slorber commented 2 years ago

My proposal is that we get rid of the makeshift markdownParser entirely and build everything as part of the MDX loading process. e.g. we can simply export the plain description string as another MDX export and render it client-side.

That wouldn't really allow being more compatible with CommonMark. export const description wouldn't be usable by users that opt out of MDX on purpose.

In the end, using content as SEO description can't work in all situations. Even if we have the perfect code to skip comments, ultimately the first sentence may not be a good text for this use-case. We could as well totally disable this fallback and always ask users for an explicit description.

Not handling comments correctly can be considered as a bug, but it's a low priority one considering using description inference is not the best idea in the first place and we should encourage to use an explicit description. Still, if you find a way to improve the inference, why not, but for me the workarounds are good enough

Josh-Cena commented 2 years ago

That wouldn't really allow being more compatible with CommonMark. export const description wouldn't be usable by users that opt out of MDX on purpose.

It's not about CM compatibility. It's about ridding the pain to support a custom Markdown parser that falls apart on many edge cases.

I agree with two things:

The default impl doesn't have to be perfect, maybe not even have to make sense;
We should encourage users to provide SEO information explicitly.

However, we can easily refactor this into a remark plugin. I would probably raise a POC later

slorber commented 2 years ago

However, we can easily refactor this into a remark plugin. I would probably raise a POC later

Agree that reading things from content with Regexp is not the best. A remark plugin can be nice to extract that description (and blog excerpts probably too).

In which case it's probably a nice idea to ensure that we don't parse the whole doc as AST and the visitor can bail-out asap, for perf reasons. I don't know if this remark plugin could run as part of mdx-loader (it means description wouldn't be part anymore of doc metadatas)

Josh-Cena commented 2 years ago

Mmm, we can certainly use Remark outside of MDX. But we'll see if we run into any trouble with the lack of description as metadata

Josh-Cena commented 2 years ago

@slorber I've spun up a PoC in https://github.com/facebook/docusaurus/pull/5670. This would solve both the linkification and excerpt problem.

facebook / docusaurus