RFC: CommonMark compatibility, supporting multiple markdown/content parsers

slorber commented 4 years ago

💥 Proposal

People using Docusaurus don't always like the MDX parser:

If you come from an existing Markdown docs base (like v1), you need to make it compatible with MDX, even if you don't plan to embed any JSX components in the Markdown
You might want to stay compatible with CommonMark, and with the existing ecosystem (GitHub's Markdown viewer, markdownlint, etc.)
It creates more "lock-in", because to leave MDX you have to convert back to CommonMark
It can be confusing to not be able to use CommonMark (i.e. HTML tags, not JSX) in .md files, and to learn that even .md files are parsed with MDX

Related discussions:

Solution ?

These libs:

is also based on UnifiedJS ecosystem
allows to pass custom React elements to replace existing tags

We may be able to build some shared abstraction on top of react-markdown + MDX.

If this works, we could switch from one parser to another with a simple switch/setting, that could be:

.md -> common-mark compatible parser
.mdx -> MDX
global default D2 parser setting
parser frontmatter

The idea would be that, if a doc does not embed any html/jsx, we could switch from one parser to the other, and shouldn't notice any change.
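
As a rough illustration of how those switches could combine, here is a minimal sketch. Every name in it (`resolveFormat`, `FormatOptions`, the `parser` front matter field) is hypothetical and not an existing Docusaurus API, and the precedence order (front matter, then global default, then file extension) is only an assumption.

```typescript
// Hypothetical sketch: none of these names exist in Docusaurus.
type Format = "md" | "mdx";

interface FormatOptions {
  /** Site-wide default, used when the file does not say otherwise. */
  globalDefault?: Format;
  /** Value of a hypothetical `parser` front matter field, if present. */
  frontMatterParser?: string;
}

function resolveFormat(filePath: string, options: FormatOptions = {}): Format {
  const { frontMatterParser, globalDefault } = options;

  // 1. A valid per-file front matter value wins (assumed precedence).
  if (frontMatterParser === "md" || frontMatterParser === "mdx") {
    return frontMatterParser;
  }

  // 2. Then the site-wide default, if one is configured.
  if (globalDefault) {
    return globalDefault;
  }

  // 3. Otherwise fall back to the extension: .mdx -> MDX, anything else -> CommonMark.
  return filePath.toLowerCase().endsWith(".mdx") ? "mdx" : "md";
}
```

With no options, `resolveFormat("docs/intro.md")` picks the CommonMark-compatible parser and `resolveFormat("docs/intro.mdx")` picks MDX.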

--

Feedback welcome

slorber commented 4 years ago

Sidenote: the <!--truncate--> marker used for blog summaries will likely not work in MDX 2:

Edit: I think it will still work because it's processed before mdx compilation

borekb commented 4 years ago

That sounds excellent!

Thinking about the common abstraction you mentioned, and assuming that this is still the overarching goal:

Beyond that, Docusaurus 2 is a performant static site generator and can be used to create common content-driven websites (e.g. Documentation, Blogs, Product Landing and Marketing Pages, etc) extremely quickly.

I think that Docusaurus could document an interface for plugins / formats / loaders (I don't know how to call them) that could possibly look like this:

At the base level, the format should be able to produce an HTML output, i.e., an HTML string. For example, if I have a .txt file, I'd be able to write a "format" that produces <pre>... contents of the txt file ...</pre>. Since this is just a string, Docusaurus wouldn't operate on it in any way, just display it.
Smarter formats would return some sort of AST (or JSX or whatever would be suitable). For example, if I wanted to implement .md that turns code blocks to live playgrounds, like React Styleguidist does, I'd be able to do that.
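
To make the two levels above concrete, here is a minimal sketch of what such an interface might look like. `ContentFormat`, `FormatOutput` and `plainTextFormat` are hypothetical names invented for illustration. Docusaurus does not document anything like this today.

```typescript
// Hypothetical interface: nothing here is an existing Docusaurus API.
type FormatOutput =
  | { kind: "html"; html: string } // base level: an opaque HTML string, displayed as-is
  | { kind: "ast"; tree: unknown }; // smarter formats: some tree a renderer understands

interface ContentFormat {
  /** File extensions this format handles, e.g. [".txt"]. */
  extensions: string[];
  /** Turn the raw file contents into something displayable. */
  load(source: string): FormatOutput;
}

// Escape the characters that would otherwise be read as markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// The .txt example from the comment: wrap the file contents in <pre>.
const plainTextFormat: ContentFormat = {
  extensions: [".txt"],
  load: (source) => ({ kind: "html", html: `<pre>${escapeHtml(source)}</pre>` }),
};
```

The discriminated union keeps the "just a string" case trivially simple while leaving room for richer outputs.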

Some wilder use cases this would cover (I actually had them in the past):

A marketing team uses headless WordPress to maintain the contents of landing pages.
Feature comparison / grid is maintained in a Google Sheet.
.md files use some sort of Markdown dialect, for example, the site used to be powered by MkDocs and uses Python-Markdown plus a couple of custom extensions.

For this RFC, I think it's more than enough to support CommonMark but since I've now spent some time thinking about how we'd use Docusaurus and what I'd love it to allow me to do, I thought I'd post it here.

Thanks a lot for this RFC and all the work that goes into Docusaurus!

borekb commented 4 years ago

@slorber I'd like to create a prototype of CommonMark support but am unfamiliar with Docusaurus codebase so would really appreciate high-level guidelines if you will.

Roughly speaking, if I wanted to parse .md files as CommonMark, which parts of the codebase would I need to touch? I can overwrite the code in a fork for now, i.e., it's not my ambition yet to make this a general solution supporting both MDX and CommonMark. I just want to see what the minimal set of changes is to swap the MDX parser for something like remark.

Any hints appreciated 🙏 .

slorber commented 4 years ago

Hi,

My first intuition would be to modify "docusaurus-mdx-loader", and provide a loader option to tell it to load the files as md or mdx. In the end, we need a React component anyway, using MDX, but in md mode, we could convert the html elements to JSX elements just before feeding mdx, so that mdx is happy?

Not sure, this would require some experiments to see if this is possible

borekb commented 4 years ago

Thanks a lot, I'll give it a go later this week or the next one.

borekb commented 4 years ago

plugin-content-docs that supports CommonMark for `.md` files

We've experimented with plain Markdown support in an internal prototype and wanted to post the key results here.

Summary

It's doable and not that complex: about 200 LoC. There's currently some ugliness. For example, to get the ToC, we convert the AST to React components and then serialize the result back to a string. We haven't found a better solution yet, but I'm sure there is one, e.g. something like hast-util-to-jsx, if it were maintained.

How it's done

The rewritten plugin-content-docs-2/index.ts customizes loaders:

For .mdx files, use @docusaurus/mdx-loader
For .md files, use a custom loader (see below).

In our prototype, we first duplicate ~20 LoC from the base implementation and then customize the loaders. The entire file (certainly with opportunities for further cleanup) looks like this:

import path from 'path';

import admonitions from 'remark-admonitions';
import {STATIC_DIR_NAME} from '@docusaurus/core/lib/constants';
import {
  docuHash,
  aliasedSitePath,
} from '@docusaurus/utils';
import {
  LoadContext,
  Plugin,
  OptionValidationContext,
  ValidationResult,
} from '@docusaurus/types';

import loadEnv from '@docusaurus/plugin-content-docs/lib/env';

import {
  PluginOptions,
  LoadedContent,
  SourceToPermalink,
} from '@docusaurus/plugin-content-docs/lib/types';
import {Configuration} from 'webpack';
import {VERSIONS_JSON_FILE} from '@docusaurus/plugin-content-docs/lib/constants';
import {PluginOptionSchema} from '@docusaurus/plugin-content-docs/lib/pluginOptionSchema';
import {ValidationError} from '@hapi/joi';

import * as originalPluginContentDocs from '@docusaurus/plugin-content-docs';

export default function pluginContentDocs(
  context: LoadContext,
  options: PluginOptions,
): Plugin<LoadedContent | null, typeof PluginOptionSchema> {

  if (options.admonitions) {
    options.remarkPlugins = options.remarkPlugins.concat([
      [admonitions, options.admonitions],
    ]);
  }

  const {siteDir, generatedFilesDir} = context;
  const docsDir = path.resolve(siteDir, options.path);
  const sourceToPermalink: SourceToPermalink = {};

  const dataDir = path.join(
    generatedFilesDir,
    'docusaurus-plugin-content-docs',
    // options.id ?? 'default', // TODO support multi-instance
  );

  // Versioning.
  const env = loadEnv(siteDir, {disableVersioning: options.disableVersioning});
  const {versioning} = env;
  const {
    docsDir: versionedDir,
  } = versioning;

  const result = originalPluginContentDocs.default(context, options);
  result.configureWebpack = function (_config, isServer, utils) {
    const {getBabelLoader, getCacheLoader} = utils;
    const {rehypePlugins, remarkPlugins} = options;
    // Suppress warnings about non-existing of versions file.
    const stats = {
      warningsFilter: [VERSIONS_JSON_FILE],
    };

    return {
      stats,
      devServer: {
        stats,
      },
      resolve: {
        alias: {
          '~docs': dataDir,
        },
      },
      module: {
        rules: [
          {
            test: /(\.mdx)$/,
            include: [docsDir, versionedDir].filter(Boolean),
            use: [
              getCacheLoader(isServer),
              getBabelLoader(isServer),
              {
                loader: require.resolve('@docusaurus/mdx-loader'),
                options: {
                  remarkPlugins,
                  rehypePlugins,
                  staticDir: path.join(siteDir, STATIC_DIR_NAME),
                  metadataPath: (mdxPath: string) => {
                    // Note that metadataPath must be the same/in-sync as
                    // the path from createData for each MDX.
                    const aliasedSource = aliasedSitePath(mdxPath, siteDir);
                    return path.join(
                      dataDir,
                      `${docuHash(aliasedSource)}.json`,
                    );
                  },
                },
              },
              {
                loader: path.resolve(__dirname, './markdown/index.js'),
                options: {
                  siteDir,
                  docsDir,
                  sourceToPermalink,
                  versionedDir,
                },
              },
            ].filter(Boolean),
          },
          {
            test: /(\.md)$/,
            include: [docsDir, versionedDir].filter(Boolean),
            use: [
              getCacheLoader(isServer),
              getBabelLoader(isServer),
              {
                loader: path.resolve(__dirname, './custom-md-loader/index.js'),
                options: {
                  remarkPlugins,
                  rehypePlugins,
                  staticDir: path.join(siteDir, STATIC_DIR_NAME),
                  metadataPath: (mdxPath: string) => {
                    // Note that metadataPath must be the same/in-sync as
                    // the path from createData for each MDX.
                    const aliasedSource = aliasedSitePath(mdxPath, siteDir);
                    return path.join(
                      dataDir,
                      `${docuHash(aliasedSource)}.json`,
                    );
                  },
                },
              },
              {
                loader: path.resolve(__dirname, './markdown/index.js'),
                options: {
                  siteDir,
                  docsDir,
                  sourceToPermalink,
                  versionedDir,
                },
              },
            ].filter(Boolean),
          },
        ],
      },
    } as Configuration;
  }

  return result;
}

export function validateOptions({
  validate,
  options,
}: OptionValidationContext<PluginOptions, ValidationError>): ValidationResult<
  PluginOptions,
  ValidationError
> {
  return originalPluginContentDocs.validateOptions({validate, options});
}

Then there's a custom loader: plugin-content-docs-2/src/custom-md-loader/index.ts. It looks like this in full:

import {loader} from 'webpack';
import {getOptions} from 'loader-utils';
import {readFileSync} from 'fs-extra';
import matter from 'gray-matter';
import stringifyObject from 'stringify-object';
import unified from 'unified';
import parse from 'remark-parse';
import remark2rehype from 'remark-rehype';
import rehype2react from 'rehype-react';
import React from 'react';
import rightToc from '@docusaurus/mdx-loader/src/remark/rightToc';
import slug from 'remark-slug';
import raw from 'rehype-raw';
import emoji from 'remark-emoji';
import admonitions from 'remark-admonitions';
import headings from 'rehype-autolink-headings';
import highlight from '@mapbox/rehype-prism';
import reactElementToJSXString from 'react-element-to-jsx-string';

const mdLoader: loader.Loader = function (fileString) {
  const callback = this.async();

  const {data, content} = matter(fileString);

  const options = getOptions(this) || {};

  let exportStr = `export const frontMatter = ${stringifyObject(data)};`;
  // Read metadata for this MDX and export it.
  if (options.metadataPath && typeof options.metadataPath === 'function') {
    const metadataPath = options.metadataPath(this.resourcePath);
    if (metadataPath) {
      // Add as dependency of this loader result so that we can
      // recompile if metadata is changed.
      this.addDependency(metadataPath);
      const metadata = readFileSync(metadataPath, 'utf8');
      exportStr += `\nexport const metadata = ${metadata};`;
    }
  }

  const processedMd = unified()
    .use(parse, {commonmark: true})
    .use(slug)
    .use(emoji)
    .use(admonitions)
    .use(rightToc)
    .use(remark2rehype, {allowDangerousHtml: true})
    .use(raw)
    .use(headings)
    .use(highlight)
    .use(rehype2react, {createElement: React.createElement, Fragment: React.Fragment})
    .processSync(content);

  const jsxString = reactElementToJSXString((processedMd as any).result);

  // I don't like this at all, but it's a prototype...
  // We need to get 'rightToc' data from the JSX string, so following lines
  // are about getting the info and then replacing it, along with escaping unwanted chars.
  const rightTocString = jsxString
    .match(/(export const rightToc = \[[\s\S.]*\];)/)![1]
    .replace(/(\\n)|(\\t)|(\\)/g, '');

  const escapedJsxString = jsxString
    .replace(/{\`[\S\s.]*?export const rightToc = \[[\s\S.]*\];[\S\s.]*?\`}/, '')
    .replace(/{'[\s\S]*?'}/g, `{' '}`)
    .replace(/`/g, '\`');

  const code = `
  import React from 'react';

  ${rightTocString}
  ${exportStr}

  export default function MDLoader() {
    return (${escapedJsxString});
  }
  `;

  return callback && callback(null, code);
};

export default mdLoader;

If there wasn't the ugly React to string parsing code, it would actually be quite simple.
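
One possible direction for avoiding the string round-trip, which is not what the prototype above does, would be to collect the ToC directly from the Markdown AST before it is turned into React elements. The sketch below assumes a simplified mdast-like node shape, and `MdNode`, `TocEntry` and `collectToc` are hypothetical names. The real rightToc plugin produces a different, richer structure.

```typescript
// Simplified, mdast-like node shape (assumption: real nodes carry more fields).
interface MdNode {
  type: string;
  depth?: number; // set on heading nodes
  value?: string; // set on text-like leaf nodes
  children?: MdNode[];
}

// Hypothetical flat ToC entry, not the shape Docusaurus uses.
interface TocEntry {
  value: string;
  level: number;
}

// Concatenate the text of a node and all of its descendants.
function textOf(node: MdNode): string {
  return node.value ?? (node.children ?? []).map(textOf).join("");
}

// Walk the tree and record every heading with its text and depth.
function collectToc(node: MdNode, toc: TocEntry[] = []): TocEntry[] {
  if (node.type === "heading" && node.depth !== undefined) {
    toc.push({ value: textOf(node), level: node.depth });
  } else {
    for (const child of node.children ?? []) {
      collectToc(child, toc);
    }
  }
  return toc;
}
```

The resulting array could then be serialized with something like `JSON.stringify` into the `export const rightToc = ...` line, without any regex over JSX strings.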

The downside from the maintenance point of view is that the MD loader is explicit about its unified.js plugins while the MDX loader is a bit more indirect / obscure, so there would be two places to maintain this configuration. But I think this could be refactored to be more aligned, and even in the worst case, it's like 15 lines of code and the default set of plugins probably isn't changing that often.

Overall, it seems feasible to me.

borekb commented 4 years ago

An alternative approach would be to convert MD to MDX first and then just let the mdx-loader do its thing. But there probably isn't currently a converter from MD to MDX in the unified ecosystem, though many pieces are in place: https://github.com/unifiedjs/ideas/issues/9.

slorber commented 4 years ago

Thanks for those details, that looks interesting. If MDX provided a converter, that would be great; it would also be helpful for v1->v2 migrations.

I don't have much time to explore these ideas but we'll come back to it someday.

Note, not sure it's related, but there's a large docs plugin refactor here: https://github.com/facebook/docusaurus/pull/3245

nilsocket commented 3 years ago

Is it possible to have something simple, which works out of the box?

I need math blocks. I looked at the MDX documentation, and it's too messy and complicated.

Docusaurus seems to work on the basic assumption that its users are front-end developers who know JSX, React, ..., or at least it targets only those users.

or

Is there a simple way to get math block support?

Thank you.

slorber commented 3 years ago

@nilsocket I don't think math blocks (latex/katex?) are really related to the markdown parser. But you are right, and we should make this easy. Can you explain better your usecase on this new issue I just created? https://github.com/facebook/docusaurus/issues/4625

lukejgaskell commented 3 years ago

@slorber is there an official way of handling this? I have a similar situation where I don't want my .md files validated with .mdx.

slorber commented 3 years ago

@lukejgaskell unfortunately no easy solution can be implemented in userland to solve this properly. The solution proposed by @borekb is likely the best you can do, and I understand it might be intimidating 😅

MDX is not a "validator" for md files, it converts those files to React components that are loaded as JS modules in the client app through webpack loaders.

To make this compatible with CommonMark, this would require the loader to not use MDX in some cases but use a different Remark parsing logic.

For .md files we even have 2 choices now:

convert those files to React components, but use CommonMark compatible processing (solution of @borekb )
convert those files to some AST that a small client-side runtime could render (it may be more performant for build time, but will have to poc this).

Some challenges to consider:

The goal is not only to support CommonMark, but also try to reduce build times/improve perfs for sites not needing MDX (or with limited usage)
Some non-MDX Docusaurus markdown features (admonitions, code blocks etc...) should rather keep working when switching the parser

This is something I want to work on but I don't have time in the short term.

lukejgaskell commented 3 years ago

@slorber That makes sense, thank you for the detailed explanation. If it's helpful, my use case is that I'm importing markdown from different sources to host on a single site. That markdown may or may not follow the same syntax as the current loader.

For example, some of it uses <pre> tags, or other HTML elements, but not always correctly... which makes me have to escape them. To fix my scenario I end up doing a bunch of regex parsing to get those files to align with the loader. Maybe there are other ways to handle these scenarios, but having loader options could be helpful as different sources have different lax practices on their markdown.

Usually it ends up breaking in the build (because of the mdx loader) even though I'd like it to just show a broken file in those scenarios. Anyways, here's the regex I end up doing to solve some of this:

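// Assumption: `replace` comes from the replace-in-file package (the import was not shown).
const replace = require("replace-in-file");

const replaceLT = (m, group1) => (!group1 ? m : "&lt;");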
const replaceGT = (m, group1) => (!group1 ? m : "&gt;");
const replaceFileLink = (m) => m.replace("(", "(pathname://");

async function run() {
  await replace({
    files: ["docs/**/*.md"],
    from: [
      /<pre>/g,
      /<\/pre>/g,
      /<!--.*-->/g,
      /\[.*?\]\(.*?\.(json|xlsx|xls|zip|docx|ps1)\)/g, // fix file type links to not be picked up by loader
      /\\`|`(?:\\`|[^`])*`|(<)/gm, //find all less than symbols that are not between backticks
      /\\`|`(?:\\`|[^`])*`|(>)/gm, //find all greater than symbols that are not between backticks
    ],
    to: ["```", "```", "", replaceFileLink, replaceLT, replaceGT],
  });
}
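
The last two patterns rely on a "match what you want to skip, capture what you want to change" trick: escaped backticks and whole inline-code spans are matched first and returned untouched, and only a bare `<` or `>` outside them fills capture group 1. A minimal standalone sketch of the same idea follows (`escapeLtOutsideCode` is a hypothetical helper name, and it only handles single-backtick inline code, like the original regex):

```typescript
// Alternatives, tried left to right at each position:
//   \\`               an escaped backtick      -> skipped
//   `(?:\\`|[^`])*`   a whole inline-code span -> skipped
//   (<)               a bare "<"               -> captured in group 1
const LT_OUTSIDE_CODE = /\\`|`(?:\\`|[^`])*`|(<)/g;

function escapeLtOutsideCode(markdown: string): string {
  return markdown.replace(LT_OUTSIDE_CODE, (match: string, lt?: string) =>
    // Group 1 is only set when the third alternative matched.
    lt ? "&lt;" : match,
  );
}
```

Because the code span is consumed as a single match, any `<` inside it is never seen by the third alternative.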

Josh-Cena commented 2 years ago

I'm 👎 on the point of letting users specify another parser, since it's very hard to make that line up with our build pipeline (e.g. remark plugins we already have, and the Markdown lifecycles we are to have). What could happen is we build a compatibility layer on top of the MDX compiler and transform incompatible syntax (style="" and class="" being the two most notable) to what MDX expects. Users can always extend Markdown syntax by installing/building custom remark plugins, so there's no need to swap out the parser. MDX (and the unified system behind it) is designed to be completely customizable. This is especially the case after we've migrated to MDX v2: there are much fewer quirks when JSX and Markdown co-exist.
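
For illustration, the kind of transformation such a compatibility layer would need for `class=""` and `style=""` can be sketched as below. This is a naive sketch under stated assumptions, not how Docusaurus implements it: `styleStringToObject` and `htmlAttrsToJsxProps` are hypothetical helpers, and the style parser splits on `;` so it would mishandle values that themselves contain semicolons (e.g. data URLs).

```typescript
// Convert a CSS declaration string ("color: red; font-size: 12px")
// into the object form React expects ({color: "red", fontSize: "12px"}).
function styleStringToObject(style: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const declaration of style.split(";")) {
    const separator = declaration.indexOf(":");
    if (separator === -1) continue; // skip empty or malformed chunks
    const property = declaration.slice(0, separator).trim();
    const value = declaration.slice(separator + 1).trim();
    if (!property || !value) continue;
    // CSS custom properties (--foo) keep their name; others become camelCase.
    const key = property.startsWith("--")
      ? property
      : property.toLowerCase().replace(/-([a-z])/g, (_, c: string) => c.toUpperCase());
    result[key] = value;
  }
  return result;
}

// Map HTML attribute names/values to their JSX prop equivalents.
function htmlAttrsToJsxProps(attrs: Record<string, string>): Record<string, unknown> {
  const props: Record<string, unknown> = {};
  for (const [name, value] of Object.entries(attrs)) {
    if (name === "class") {
      props.className = value;
    } else if (name === "style") {
      props.style = styleStringToObject(value);
    } else {
      props[name] = value;
    }
  }
  return props;
}
```

A real implementation would more likely run as a remark/rehype plugin over the parsed tree than over attribute maps like this.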

slorber commented 2 years ago

yes, we'll see if it's still relevant after upgrading to MDX 2.

We'll need some dataset of existing commonmark docs to see what kind of issue we notice with MDX 2

slorber commented 2 years ago

@Josh-Cena apart from Markdown, some users might find it useful to use other content formats alongside the docs plugin (JSON, AsciiDoc...)

I think it could make sense to allow the docs plugin to emit content in different formats than a React component (MDX), and allow users to provide their own renderer.

Being able to pass content as json makes sense for a lot of tooling, and also CMS integrations that generally output JSON. We probably don't want to create artificial intermediate mdx files in this case.

Now it does not mean we'd add an alternative md parser ourselves, but having a flexible api could allow users to implement this themselves if they really need to.

Josh-Cena commented 2 years ago

Yes, but all these data formats eventually have to become some structured data that is compatible with our architecture. For example, JSON + React components work for external docs plugins. However, if we allow swapping our Markdown parser with something else, how would the data transformation work? Currently it's MDX -> JSX; do other parsers offer compilation to JSX-compatible formats?

slorber commented 2 years ago

They don't need to be converted to JSX. The node parser can create a JSON structure, and then the theme component can render that JSON structure (and the users can write this logic themselves).

The "content" prop could be JSON (mdast, hast, custom ast, proprietary cms json) or even just raw pre-formatted HTML strings
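
A minimal sketch of what such a user-written renderer could look like, assuming a deliberately simplified node shape loosely modelled on hast (no attributes, no void elements). `ContentNode`, `Content` and `renderContent` are hypothetical names, and it renders to an HTML string only to stay self-contained; a theme component would more likely produce React elements.

```typescript
// Simplified, hypothetical node shape loosely modelled on hast.
type ContentNode =
  | { type: "text"; value: string }
  | { type: "element"; tagName: string; children: ContentNode[] };

// The "content" prop: either a raw pre-formatted HTML string or a JSON tree.
type Content = string | ContentNode;

// Escape text so it cannot be interpreted as markup.
function escapeText(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Recursively render a tree node to an HTML string.
function renderNode(node: ContentNode): string {
  if (node.type === "text") {
    return escapeText(node.value);
  }
  const inner = node.children.map(renderNode).join("");
  return `<${node.tagName}>${inner}</${node.tagName}>`;
}

function renderContent(content: Content): string {
  // Pre-formatted HTML is passed through untouched; trees are rendered.
  return typeof content === "string" ? content : renderNode(content);
}
```

The point of the union type is that the renderer, not the parser, decides how each kind of content is displayed.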

Josh-Cena commented 2 years ago

Okay, if the resolution is to let a custom parser return HTML string and render using dangerously set inner HTML then sure :) I'm just not sure how good it is to populate our theme component with all kinds of checks of what a Markdown import potentially returns though.

zepatrik commented 2 years ago

One major problem I am facing right now is that I auto-generate some docs pages from Go code. It is theoretically possible to inject some HTML/JS because of MDX. Therefore, the generated pages are HTML-escaped (replacing < > & ' "). But then, such escaped characters are not rendered as expected in code samples:

We have to admit, this is not easy if you don&#39;t speak jq fluently. What
about opening an issue and telling us what predefined selectors you want to
have? https://github.com/ory/kratos/issues/new/choose

```
kratos identities delete &lt;id-0 [id-1 ...]&gt; [flags]
```

In "standard" markdown there is no need to escape any non-trusted input, but in MDX there is. It would be way safer to say: "this is standard markdown from an untrusted source, don't try to run it as JS" instead of partially escaping stuff where I might miss some edge cases.

Josh-Cena commented 2 years ago

@zepatrik If you want to do post-processing, don't sanitize code in code blocks. Also, you can use a remark plugin to strip imports/exports very easily. Apart from import/exports, MDX can't execute arbitrary code.
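
A minimal sketch of such a plugin, under the assumption that MDX 2 represents import/export statements as `mdxjsEsm` nodes (and MDX 1 as `import`/`export` nodes). The node shape is simplified and `remarkStripEsm` is a hypothetical name. Note that this removes only import/export statements; inline JSX elements and expressions in the content are left untouched.

```typescript
// Minimal unist-like node shape; real mdast nodes carry more fields.
interface TreeNode {
  type: string;
  value?: string;
  children?: TreeNode[];
}

// Assumption: node types used for ESM statements by MDX 2 and MDX 1 respectively.
const ESM_NODE_TYPES = new Set(["mdxjsEsm", "import", "export"]);

// Return a copy of the tree with all ESM nodes removed, at any depth.
function stripEsm(node: TreeNode): TreeNode {
  if (!node.children) {
    return node;
  }
  return {
    ...node,
    children: node.children
      .filter((child) => !ESM_NODE_TYPES.has(child.type))
      .map(stripEsm),
  };
}

// remark plugin shape: a function that returns a tree transformer.
function remarkStripEsm() {
  return (tree: TreeNode): TreeNode => stripEsm(tree);
}
```

It would be registered like any other remark plugin, e.g. through the `remarkPlugins` option.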

zepatrik commented 2 years ago

Apart from import/exports, MDX can't execute arbitrary code.

Can you elaborate on that? I can easily run arbitrary javascript on the MDX playground using e.g.

<div onClick={() => fetch("https://google.com/").then(console.log).catch(console.log)}>Click me!</div>

Of course with that, I could e.g. leak stuff from local storage to one of my servers or do all kinds of things.

timothyerwin commented 2 years ago

what is the status on this? does docusaurus 2 split .md files to another parser? we are getting build errors for md files that work perfectly fine in github.

slorber commented 2 years ago

@timothyerwin all the updates are here, it's not necessary to ask.

Docusaurus is based on MDX, and you have to make sure your docs are compatible. This might require editing some of them, particularly HTML tags so that they conform with JSX.

zhalice2011 commented 1 year ago

I also have the same problem: the people who write the documents are not proficient in React, and the officially provided automatic migration script cannot convert Markdown to MDX format very well.

Is there a way to specify that files with .mdx extension use docusaurus/mdx-loader, while files with .md extension use version 1.0 of the markdown renderer?

Looking forward to your reply.

slorber commented 1 year ago

With the upcoming Docusaurus 3, we upgrade to MDX 2 (https://github.com/facebook/docusaurus/pull/8288), and there's a format: 'md' compiler config that permits us to support CommonMark.

Note: the content is parsed as CommonMark, and it's not possible to use JSX inside that content anymore, but you can start using raw html and inline styles like on GitHub (enabled by https://github.com/facebook/docusaurus/pull/8960), but under the hood, the content is still compiled as a React component. Features such as admonitions, code blocks etc keep working.

If you want early access to these features, use a canary version of Docusaurus and follow what's written in this PR to turn on CommonMark: https://github.com/facebook/docusaurus/pull/8288 (for now just having .md extension is enough, but I might change this for v3)

ntucker commented 1 year ago

Can we have an option to disable commonmark? This is creating a lot of issues when I just want to use React 18.

slorber commented 1 year ago

@ntucker I was going to add a global format: 'mdx' option (and probably make it the default in v3), now there's even more reason to do so ;)

Note: you can use format: 'mdx' frontmatter on each file as a temporary workaround

ntucker commented 1 year ago

Altering every single file when the last edit time is used in the final site for publish time is not exciting to me. However, I'm very glad to hear about upcoming global control!

slorber commented 1 year ago

Note: the new CommonMark mode will probably be marked as experimental in v3.0 and opt-in.

The basic rendering works fine, but it is currently missing some Docusaurus features. Track https://github.com/facebook/docusaurus/issues/9092 to make sure the features you need are supported, and report any missing features you detect.

slorber commented 1 year ago

As part of https://github.com/facebook/docusaurus/pull/9097, Docusaurus v3 will keep using MDX to parse .md files by default, but allow you to opt-in for explicit usage of CommonMark (for your whole site, for .md files, or on a per-file basis)

Limitations: there are some features not working yet with CommonMark, see https://github.com/facebook/docusaurus/issues/9092

cc @ntucker

nickserv commented 11 months ago

If you're coming from the blog and want to opt into CommonMark, use markdown: { format: "detect" } in your global config or format: md in Markdown front matter.
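
As a sketch, the global option would sit in the site config roughly like this (an excerpt only; every other config field is omitted, and the comment about how "detect" behaves reflects my understanding rather than anything stated in this thread):

```typescript
// docusaurus.config.js (excerpt): all other fields are omitted here.
const config = {
  markdown: {
    // As I understand it, "detect" chooses the format from the file
    // extension: .md is parsed as CommonMark, .mdx as MDX.
    format: "detect",
  },
};
```

The per-file alternative is a `format: md` line in that file's front matter.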
