Thoughts on querying Markdown AST for custom rendering

I read some of your comments regarding doing cool things with Markdown source from #420

Query Markdown AST for advanced use cases. E.g. custom footnote rendering.

and from wooorm/mdast#13

So the obvious way for a client to query for markdown data is for the compiled HTML string. But for maximum flexibility I want to give people the option to query the markdown AST directly and build React components directly from the raw markdown AST data. They could specify which components they want to handle lists, paragraphs, headers, etc. This is why it'd be ideal to parse JSX and add the parsed nodes alongside the markdown nodes so the user could just match the Quiz node they put in the markdown with their custom Quiz React component.

This is something I've also been exploring for the last year or so, and I'd like to share some opinionated thoughts, and get feedback.

Essentially, Markdown, as specified, is not set up well for this. All markup languages that exist today draw a line between "content" markup and "data" markup, which is correct imho. Data, by definition, is structured, may have a schema, can have relationships, etc. Content, on the other hand, should remain opaque. The way to bring them together is to embed data in content, not to extract data from content. The distinction is crucial.

For example, again from wooorm/mdast#13

This would also simplify creating custom Markdown React renderers e.g. footnotes that get rendered on a sidebar or as an inline widget which you click on to see the footnote.

Let's explore footnote specification in data and in content:

As content

Some long sentence. [^footnote]

[^footnote]: Test, [Link](https://google.com).

let footnotes = ast.filter(node => node.type === "footnote")
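A flat filter like the one above only sees top-level nodes, while a parsed Markdown AST is usually a tree. Here is a minimal sketch of a recursive version. The node shape ({ type, children }) and the type names footnoteReference and footnoteDefinition are assumptions loosely modeled on mdast, not a confirmed API, and collect is a hypothetical helper:

```javascript
// Hypothetical helper: walk a tree of { type, children? } nodes and
// gather every node whose type matches.
function collect(node, type, found = []) {
  if (node.type === type) found.push(node);
  for (const child of node.children || []) collect(child, type, found);
  return found;
}

// Assumed AST for the "As content" sample (shape is illustrative only).
const ast = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "text", value: "Some long sentence. " },
        { type: "footnoteReference", identifier: "footnote" },
      ],
    },
    {
      type: "footnoteDefinition",
      identifier: "footnote",
      children: [
        { type: "text", value: "Test, " },
        {
          type: "link",
          url: "https://google.com",
          children: [{ type: "text", value: "Link" }],
        },
      ],
    },
  ],
};

const footnotes = collect(ast, "footnoteDefinition");
```

Note that the renderer still has to dig through the definition's children to find the link, which is the kind of extraction the "As data" variant avoids.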

As data


---
footnotes:
    -   name: Test
        altText: Link
        url: https://google.com

---

Some long sentence. [^footnotes.Test]

let footnotes = node.footnotes
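To make the "As data" variant concrete, here is a rough sketch of resolving the [^footnotes.Test] reference against the frontmatter. It assumes the YAML has already been parsed into a plain object and that references are looked up by their name key; findFootnoteReferences is a hypothetical helper, not part of any library:

```javascript
// Assumed result of parsing the YAML frontmatter above.
const frontmatter = {
  footnotes: [{ name: "Test", altText: "Link", url: "https://google.com" }],
};

// Hypothetical helper: return the footnote records referenced in the body,
// in order of appearance. Unknown names are skipped.
function findFootnoteReferences(body, data) {
  const referenced = [];
  for (const match of body.matchAll(/\[\^footnotes\.(\w+)\]/g)) {
    const footnote = data.footnotes.find((f) => f.name === match[1]);
    if (footnote) referenced.push(footnote);
  }
  return referenced;
}

const referenced = findFootnoteReferences(
  "Some long sentence. [^footnotes.Test]",
  frontmatter
);
```

Here the body only points at the data; the structured record never has to be reverse-engineered from formatted text.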

If we were to take it one step further, we would ask the question: "What if everything in Markdown was a React component?". Over the course of my work, this question has evolved to, "What if Markdown was simply a more easily readable/writable syntax of an XML-ish language that was NOT tied to HTML but rendered to any different target with a React-ish component model?"

Let's say this language is called Blub

First, we have the most readable/writable version of Blub


---
number: 3

---

# An Introduction to Blub

    Let's see what we can do if we break free from the shackles of Markdown, and its faith in HTML.

    There are {number} dolphins in the sea [^universalobjectdb@8naj7a6zhnziaoapp]

    <Quiz id="123" />

This introduces the concept of interpolating data into Markdown, the same way we do in React with the curly braces {}
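As a minimal sketch of that interpolation step, assuming the frontmatter is already parsed into an object and that placeholders are simple {word} tokens (interpolate is a hypothetical name):

```javascript
// Replace {key} placeholders with values from the data object.
// Placeholders with no matching key are left untouched.
function interpolate(text, data) {
  return text.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(data, key) ? String(data[key]) : match
  );
}

const sentence = interpolate("There are {number} dolphins in the sea", {
  number: 3,
});
// sentence === "There are 3 dolphins in the sea"
```

Leaving unknown placeholders as-is is a design choice for this sketch; a stricter compiler might raise an error instead.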

Then, the fully qualified, canonical version of that same text in Blub

<Data type="yaml">number: 3</Data>
<Section title="An Introduction to Blub">
    <Paragraph>Let's see what we can do if we break free from the shackles of Markdown, and its faith in HTML.</Paragraph>
    <Paragraph>We have <Reference key="number" /> facts here, and because Internet, we must have a source <Footnote source="universalobjectdb" id="8naj7a6zhnziaoapp" /></Paragraph>
    <Quiz id="123" />
</Section>

While this looks like HTML, it's not! I believe that if we want to bring React and Markdown together, we should start from a new place that does so in a coherent, structured way.

Back to the footnotes example. Hmm, now while it may seem that we've come full circle and we'll need to query the AST to get all instances of the element, the distinction that remains for me is that we're not specifying the exact reference data. And because this is still not HTML, just an intermediary tree that can be represented as XML (above) or as JSON within a running program, we can combine it with custom renderers to do whatever we want. It sort of brings together ideas that already exist in remark-react and reactdown in a more structured way.

render(document, {
    "Section": Renderers.MicrosoftWord.Section,
    "Quiz": AwesomeQuizRenderer,
    "Footnote": FantasmagoricalInlineFootnoteRendererWithPopup
})
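For illustration, here is one way such a render function could work. To keep the sketch runnable without React, renderers here are plain functions that return strings, and the tree shape ({ type, props, children }) is an assumption, not something the language above specifies:

```javascript
// Render a tree of { type, props, children } nodes bottom-up.
// Children are rendered first, then handed to the renderer registered
// for the node's type. Unknown types fall back to their rendered children.
function render(node, renderers) {
  if (typeof node === "string") return node;
  const children = (node.children || [])
    .map((child) => render(child, renderers))
    .join("");
  const renderer = renderers[node.type];
  return renderer ? renderer({ ...node.props, children }) : children;
}

// Assumed in-memory form of a small document.
const doc = {
  type: "Section",
  props: { title: "An Introduction to Blub" },
  children: [
    { type: "Paragraph", props: {}, children: ["Hello"] },
    { type: "Quiz", props: { id: "123" }, children: [] },
  ],
};

const html = render(doc, {
  Section: ({ title, children }) =>
    `<section><h1>${title}</h1>${children}</section>`,
  Paragraph: ({ children }) => `<p>${children}</p>`,
  Quiz: ({ id }) => `<div class="quiz" data-id="${id}"></div>`,
});
```

Swapping the renderer map is all it takes to target something other than HTML, since the tree itself carries no HTML assumptions.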

You could also extend the minimal language itself by registering block or inline delimiters:

[
    {
        type: INLINE,
        element: "Reference",
        beginDelimiter: "{",
        endDelimiter: "}",
        parser: function() { }
    },
    {
        type: BLOCK,
        element: "GlobbyGook",
        beginDelimiter: "%%%",
        endDelimiter: "%%%",
    }
]

So now, we can do the following:

%%%
This is some {variable}
%%%

Which compiles to

<GlobbyGook>This is some <Reference key="variable" /></GlobbyGook>
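A rough sketch of how a compiler could consume such a delimiter registry. It uses regular expressions rather than a real parser, ignores the optional parser hook, and assumes block delimiters sit on their own lines; compile and escapeRegExp are hypothetical names:

```javascript
const INLINE = "inline";
const BLOCK = "block";

const rules = [
  { type: INLINE, element: "Reference", beginDelimiter: "{", endDelimiter: "}" },
  { type: BLOCK, element: "GlobbyGook", beginDelimiter: "%%%", endDelimiter: "%%%" },
];

// Escape characters that have a special meaning inside a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function compile(source, registry) {
  let output = source;
  // Blocks first: a delimiter line, a body, and a closing delimiter line.
  for (const rule of registry.filter((r) => r.type === BLOCK)) {
    const begin = escapeRegExp(rule.beginDelimiter);
    const end = escapeRegExp(rule.endDelimiter);
    const pattern = new RegExp(`^${begin}\\n([\\s\\S]*?)\\n${end}$`, "gm");
    output = output.replace(
      pattern,
      (_, body) => `<${rule.element}>${body}</${rule.element}>`
    );
  }
  // Then inline delimiters wrapping a single identifier.
  for (const rule of registry.filter((r) => r.type === INLINE)) {
    const begin = escapeRegExp(rule.beginDelimiter);
    const end = escapeRegExp(rule.endDelimiter);
    const pattern = new RegExp(`${begin}(\\w+)${end}`, "g");
    output = output.replace(
      pattern,
      (_, key) => `<${rule.element} key="${key}" />`
    );
  }
  return output;
}

const compiled = compile("%%%\nThis is some {variable}\n%%%", rules);
```

A regex pass like this would not handle nesting or escaping; it is only meant to show that the registry carries enough information to drive the transformation.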

Actually, a big reason I'm writing all of this is that you're going for similar outcomes, AND you have a successful project where people are using similar technologies. A large part of my motivation is to get feedback/thoughts from you. So, what do you think?

TLDR There should be a more structured way to express data in content or content in data, with the writability/readability of Markdown. Do we even draw a line between the two (content vs data)?

PS I'm also thinking the other way, "What if all content is really just semantic data?", where lists specified with - in Markdown can be named and become available as node.someList. This is a lot harder to reason about though, and I'm not as sure whether it'll work, but I'm exploring nonetheless. If you think this would be a better direction, I'd love to get your thoughts even more so!

PPS A lot of this is building towards the universalobjectdb for me: essentially Freebase + Google's Knowledge Graph + Git + React + GraphQL

@andreypopp - this is similar to what you're doing with reactdown. would love to get your thoughts!

Huh, this is super interesting. So basically what you're saying is "screw just embedding react components in Markdown, let's build a new compile-to-jsx language (based on Markdown)"? Treating Markdown as not just a nicer way to render HTML but really a nicer way to write JSX is a fine goal.

Or... how much do you agree with what I'm thinking? I don't quite get your content vs. data distinction (they seem about the same to me in this context). For me, I want most anything not actually Javascript out of Gatsby and handle that in GraphQL so there's maximum flexibility. Webpack is great but it's too rigid and one directional (file => module). Which is perfect when a file can be treated as an atomic thing but if you want to query parts of that file or programmatically manipulate in some way from the client it falls short.

E.g. an interesting and awesome thing we can do by querying the markdown AST is auto-convert images in Markdown to a smarter GraphQL image type something like:

{
  image(width: 400) {
    src
    retinaSrc
    preview // returns base64 encoded ~20px wide version
  }
}

The Image component on the frontend would then only load the actual image if it came into the viewport.

Very similar to what Facebook does here: https://code.facebook.com/posts/991252547593574/the-technology-behind-preview-photos/ Or what Medium.com does.
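As a small sketch of the decision such an Image component would make, assuming the component already knows whether it is in the viewport (for example from a scroll or intersection listener, not shown here). chooseSource is a hypothetical helper and the field names follow the query above:

```javascript
// Pick which URL to put in the img src, given the queried image fields.
// Outside the viewport we only show the tiny inline preview.
function chooseSource(image, { inViewport, devicePixelRatio = 1 }) {
  if (!inViewport) return image.preview;
  return devicePixelRatio > 1 && image.retinaSrc ? image.retinaSrc : image.src;
}

// Placeholder values standing in for a real query result.
const image = {
  src: "shuttle-400.jpg",
  retinaSrc: "shuttle-800.jpg",
  preview: "data:image/jpeg;base64,PLACEHOLDER",
};

const offscreen = chooseSource(image, { inViewport: false });
const onscreen = chooseSource(image, { inViewport: true, devicePixelRatio: 2 });
```

Keeping the choice in a pure function makes it easy to test separately from whatever mechanism detects visibility.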

Yup, that's pretty much exactly what I'm saying, and I completely agree with what you're saying, and would absolutely love to see it happen too, along with the previous footnote example, and much more.

To make sure I've understood the image example, I'm going to try a simple exercise:

Markdown source document

Something funny about cars, planes, and trains

![Space Shuttle](https://upload.wikimedia.org/wikipedia/commons/d/d6/STS120LaunchHiRes-edit1.jpg)

Parsed AST

[
  {
    nodeType: "paragraph",
    children: [
      {
        nodeType: "text",
        value: "Something funny about cars, planes, and trains"
      }
    ]
  }, {
    nodeType: "image",
    altText: "Space Shuttle",
    src: "https://upload.wikimedia.org/wikipedia/commons/d/d6/STS120LaunchHiRes-edit1.jpg"
  }
]

With the GraphQL model, I'm guessing that in the parent page-level renderer, you'd do something like so:

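import React from "react"
import get from "lodash/get"

export default function Page({ data }) {
  // Pull the queried image node out of the GraphQL result
  const image = get(data, "node.markdownAST.image")
  return (
    <div>
      {/* all the markdown before this */}
      <img src={image.preview} data-lazy-load={image.src} />
      {/* the rest of the page */}
    </div>
  )
}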

export const routeQuery = `
  {
    node {
      frontmatter
      markdownAST {
        image {
          preview
          src
        }
      }
    }
  }
`

With the Markdown-as-a-nicer-way-to-write-JSX, the original document gets compiled to:

<Paragraph>Something funny about cars, planes, and trains</Paragraph>
<Image altText="Space Shuttle" src="https://upload.wikimedia.org/wikipedia/commons/d/d6/STS120LaunchHiRes-edit1.jpg" />

And the Image Renderer could look something like this:

export default function Image({ altText, src }) {
  const preview = getPreviewOfImageSomehowPossibly(src)
  return (
    <img src={preview} data-lazy-load={src} alt={altText} />
  )
}

Or another way:

export default function Image({ altText, src, data }) {
  return (
    <img src={data.image.preview} data-lazy-load={src} alt={altText} />
  )
}

export function graphQuery({ src }) { // receives original props
  return `
    {
      image(src: "${src}") {
        preview
      }
    }
  `
}

This is somewhat where the content vs data dichotomy comes in, because content in its current incarnation is naturally completely opaque and unstructured; it's not data. By bringing GraphQL into the mix to query the AST, we're treating it as data. I'm not so sure that's something that should be done, imho, because it becomes really difficult to reason about what goes where for the writer and the developer.

To expand, if the content AST were to be treated as data that can be queried and manipulated, why stop at just images? Why not embed all kinds of data into the AST and query it? Why not use Markdown lists instead of YAML lists and JSON arrays? This is what I was talking about in the PS section of the first comment. I can't reason about all data being representable in content, which is why I'm not as sure about it, and prefer the React component model. I'm still interested in continuing to explore down the path of data in content, because it's so out there and could be fruitful, but so far, I've found it difficult to mentally model an easy-to-use syntax this way.

On the other hand, the opposite way is easier to reason about, where content is embedded in data. YAML already does this, where you can have multiline strings, so you can essentially have Markdown or any other content markup language embedded in different keys of a file, not just the main body of a file (as it is with Markdown).

out of Gatsby and handle that in GraphQL so there's maximum flexibility. Webpack is great but it's too rigid and one directional (file => module). Which is perfect when a file can be treated as an atomic thing but if you want to query parts of that file or programmatically manipulate in some way from the client it falls short.

Completely agree with you there, which is why I'm really excited about the source plugins for 1.0, and why I made fsdb, and catalyst as well. GraphQL is the perfect abstraction for this kind of stuff. In fact, since the crux of every API is really its data model, I can totally imagine GraphQL taking over the world!

Very similar to what Facebook does here: https://code.facebook.com/posts/991252547593574/the-technology-behind-preview-photos/ Or what Medium.com does.

On a total side note, I loved that technique so much that I made a small library that would modify webpages on the server, embed a blurred base64 encoded image in the HTML, and serve the original on page-load with its own lazy loader of lazyload.js 😄

So I spent a bit of my morning thinking about this again, and a great example I want to try is a resume. I'm going to do a pure content (Markdown) based approach, a pure data (yaml) based approach, and then try to do a data-in-content approach (random syntax), to push this a little further.

Let's say the job is to render a pill-shaped status bar (like GitHub's languages-in-the-repo bar), except showing years spent on each category (work, education, etc.).

Content

# Bugs Bunny's Resume

## Education

### Quacks-a-Lot University
**Bachelor of Applied Elmer Sciences**
*From last week to present*

## Work Experience

### UberCarrot
**Head of Consumption**
*From birth to death*

Certainly, this is very simple to write for most people; there's no need to reason too much about structure, just formatting. Querying the resulting AST however becomes more difficult and error-prone. There's no standardized way to represent a resume in Markdown, so until someone tells me that a triple heading is the name of the organization, I have no idea what I'm looking for, so even if it's doable, it isn't portable. The markdown and the renderer are forever tied together.
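To show how much convention such a query bakes in, here is a sketch of extracting resume items from a simplified, flattened AST. The node shape and the extractResume helper are hypothetical, and every branch encodes an unwritten rule (depth 2 is a category, depth 3 is an organization, bold is a title, italics is a time span):

```javascript
// Simplified, flattened stand-in for the parsed Markdown above.
const resumeAst = [
  { type: "heading", depth: 1, text: "Bugs Bunny's Resume" },
  { type: "heading", depth: 2, text: "Education" },
  { type: "heading", depth: 3, text: "Quacks-a-Lot University" },
  { type: "strong", text: "Bachelor of Applied Elmer Sciences" },
  { type: "emphasis", text: "From last week to present" },
  { type: "heading", depth: 2, text: "Work Experience" },
  { type: "heading", depth: 3, text: "UberCarrot" },
  { type: "strong", text: "Head of Consumption" },
  { type: "emphasis", text: "From birth to death" },
];

// Group items under their category by relying purely on formatting conventions.
function extractResume(nodes) {
  const sections = {};
  let section = null;
  let item = null;
  for (const node of nodes) {
    if (node.type === "heading" && node.depth === 2) {
      section = sections[node.text] = [];
      item = null;
    } else if (node.type === "heading" && node.depth === 3 && section) {
      item = { org: node.text };
      section.push(item);
    } else if (node.type === "strong" && item) {
      item.title = node.text;
    } else if (node.type === "emphasis" && item) {
      item.time = node.text;
    }
  }
  return sections;
}

const extracted = extractResume(resumeAst);
```

If a writer bolds the organization instead of the title, this silently produces wrong data, and the time span is still free text that cannot be turned into a bar width.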

Data

type: resume
education:
  - place: Quacks-a-Lot University
    degree: Bachelor of Applied Elmer Sciences
    start: 07/09/2016
    isCurrent: true
work:
  - org: UberCarrot
    position: Head of Consumption
    start: 27/07/1940
    isCurrent: true

This becomes super easy to reason about and render. It's just data, that any function can take as input, and produce HTML or text or whatever else we're targeting.
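For example, the segments of the status bar can be computed directly from the parsed data. This sketch assumes the dates are DD/MM/YYYY, that items which are not current carry an end field (not present in the sample above), and that the YAML is already parsed; barSegments and yearsSpent are hypothetical names:

```javascript
const MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000;

// Parse a DD/MM/YYYY string into a UTC date (assumed date format).
function parseDate(text) {
  const [day, month, year] = text.split("/").map(Number);
  return new Date(Date.UTC(year, month - 1, day));
}

// Sum the years covered by a list of items, up to `now` for current ones.
function yearsSpent(items, now) {
  return items.reduce((total, item) => {
    const end = item.isCurrent ? now : parseDate(item.end);
    return total + (end - parseDate(item.start)) / MS_PER_YEAR;
  }, 0);
}

// Turn the resume into one segment per category with its share of the bar.
function barSegments(resume, now = new Date()) {
  const totals = {
    education: yearsSpent(resume.education, now),
    work: yearsSpent(resume.work, now),
  };
  const sum = totals.education + totals.work;
  return Object.entries(totals).map(([category, years]) => ({
    category,
    years,
    percent: sum === 0 ? 0 : (years / sum) * 100,
  }));
}

// Assumed result of parsing the YAML above.
const resume = {
  type: "resume",
  education: [{ place: "Quacks-a-Lot University", start: "07/09/2016", isCurrent: true }],
  work: [{ org: "UberCarrot", start: "27/07/1940", isCurrent: true }],
};

const segments = barSegments(resume, new Date(Date.UTC(2017, 8, 7)));
```

Passing now explicitly keeps the function deterministic, which makes it easier to test than reading the clock inside.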

Data in Content

Now this is the interesting piece, and the reason for me to write this all up. Let's see if we can semantically place structured data in content, for easy extraction at a later point. Bear in mind, I'm articulating this pretty much for the first time for someone else's consumption (i.e. coherently) as I type, so you're getting a stream of thought.

<ResumeItem>
### UberCarrot
**Head of Consumption**
*From birth to death*
</ResumeItem>

This could be a reasonable first step - wrap an atomic piece of data in a portable wrapper. We could use any kind of syntax, doesn't have to be XML-ish. Let's make it simpler for the sake of writability/readability, and assume that from now on the ^^^ character sequence delineates a resume block.

^^^
### UberCarrot
**Head of Consumption**
*From birth to death*
^^^

This approach will probably break down at some point, because we'll either run out of easily memorable sequences of characters to delineate data structures, or documents will be so littered with esoteric sequences that people won't bother using it. Let's say we take it the other way, and structure the data a little more.

<ResumeItem org="UberCarrot" position="Head of Consumption" time="From birth to death" />

The problem still remains that we'll need a new XML tag for every kind of data structure we want to represent. Unbounded, this could easily become a full-blown XML/HTML document, exactly the thing we're trying to replace. And besides, there have been countless more languages invented to represent data better than XML! In fact, imho, XML is the absolute zenith of a data-represented-as-content syntax language.

Which is precisely why I draw the line between data and content with the following question: If it needs to be queried, manipulated, or structurally understood in any way, it's data. If you just need to render it, especially if you're rendering complex/unique things written declaratively, it's content. One is opaque, the other is not. For me, a data language is the ultimate, root-level container, where some keys within a data file don't just contain plain strings, but complex markup.

Ok, I'm getting where you're coming from now. For me, it still doesn't help much to draw the distinction. I think what you're seeing as opaque is free-form text, since text is hard to slice & dice much further (though NLP could help, which would be an interesting GraphQL schema to expose tbh; Jekyll will use latent semantic indexing to show you related posts). So "data", e.g. lists or numbers or whatever, appears more granular, but I'd say that is only a perception. Even a number could hold worlds of meaning. At some point while querying you'll always reach the limit of "data" and get stuck at an opaque, complex, unique "content" thing.

All syntaxes, from the Roman alphabet to hieroglyphs to programming languages to Markdown, etc., are for capturing mental models, and all make different trade-offs on the ease and fidelity with which you can capture different types of mental models.

Markdown is optimized for easy writing of long-form content. It has baked in all sorts of assumptions about what you're doing. It doesn't ever let you get outside of writing long-form text. So use it for anything else and it stops making sense. So I don't see Markdown as a better JSX. If you want to write complex nested documents with embedded logic and state, just use JSX! That's what it's designed for! If you want to write an essay, use Markdown. If you want to store data for a resume, use YAML (or, unless I needed to reuse it in multiple places, I'd just put it straight in a component).

My intention with querying the Markdown AST w/ matching React.js stuff is to merely enhance the Markdown => web conversion a bit. E.g. auto-resize images and make them responsive & progressive. Make footnotes nice. Replace the normal link with the Link component from React Router, etc.

Sorry, it's been a while, and I haven't gotten a chance to put things down. I'd like to share some progress on the aforementioned library with you when I have it, till then I'm going to close this issue. I hope querying Markdown's AST works well for you!

Awesome! Next thing I'm working on is being able to add custom GraphQL types to your site schema so looking forward to seeing your library in Gatsby!

gatsbyjs / gatsby

Thoughts on querying Markdown AST for custom rendering #444