Design: Markdown in `label`/`hint` (and `TextRange` engine/client API generally)

This issue is intended to support #62, and will cover both:

- The engine/client interface approach
- The specific approach to implementation in the engine

These don't necessarily need to be coupled, but coupling them makes quite a bit of sense for a first pass.

First, I'll list some assumptions about requirements for the feature.

Then in the spirit of including multiple design options to choose from, I'll discuss these options:

- Port Collect's implementation directly
- Use an established parser, produce structured data suitable for a client `h` function

Assumptions/requirements

"Markdown" is a convenient shorthand for our use case, but has a broader (and more varied) meaning than we intend to support.
Our support will be intentionally limited to a subset of common Markdown features and syntax. Specifically, as understood from reading Collect's source implementation, that subset will (initially) be:
- Headings—#, ##, ###[, ...]—up to some depth¹
- Paragraphs—non-blank lines separated by two or more line breaks²
- Links—[link text](url)
- Bold—**bold text**, __alternative syntax__³
- Italics—_italic text_, *alternative syntax*³
- Limited support for styling—<span style="...">, supporting either/both of color and font-face style properties.
  - Any other styles specified by a form will be omitted.
  - Any other attributes specified on a <span> will be omitted.
- Any other HTML tags and attributes will be escaped.
¹ Open question: what depth? My instinct is that limiting the depth to 3 is good for usability (for form designers and end users alike). I don't see any limitation in the Collect source. On one hand, introducing a limitation in Web Forms would technically break consistency with Collect. On the other hand, Collect may produce <h7>... which is not valid HTML! We will break consistency no matter what, it's a question of where we draw the line.

² While I don't see support for single-line breaks in Collect, it's worth considering whether we want to support this as well. Fussiness around this functionality is, however, one of the most common gripes about many Markdown implementations. If we do support it, we'll need to decide what its syntax requirements should be—and we'll probably want to do so early, and keep it stable from there.

³ These "alternative syntax" cases are a part of the original Markdown implementation, inherited by most specs and implementations. Some Markdown-like formats (like Slack's) diverge. Nevertheless, Collect's implementation supports both syntax options for both emphasis cases. So it is assumed that we will too. But we may want to analyze available real-world forms and see if it's reasonable to support something more like Slack's variation (which I think most users find easier to understand).
Our Markdown implementation must account for the possible presence of <output>s in form definitions:
- Outputs must be computed and interpolated in our formatted/Markdown representation just as they would in our plain text representation
- Output computations must be treated as plain text: if a computation happens to produce characters with Markdown formatting implications, we will escape those characters rather than formatting them.
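
To illustrate the second requirement, here is a minimal sketch of escaping a computed output value before it is interpolated into Markdown source. The names `escapeMarkdownText`, `LabelChunk` and `interpolate` are hypothetical, and the exact set of escaped characters is an assumption; the real set would depend on the parser and Markdown subset we settle on.

```typescript
// Hypothetical helper: backslash-escape characters that carry meaning in the
// Markdown subset described above, plus `<`, `>` and `&` so that output values
// cannot introduce HTML. The character set is an assumption for illustration.
const MARKDOWN_SIGNIFICANT = /[\\`*_#\[\]()<>&]/g;

function escapeMarkdownText(outputValue: string): string {
  return outputValue.replace(MARKDOWN_SIGNIFICANT, (char) => `\\${char}`);
}

// Hypothetical model of a label: literal Markdown chunks written by the form
// designer, interleaved with computed output values.
type LabelChunk =
  | { kind: 'markdown'; text: string }
  | { kind: 'output'; value: string };

function interpolate(chunks: LabelChunk[]): string {
  return chunks
    .map((chunk) =>
      chunk.kind === 'markdown' ? chunk.text : escapeMarkdownText(chunk.value)
    )
    .join('');
}

const label = interpolate([
  { kind: 'markdown', text: '**Total:** ' },
  { kind: 'output', value: '3 * 4 = 12' },
]);
// label is the string: **Total:** 3 \* 4 = 12
```

With a structured format (as in option 1 below), the engine could instead keep output values out of the Markdown source entirely and attach them as plain string children, which would make this kind of escaping unnecessary.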

Option 0: Port from Collect

I'm labeling this "option 0" because it's about as close as we're going to get to a "null option". This option has some implications:

We'll inherit all of the quirks of the Collect implementation. This is good for consistency, but might have some drawbacks in terms of aligning with more conventional Markdown implementations and their expected behavior.
The handling of <output> is separate from that implementation, and will require some special consideration.
Clients must use and trust arbitrary HTML from the engine. This has more specific implications for:
- safety: any flaw (e.g. XSS) potentially affects all clients supporting formatted text
- flexibility: clients must do extra work to re-parse and re-process the formatted HTML to do anything other than render it exactly as produced by the engine
- performance: any client which might benefit from fine-grained updates will lose that capability for Markdown-formatted text from the engine

From a client perspective, this option would be consumed as:

interface TextRange {
   /* ... */
-  get formatted(): unknown;
+  get formatted(): string; // Arbitrary transformed blob of HTML
}

Option 1: Established parser, structured format, `h`

Some clarification of `h`

We've discussed this in some chats/meetings, but I think detailing it here is a good opportunity to make the thinking behind this option clear for posterity, and to provide a potential reference point for hypothetical future clients on other platforms.

The so-called `h` (or "hyperscript") function is a semi-formalism of the observation that programmatic generation of structured markup tends to follow a common pattern: `h(elementName, properties, ...children)` (though the signature can vary by implementation). This concept is used in some form, to varying degrees, by nearly all of the currently popular web frameworks: those where authoring is done in vanilla JS, compile-to-JS syntax extensions like JSX, and many other compile-to-JS languages. It's even used by, or compatible with, many non-web UI solutions for other platforms.

It is effectively the underlying concept behind nearly all JSX implementations (including Vue's, React's, Preact's, and Solid's when used without its custom `dom-expressions` transform). It is also the underlying runtime concept used internally by the more idiomatic Vue SFC template language.
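
As a rough illustration of the pattern, here is a minimal, framework-agnostic `h` sketch. The `VNode` shape is an assumption for illustration only; each real framework defines its own node type and its own variation on the signature.

```typescript
// A hypothetical, framework-agnostic node shape.
interface VNode {
  elementName: string;
  properties: Record<string, unknown>;
  children: Array<VNode | string>;
}

// The common hyperscript signature: h(elementName, properties, ...children)
function h(
  elementName: string,
  properties: Record<string, unknown>,
  ...children: Array<VNode | string>
): VNode {
  return { elementName, properties, children };
}

// Roughly the call structure that JSX like
// `<p>Hello, <strong>world</strong></p>` corresponds to.
const paragraph = h('p', {}, 'Hello, ', h('strong', {}, 'world'));
```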

This option would entail processing Markdown with an established parser of our choosing.

Which parser?

Based on my research and a fairly thorough prototype of this proposal, I think mdast-util-from-markdown is an excellent candidate. This parses Markdown into an AST, with the same parser used by:

remark, popular in projects with composable/customizable Markdown use cases
MDX, popular in projects which intermix arbitrary components in Markdown
Astro
Next.js

[... snip ...] This list could go on and on.

It's also worth considering some other parsers. As long as we're not migrating our XPath parsing off tree-sitter, tree-sitter is a valid option for Markdown as well (likely at the cost of page weight). Some other JS-based Markdown parsers at least plausibly claim to be faster, but in my experience they will have greater integration challenges.

Whichever parser we choose, we'd have a Markdown processing pipeline that looks roughly like:

parse(markdownText) -> AST, where the parser-produced AST is likely broader than the Markdown subset we'll support
walk(AST) -> StructuredFormat, where we map the supported aspects of the parser-produced AST to our own Markdown-subset representation; for unsupported Markdown functionality, we'd map the node back to its corresponding raw source text (thus achieving our Markdown subset)
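
The `walk` step could be sketched as follows. To stay self-contained, this uses a simplified, hand-built stand-in for the parser's AST (`AstNode`, with flat `start`/`end` source offsets) rather than a real parser's node types; the node type names and the `SUPPORTED` mapping are assumptions for illustration, and headings, links and styling are left out for brevity.

```typescript
// Simplified stand-in for a parser-produced AST node (hypothetical shape).
interface AstNode {
  type: string; // e.g. 'paragraph', 'strong', 'text', 'inlineCode'
  value?: string; // set on text-like nodes
  children?: AstNode[];
  start: number; // offset of the node's first character in the source
  end: number; // offset just past the node's last character
}

interface MarkdownElement {
  elementName: string;
  properties: Record<string, unknown>;
  children: MarkdownChild[];
}

type MarkdownChild = MarkdownElement | string;

// The subset of AST node types we map to elements (assumed, incomplete).
const SUPPORTED: Record<string, string> = {
  paragraph: 'p',
  strong: 'strong',
  emphasis: 'em',
};

function walk(node: AstNode, source: string): MarkdownChild {
  if (node.type === 'text') {
    return node.value ?? '';
  }

  const elementName = SUPPORTED[node.type];

  if (elementName == null) {
    // Unsupported syntax: fall back to the raw source text it was parsed from.
    return source.slice(node.start, node.end);
  }

  return {
    elementName,
    properties: {},
    children: (node.children ?? []).map((child) => walk(child, source)),
  };
}

// Inline code is outside our subset, so it round-trips as raw text.
const source = 'Hi **there** `code`';
const ast: AstNode = {
  type: 'paragraph',
  start: 0,
  end: 19,
  children: [
    { type: 'text', value: 'Hi ', start: 0, end: 3 },
    {
      type: 'strong',
      start: 3,
      end: 12,
      children: [{ type: 'text', value: 'there', start: 5, end: 10 }],
    },
    { type: 'text', value: ' ', start: 12, end: 13 },
    { type: 'inlineCode', value: 'code', start: 13, end: 19 },
  ],
};

const result = walk(ast, source);
```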

Structured format

The format structure I'd propose would roughly resemble a very simple, minimal "VNode" (as in "virtual DOM node") tree of elements. We can choose an interface specifically suitable for a particular client framework (i.e. Vue). Or we can choose a more general structure of our own design, which would impose a small amount of mapping duty on all clients. I don't feel very strongly about either; they both have their benefits and drawbacks.

This is not intended to be prescriptive about the structure, but it captures the essential concept:

interface MarkdownElement {
  elementName: string;
  properties: Record<string, unknown>;
  children: MarkdownChild[];
}

type MarkdownChild = MarkdownElement | string;

However, this is more general than necessary. We know we will support a very specific subset of Markdown, so we can be more detailed about what that subset will look like for clients:

Detailed element interface examples

```ts
interface MarkdownHeadingElement {
  elementName: 'h1' | 'h2' | 'h3' /* | ...? */;
  properties: EmptyObject; // Assume such a type exists 🙃; or: `{ lang: string }`
  children: [string]; // Consistent with Collect
}

interface MarkdownParagraphElement {
  elementName: 'p';
  properties: EmptyObject; // Or: `{ lang: string }`
  children: MarkdownInlineChild[];
}

type MarkdownBlockElement =
  | MarkdownHeadingElement
  | MarkdownParagraphElement;

interface MarkdownOutputElement {
  // Note: clients can choose to produce an `<output>` in HTML, or just unwrap its string value.
  elementName: 'output';

  // Note: while XForms and HTML `<output>` are semantically similar, XForms' `value` attribute
  // doesn't map very well to HTML's `for` attribute.
  properties: EmptyObject;
  children: [string];
}

interface MarkdownStyledElement {
  elementName: 'span';
  properties: {
    style: {
      color?: string;
      'font-face'?: string;
    };
  };
  children: MarkdownInlineChild[];
}

interface MarkdownEmphasisElement {
  elementName: 'em' | 'strong';
  properties: EmptyObject;
  children: MarkdownInlineChild[];
}

interface MarkdownLinkElement {
  elementName: 'a';
  properties: {
    href: string;
    // Maybe also: `target: '_blank';`
  };
  children: MarkdownInlineChild[];
}

type MarkdownInlineChild =
  | MarkdownOutputElement
  | MarkdownStyledElement
  | MarkdownEmphasisElement
  | MarkdownLinkElement
  | string;
```

This would be consumed by clients as:

interface TextRange {
   /* ... */
-  get formatted(): unknown;
+  get formatted(): MarkdownElement[]; // Or MarkdownBlockElement[] from the more detailed examples
}
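
On the client side, consuming this could look something like the sketch below: a generic `render` that maps the structured format onto whatever hyperscript-style function the client's framework provides. The `H<T>` type, `render`, and the toy `debugH` function are all hypothetical names; a real client would pass its framework's own `h` (possibly through a thin adapter).

```typescript
interface MarkdownElement {
  elementName: string;
  properties: Record<string, unknown>;
  children: MarkdownChild[];
}

type MarkdownChild = MarkdownElement | string;

// Any hyperscript-style function, producing the client's own node type `T`.
type H<T> = (
  elementName: string,
  properties: Record<string, unknown>,
  ...children: Array<T | string>
) => T;

// Recursively map the engine's structured format onto the client's `h`.
function render<T>(nodes: MarkdownChild[], h: H<T>): Array<T | string> {
  return nodes.map((node) =>
    typeof node === 'string'
      ? node
      : h(node.elementName, node.properties, ...render(node.children, h))
  );
}

// A toy `h` that produces a debug string, standing in for a real framework.
const debugH: H<string> = (elementName, _properties, ...children) =>
  `${elementName}(${children.join(',')})`;

const formatted: MarkdownElement[] = [
  {
    elementName: 'p',
    properties: {},
    children: ['a', { elementName: 'em', properties: {}, children: ['b'] }],
  },
];

const rendered = render(formatted, debugH);
// rendered is ['p(a,em(b))']
```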

Advantages of this approach

We're not responsible for parsing Markdown. This isn't core to our functionality, and we benefit from the hardening of a mature solution with widespread usage. An obvious example of a concern in the Collect implementation: we can be sure that whitespace around _ is handled in a predictable way that will almost certainly match user expectations.
Relatively trivial and low risk to evolve. We can add support for other styles with a whitelist, introduce support for single line breaks at a later date, add support for nested formatting in e.g. headings, ...
Client flexibility.
- Some clients may want a stricter Markdown subset than the engine produces. An obvious example might be limiting the colors a form can use.
- Because the data is structured, clients could also adjust certain colors to support features like dark mode.
- The Collect solution controls where links open; a structured format allows clients to determine that, or to easily override an engine-produced default. This is compelling especially if we anticipate optionally supporting rendering forms in an <iframe>... or in a native app's embedded web view... or...
Better performance. We can update subsets of a structured format independently, e.g. just the portion representing an <output>, or just the jr:itext().
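
As one example of the client flexibility described above, a client wanting a stricter subset could post-process the structured format before rendering it. This is only a sketch: `restrictColors` and the `ALLOWED_COLORS` set are hypothetical, and the `style` property shape follows the general element interface rather than any finalized API.

```typescript
interface MarkdownElement {
  elementName: string;
  properties: Record<string, unknown>;
  children: MarkdownChild[];
}

type MarkdownChild = MarkdownElement | string;

// Hypothetical client policy: only these colors may be used by a form.
const ALLOWED_COLORS = new Set(['red', 'green', 'blue']);

// Returns a copy of the tree in which any disallowed `color` is dropped;
// all other styles and all text content are preserved.
function restrictColors(node: MarkdownChild): MarkdownChild {
  if (typeof node === 'string') {
    return node;
  }

  const children = node.children.map((child) => restrictColors(child));

  if (node.elementName !== 'span') {
    return { ...node, children };
  }

  const style = {
    ...(node.properties.style as Record<string, string> | undefined),
  };

  if (style.color != null && !ALLOWED_COLORS.has(style.color)) {
    delete style.color;
  }

  return { ...node, properties: { ...node.properties, style }, children };
}

const restricted = restrictColors({
  elementName: 'span',
  properties: { style: { color: 'hotpink', 'font-face': 'serif' } },
  children: ['styled text'],
});
```

The same traversal could remap colors for dark mode instead of dropping them.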

Option 1b: option 1, but apply subset of Markdown in clients

This would be basically the same as option 1, except clients would have:

greater flexibility to determine what subset of Markdown is supported
greater burden to handle the subsetting logic

Option 1c: option 1 (or 1b) + HTML serialization in the engine

While I want to discourage producing and consuming arbitrary blobs of HTML, I do recognize that it has some appealing conveniences for some use cases. We can consider extending option 1 to include both the structured format and an HTML serialization of it. For a client, this would look like:


interface TextRange {
   /* ... */
-  get formatted(): unknown;
+  get formatted(): MarkdownElement[]; // Or MarkdownBlockElement[]
+  get asHTML(): string; // Consider: `unsafe_asHTML` or some other discouraging name
   get asString(): string;
}
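
A sketch of how the engine might derive the HTML serialization from the structured format, so that both representations stay consistent. The function names `escapeHTML` and `toHTML` are hypothetical, and as a simplification this only serializes string-valued properties (such as `href`); object-valued properties like `style` are skipped and would need their own handling.

```typescript
interface MarkdownElement {
  elementName: string;
  properties: Record<string, unknown>;
  children: MarkdownChild[];
}

type MarkdownChild = MarkdownElement | string;

// Escape text for use in HTML text content and double-quoted attributes.
function escapeHTML(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function toHTML(node: MarkdownChild): string {
  if (typeof node === 'string') {
    return escapeHTML(node);
  }

  // Simplification: only string-valued properties become attributes.
  const attributes = Object.entries(node.properties)
    .filter((entry): entry is [string, string] => typeof entry[1] === 'string')
    .map(([name, value]) => ` ${name}="${escapeHTML(value)}"`)
    .join('');
  const children = node.children.map((child) => toHTML(child)).join('');

  return `<${node.elementName}${attributes}>${children}</${node.elementName}>`;
}

const html = toHTML({
  elementName: 'p',
  properties: {},
  children: [
    '1 < 2 ',
    {
      elementName: 'a',
      properties: { href: 'https://example.com/?a=1&b=2' },
      children: ['link'],
    },
  ],
});
// html is: <p>1 &lt; 2 <a href="https://example.com/?a=1&amp;b=2">link</a></p>
```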
