Open tgross35 opened 1 year ago
GitHub also added support for Mermaid diagrams somewhat recently. This illustrates the need for a CommonMark specification to guide further implementations.
This point to me precisely illustrates that no syntax extension is needed. GH uses code (with backticks). No new dollar support is needed.
CM already allows:
````markdown
```math
\frac{1}{2}
```
````
why is an additional:
````markdown
$$$math
\frac{1}{2}
$$$
````
needed?
This PR defines "display blocks" that follow the same definition rules as code blocks but are intended to render their content into a display form, rather than a verbatim representation. By default these should process the data as TeX and output MathML, but the info string can be used to change the renderer to something like asciidoc, mermaid, graphviz, or svg.
The current code blocks allow for syntax highlighting but do not specify that a particular syntax highlighting library is implemented. They do this by exposing the info string as a class.
This “display blocks” PR seems to require that every markdown -> HTML compiler implements a particular LaTeX math -> MathML transform. I think this particularly means that every markdown compiler now shows a very big and heavy transform that not every user of markdown might want, TeX math as input which not everyone might want, and MathML output which not everyone might want.
To illustrate, the smallest CM compliant markdown parser that I am aware of is 15kb minzipped. Adding support for roughly this PR with KaTeX adds 75kb minzipped.
MathML is supported on every major browser only as of recently, so providing a math implementation should be somewhat trivial
Do you have an example of how LaTeX -> MathML is trivial?
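Some more Qs:
* What happens for other names, e.g., `$$html`?
* What about stuff after the initial word, e.g., `$$ascii do-things whatever="yep"`?
* You allow SVG in your examples, SVG can be unsafe. As it can contain arbitrary CSS and JS. This introduces an XSS vulnerability assuming authors cannot be trusted
The idea of distinguishing code blocks meant to be displayed as code (possibly highlighted) and code blocks meant to be interpreted (e.g. executed or rendered) was discussed extensively here: https://talk.commonmark.org/t/mermaid-generation-of-diagrams-and-flowcharts-from-text-in-a-similar-manner-as-markdown/1882/1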
GitHub also added support for Mermaid diagrams somewhat recently. This illustrates the need for a CommonMark specification to guide further implementations.
This point to me precisely illustrates that no syntax extension is needed. GH uses code (with backticks). No new dollar support is needed.
CM already allows:
````markdown
```math
\frac{1}{2}
```
````
why is an additional:
````markdown
$$$math
\frac{1}{2}
$$$
````
needed?
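There are a few reasons:
* Currently, there isn't an official way to differentiate between something that should be rendered or something that should be displayed as text. I have to read through all of what @vassudanagunta linked, but I don't believe the `!` was accepted.
* Since inline code doesn't have an info string, there is no way to do inline math
* It seems to be the status quo for markdown tools that require math support, which is pretty widely desired. Libraries that do not support math tend to not have it just because there is no standard (this was the case with e.g. pulldown-cmark I believe)
This “display blocks” PR seems to require that every markdown -> HTML compiler implements a particular LaTeX math -> MathML transform. I think this particularly means that every markdown compiler now shows a very big and heavy transform that not every user of markdown might want, TeX math as input which not everyone might want, and MathML output which not everyone might want.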
To illustrate, the smallest CM compliant markdown parser that I am aware of is 15kb minzipped. Adding support for roughly this PR with KaTeX adds 75kb minzipped.
I think that it is acceptable to say that a parser is allowed to not perform any display block rendering, and instead present display blocks the same as code blocks if they do not support it. Most important is that there is an official choice for how display blocks should be notated so that parsers don't need to make an independent choice.
Supporting JS->MathML should also be more lightweight than KaTeX/MathJax, which need to add styling.
MathML is supported on every major browser only as of recently, so providing a math implementation should be somewhat trivial
Do you have an example of how LaTeX -> MathML is trivial?
Trivial was not the best word here, "well defined" is more accurate. I simply say that because there are already a few libraries to do the conversion:
Some more Qs:
* What happens for other names, e.g., `$$html`?
I think that a renderer could probably treat this the same as inline html, but I don't know whether this should be required (more below)
* What about stuff after the initial word, e.g., `$$ascii do-things whatever="yep"`?
This would follow the guidelines for codeblocks, the renderer can decide what to do with this information
* You allow SVG in your examples, SVG can be unsafe. As it can contain arbitrary CSS and JS. This introduces an XSS vulnerability assuming authors cannot be trusted
Thank you for pointing this out, I will remove it.
I think a general short summary of what I am proposing would be:
"New syntax blocks `$$...$$` and inline `$...$` indicate content that should be somehow rendered. By default, data within these blocks is math; an implementation may use the info string to specify other render modes. If a parser does not support the given render mode (including the default math), it should treat the section as a code block"
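The fallback rule in that summary can be sketched roughly as below. This is only an illustration under stated assumptions: the `RENDERERS` registry, the `render_display_block` function, and the `language-…` class on the fallback are hypothetical names chosen for the example, not part of the PR or of any parser.

```python
import html
from typing import Callable, Dict

# Hypothetical registry: render mode (first word of the info string)
# mapped to a function that turns the block's source into HTML.
RENDERERS: Dict[str, Callable[[str], str]] = {}


def render_display_block(info: str, source: str) -> str:
    """Render a display block, or fall back to a plain code block."""
    words = info.split()
    mode = words[0] if words else "math"  # the proposal's default mode
    renderer = RENDERERS.get(mode)
    if renderer is None:
        # Unsupported mode (including math): emit what a fenced code
        # block with the same info string would typically produce.
        cls = html.escape(f"language-{mode}")
        return f'<pre><code class="{cls}">{html.escape(source)}</code></pre>'
    return renderer(source)
```

With an empty registry every display block degrades to a code block; a tool that supports, say, Mermaid would register one function under `"mermaid"` and leave everything else untouched.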
GitHub also added support for Mermaid diagrams somewhat recently. This illustrates the need for a CommonMark specification to guide further implementations.
GitLab has had this support for quite a while, and in fact extends it to PlantUML and Kroki (https://docs.gitlab.com/ee/user/markdown.html#diagrams-and-flowcharts)
They have also had math support for quite a while using `` ```math ``. They originally solved the inline math problem by using a new syntax, `` $`inline math`$ ``. This has also been adopted by GitHub recently, https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/writing-mathematical-expressions#writing-inline-expressions
However the desire for a well-defined and supported implementation for the `$` and `$$` syntax is because that is how math is supported in markdown today, for quite a while. By making it an official part of the spec, it would finally codify it and allow it to be implemented with common rules.
I'm biased in preferring to use `` ```mermaid `` over a new `$$mermaid`. But it would be nice to see the dollar math syntax become official and natively supported by implementations.
Currently, there isn't an official way to differentiate between something that should be rendered or something that should be displayed as text. I have to read through all of what @vassudanagunta linked, but I don't believe the ! was accepted.
Perhaps there doesn’t need to be. I don’t think it breaks with CM when Pulldown turns `` ```math `` into evaluated math. Or when it turns `` ```math eval `` into evaluated math.
Since inline code doesn't have an info string, there is no way to do inline math
There is no info string for inline code either, that might be nice to have too. And “inline math” would not have a way to disambiguate between ascii math or TeX math etc.
So perhaps a proper solution is needed for inline code/math, to allow tagging as a particular language?
It seems to be the status quo for markdown tools that require math support, which is pretty widely desired. Libraries that do not support math tend to not have it just because there is no standard (this was the case with e.g. pulldown-cmark I believe)
I think my first paragraph in this comment answers that need.
Perhaps we can solve this with a recommendation on whether to evaluate things?
I personally like `` ```mermaid eval ``.
That is to say, everything that’s already specified, with the first “word” being used as a programming language as a class, everything after it being ignored, although when the first “word” of it is say `eval`, tools may choose to evaluate the code instead of just showing the code. That should break virtually nothing. It would also allow literate programming languages (`` ```python eval ``?). And be in line with existing markdown (“if it doesn’t render it is still readable”)
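A minimal sketch of that reading of the info string, assuming a hypothetical `parse_info_string` helper and an `evaluate` flag that is set only when the first word after the language name is `eval` (both names are made up for illustration):

```python
from typing import NamedTuple, Tuple


class InfoString(NamedTuple):
    language: str          # first word, typically exposed as a class
    meta: Tuple[str, ...]  # remaining words, which CM leaves unspecified
    evaluate: bool         # True when the meta starts with "eval"


def parse_info_string(info: str) -> InfoString:
    """Split an info string into language, meta words, and an eval flag."""
    words = info.split()
    language = words[0] if words else ""
    meta = tuple(words[1:])
    # Hypothetical convention from the comment above: the first meta
    # word "eval" asks the tool to evaluate rather than display.
    return InfoString(language, meta, bool(meta) and meta[0] == "eval")
```

A tool that knows nothing about the flag still sees `mermaid` or `python` as the language and shows the code as usual, which is the backwards-compatibility argument being made here.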
However the desire for a well-defined and supported implementation for the $ and $$ syntax is because that is how math is supported in markdown today, for quite a while. By making it an official part of the spec, it would finally codify it and allow it to be implemented with common rules.
There is a significant problem with trying to add this: single dollar support will break many existing markdown documents, because single dollars are quite common in (American English) natural language.
I personally like `` ```mermaid eval ``. That is to say, everything that’s already specified, with the first “word” being used as a programming language as a class, everything after it being ignored, although when the first “word” of it is say `eval`, tools may choose to evaluate the code instead of just showing the code. That should break virtually nothing. It would also allow literate programming languages (`` ```python eval ``?). And be in line with existing markdown (“if it doesn’t render it is still readable”)
What do you think of this proposal? I compare the proposals here. `` ```mermaid eval `` is most like Option D, my proposal, except mine avoids English and also is backwards compatible, e.g. with `` ```mermaid `` already established as evaluated by many tools, not just GitHub.
Yep, I also most like opt D. I do prefer that `` ```x `` always means show code instead of adding an `=`. And I do prefer `eval` instead of `()`. I think a keyword is fine in this place, all these programming names are also influenced by English; the language name is used in a class, which to be useful with selectors typically uses letters; and `js ()` vs. `js()` introduces a complexity that “words” doesn’t have
Predefined English keywords are never okay in CM.
a) I don't think that is said in the spec, b) this place is literally about words, the first which is used, the rest currently dropped
(been away for a bit, just getting back)
Currently, there isn't an official way to differentiate between something that should be rendered or something that should be displayed as text. I have to read through all of what @vassudanagunta linked, but I don't believe the ! was accepted.
Perhaps there doesn’t need to be. I don’t think it breaks with CM when Pulldown turns `` ```math `` into evaluated math. Or when it turns `` ```math eval `` into evaluated math.
I think the first example works better for math where `` ```math `` and `` ```tex `` are two obviously different things. It is less clear for something like mermaid where you don't have two names for the code and the render-as.
The `eval` tag or equivalent resolved this though, and I think that could work nicely.
Since inline code doesn't have an info string, there is no way to do inline math
There is no info string for inline code either, that might be nice to have too. And “inline math” would not have a way to disambiguate between ascii math or TeX math etc.
So perhaps a proper solution is needed for inline code/math, to allow tagging as a particular language?
This is something I was considering proposing as well, specifically adopting RST's `` :info:`code string` ``. That is probably worth investigating in any case, I'll create a PR for it when I get the chance.
However the desire for a well-defined and supported implementation for the $ and $$ syntax is because that is how math is supported in markdown today, for quite a while. By making it an official part of the spec, it would finally codify it and allow it to be implemented with common rules.
There is a significant problem with trying to add this: single dollar support will break many existing markdown documents, because single dollars are quite common in (American English) natural language.
I don't think this is likely to be much of a problem because if openers need preceding whitespace and closers need postceding whitespace, phrases like `$100.00 + $20.00` would not match. `$100+$ 20...` would, but I can't think of when somebody would use a standalone dollar sign.
It is also already used by GitHub, GitLab, TeX All the Things, Stack Overflow and others, so I assume it has turned out not to be an issue.
There is a significant problem with trying to add this: single dollar support will break many existing markdown documents, because single dollars are quite common in (American English) natural language.
I don't think this is likely to be much of a problem because if openers need preceding whitespace and closers need postceding whitespace, phrases like $100.00 + $20.00 would not match. $100+$ 20... would, but I can't think of when somebody would use a standalone dollar sign.
I completely agree. And I think the spec laid out by @jgm in https://github.com/jgm/commonmark-hs/blob/master/commonmark-extensions/test/math.md does a great job.
Honestly I'm not crazy about `eval`. I would prefer something along the lines of
```! mermaid
diagram code
```
That feels more "markdownish" to me. I guess because I'm a programmer it makes sense. It would even translate well into inline code, such as `` `! 1 + 2 ` ``, though I have no idea how you mark that as a specific language.
Single dollars are an issue, to some degree, depending on audience.
Many systems using them add some heuristics on whitespace etc. to mitigate this (but there is no consensus, and frequently no docs on exact heuristics!), or require opt-in to enable this syntax for particular user. Search https://github.com/cben/mathdown/wiki/Math-in-MarkDown for `$inline$`.
I like to bring the example of Electronics.SE which deviated from several other SE sites by using `\$inline math\$` syntax. Not sure why, the audience should be comfortable with math, but I presume they already had lots of content with prices and weren't willing to break back-compatibility when they added math support?
GitLab heuristics reject `$10 to $20` as plain text, but render `non-$n^2$-secure`: https://gitlab.com/cben/sandbox#single-dollar-math
Undocumented, just says you can use `$...$`.
GitHub heuristics reject `$10 to $20` as plain text, and also reject `non-$n^2$-secure`: https://github.com/cben/sandbox#single-dollar-math
Again undocumented, just says you can surround the expression with dollar symbols
Oh, a juicy bit they do admit is they do math detection on line-per-line basis. A dollar that's rejected by heuristics may still need escaping if it's on same line with valid math?
So I think single dollars will unavoidably
There is also the matter that people tend to use unmodified off-the-shelf parsers and bolt-on math rendering around it, resulting in bugs e.g. `${a}_{b}$ text ${c}_{d}$` first becoming `${a}<em>{b}$ text ${c}</em>{d}$` where "text" becomes italic and subscripts are not subscripts :neutral_face:
I'd estimate about half of all math-in-markdown implementations I've seen have, or had at some point, such bugs! When pressed they sometimes still avoid carefully modifying a parser by adding pre-processing — reducing but not 100% eliminating bugs...
So I don't have high hopes for `$\latex$` getting standardized. It's fine as optional extra for communities already used to it.
Options IMHO more viable to achieve interoperability:
* `` $`literal-based inline math`$ ``.
  This is already correctly tokenized by any markdown parser as a literal surrounded by dollars; and it's safe to render in post-processing alone :sparkles: without compromising correctness.
* `$$math$$` both for inline and display (a paragraph containing only math becomes display).
* `` ```mermaid``` `` vs. `` ```!mermaid``` `` than introducing a new char.
To put it another way: the reason I prefer literal-based syntaxes for math is better interop with existing parsers. Proclaiming `$$` delimiters shall have the same literal powers is neat but would take a long time to materialize actual support.
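To illustrate why the literal-based form survives post-processing: an unmodified CommonMark renderer turns a dollar-wrapped code span into `$<code>…</code>$`, with the math already protected from emphasis parsing and already HTML-escaped. A rough sketch of the post-processing step follows; the function name and the `math inline` class are assumptions for the example, not any tool's actual output.

```python
import re

# Matches a code span that the markdown renderer emitted between two
# literal dollar signs, e.g. "$<code>x_1</code>$".
MATH_CODE = re.compile(r"\$<code>(.*?)</code>\$", re.DOTALL)


def mark_inline_math(rendered_html: str) -> str:
    """Rewrite $<code>...</code>$ into a span a math renderer can pick up."""
    return MATH_CODE.sub(
        lambda m: f'<span class="math inline">{m.group(1)}</span>',
        rendered_html,
    )
```

Because underscores inside a code span are never treated as emphasis, an input like `${a}_{b}$` written in this form reaches the math renderer intact, and ordinary code spans without surrounding dollars are left alone.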
Pandoc has pretty solid heuristics for `$` delimited math; these cover all the cases that GitHub can't handle. But getting this right is a bit complex.
For djot I just went with `$` or `$$` followed by a verbatim (backtick) span. That's similar to GitHub's syntax (which I didn't know about then), but without the following `$`, and unlike GH's it offers a way to do display math as a component of a paragraph (rather than a separate block).
Many systems using them add some heuristics on whitespace etc. to mitigate this (but there is no consensus, and frequently no docs on exact heuristics!)
FWIW this describes the way I do it in `cmarkit`.
I opened a PR to start discussion on info strings for inline code: https://github.com/commonmark/commonmark-spec/pull/750
I think that a straightforward question that we could probably answer now is, do we want to support new syntax at all (likely something `$` related) or rely on info strings? Even if not the "display block" idea proposed here
Since so many implementations support blocks with `$$\n...\n$$` and it is well known and used, I do feel like it would be a bit of a miss to not support that syntax in some form, since it is an opportunity for CM to help unify what is out there. I don't think I have seen `` ```math `` used as much in practice, even though support is out there. Inline style is less unanimous.
The eval tag or equivalent resolved this though, and I think that could work nicely.
Perfect! :)
don't think this is likely to be much of a problem because if openers need preceding whitespace and closers need postceding whitespace
I have doubts. In CM we have another place where whitespace is important, emphasis/strong. That works quite badly in languages that don’t use (much) whitespace such as Chinese. There are issues about that here (and I get more in my projects).
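The whitespace rule quoted above, and the concern about text that does not separate words with spaces, can be sketched as follows. The regular expression and the `find_inline_math` name are illustrative assumptions; no implementation mentioned in this thread is claimed to work exactly this way.

```python
import re
from typing import List

# Opener: "$" at the start of the text or right after whitespace.
# Closer: "$" at the end of the text or right before whitespace.
# The content may not contain "$" or a line break.
INLINE_MATH = re.compile(r"(?<!\S)\$([^$\n]+?)\$(?!\S)")


def find_inline_math(text: str) -> List[str]:
    """Return the contents of every span the whitespace rule accepts."""
    return INLINE_MATH.findall(text)


print(find_inline_math("$100.00 + $20.00"))      # no closer before whitespace
print(find_inline_math("$100+$ 20..."))          # accepted as math
print(find_inline_math("Let $x + y$ be given"))  # accepted as math
print(find_inline_math("设$x$为变量"))            # no whitespace around "$"
```

Under this rule the price example is left alone, but so is math embedded in running text without surrounding spaces, which is the same class of problem reported for emphasis.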
I feel better about reusing the syntax for code, and adding onto it in the currently specifically unused space: meta string after the first word.
Honestly I'm not crazy about eval. I would prefer something along the lines of
```! mermaid
diagram code
```
That feels more "markdownish" to me. I guess because I'm a programmer it makes sense. It would even translate well into inline code, such as `` `! 1 + 2 ` ``, though I have no idea how you mark that as a specific language.
I feel like this `!` is quite hard to explain to users tho. It isn’t very obvious that `` ```! js `` is asking a tool to render/evaluate that code, `` ```js render ``, `` ```js eval ``, or `` ```js evaluate `` does seem like that, to me.
There’s also the thing where this breaks every tool that deals with markdown that exists. `` ```! js `` will result in `<code class="!">` everywhere that doesn’t get updated. It’s going to take years for that to change on GH and many other places.
This meta space after the language name is explicitly ignored by CM, any markdown tool can start supporting it already. And if they don’t support it yet, the code will be displayed, likely syntax highlighted.
Using the meta space also doesn’t break mermaid on GH.
I’m not going to block `` ```js ! ``, but I feel like letters are nicer looking / easier to explain.
Predefined English keywords are never okay in CM.
a) I don't think that is said in the spec, b) this place is literally about words, the first which is used, the rest currently dropped
Yes, but the spec doesn't specify any words, and the de facto standard that the language of the code block or the filename extension associated with said language is the first token of the info string doesn't introduce English into the spec/standard. The names “Rust”, “Javascript”, “Markdown” and their filename extensions are crosslingual names and identifiers.^1
I feel like this `!` is quite hard to explain to users tho. It isn’t very obvious that `` ```! js `` is asking a tool to render/evaluate that code, `` ```js render ``, `` ```js eval ``, or `` ```js evaluate `` does seem like that, to me.
I’m not going to block `` ```js ! ``, but I feel like letters are nicer looking / easier to explain.
I think for the person reading the Markdown as content ("the reader") rather than as source code, which as you've mentioned is something Markdown supports as a priority, it doesn't matter that they don't know what the `!` means. They will be reading the source code within the block as-is regardless. In fact, keeping the `eval` directive to a single character is an advantage; it's easy for the reader to ignore.
It only matters to the person writing the Markdown as source code for rendering ("the writer"). It is reasonable to expect the writer to learn this difference, just as they must learn the other rules of fenced code blocks that impact rendering and all the other rules of Markdown.
The reader doesn't need to know any of those rules. Markdown's structure is designed to be self-evident for readers. None of the intricacies in the extremely long CommonMark spec matter to the reader, only to the writer, as long as the writer doesn't abuse those intricacies to produce content without self-evident structure, or use any of the writer conveniences that are reader-unfriendly (e.g. lazy continuation) unless they choose to dispense with Markdown's reader friendliness. Such a choice will naturally have the exact same property of self-evidence, because a writer is always also a reader. A writer can only willfully produce reader-unfriendly content with Markdown.
I don't remember if the above distinction between reader and writer and the Markdown goals for each is articulated anywhere in the CommonMark spec or website, or on Gruber's website.
There’s also the thing where this breaks every tool that deals with markdown that exists. `` ```! js `` will result in `<code class="!">` everywhere that doesn’t get updated. It’s going to take years for that to change on GH and many other places.
This meta space after the language name is explicitly ignored by CM, any markdown tool can start supporting it already. And if they don’t support it yet, the code will be displayed, likely syntax highlighted.
Using the meta space also doesn’t break mermaid on GH.
Yes, we absolutely cannot break things other than misused corner cases. Such changes really belong in a new language, e.g. djot.
de facto standard
From the spec:
The first word of the info string is typically used to specify the language of the code sample, and rendered in the class attribute of the code tag. However, this spec does not mandate any particular treatment of the info string.
If this behavior is de facto but not actual spec, I would say we should take that consensus and add it more explicitly here.
On the other hand, if this entire behavior is indeed de facto spec instead of actual spec, then we don’t need to have much of this conversation. We can add another tip about using `eval`, `évaluer`, or `!`, up to the compiler?
Yes, but the spec doesn't specify any words,
It does say “word”. “First word”. So there is also an end of a word, a word break. So there can be two words.
in fact, keeping the eval directive to a single character is an advantage; it's easy for the reader to ignore.
It is reasonable to expect the writer to learn this difference, just as they must learn the other rules of fenced code blocks that impact rendering and all the other rules of Markdown.
That second case is why I think 1 punctuation character is less ideal.
Especially punctuation in that space of words, is it part of the first word (`` ```js! ``)? Is it a word on its own (`` ```js ! ``)? Folks will mix that up and not see the difference.
I don't remember if the above distinction between reader and writer
I don’t think so, but I see it too.
Especially punctuation in that space of words, is it part of the first word (`` ```js! ``)? Is it a word on its own (`` ```js ! ``)? Folks will mix that up and not see the difference.
There are quite a few programming language names that end in punctuation: C--, C++, C#, F#, F*, J#, J++, M#, P′′, Q#, R++, Visual J++, X++, xBase++, Z++. None on that list end in `!`, at least not yet anyway. I don't think that rules in or rules out `` ```js ! ``. It wouldn't be hard to treat `` ```js! `` as `` ```js ! `` as long as `js!` never becomes a file extension or language name. But I can also see the counter argument.
Worthy of note: A not small number of language names have more than one word, so "The first word of the info string is typically used to specify the language" has limitations, or needs clarification. Not sure what people do for code blocks of those languages today. They might be so much in the minority that no one hears them. Maybe the name is turned into a single hyphenated token, which you'd need to do for the CSS class attribute anyway.[^1] As practical as that solution may be, it feels like a hack.
I looked into how GitHub handles it[^2]. Based on a single test, it was hyphens. `` ``` Common Lisp `` and `` ``` CommonLisp `` both failed to produce highlighted syntax, while `` ``` Common-Lisp `` did.
I'd say having an "info string" standard would be useful, and maybe it should be separate from CommonMark, as other formats support something similar if not identical, e.g. djot (cc @jgm). Even if said standard were as open-ended as it is today, but with additions for multi-word language names and optional explicit directives for render source vs eval semantics.
[^1]: not ideal that HTML ends up being the de facto driver of de facto Markdown standards. Understandable, but not ideal.
[^2]: not ideal that GitHub/Microsoft ends up driving de facto standards. Understandable why they do it, not cool that we all roll over. Prime example is their new admonition syntax. They started with one unilaterally designed syntax, and then changed their mind and chose one even more geared toward vendor lock-in, the admonition syntax used by Microsoft
not ideal that GitHub/Microsoft ends up driving de facto standards. Understandable why they do it, not cool that we all roll over. Prime example is their new admonition syntax.
I think it goes to the fact that the community, as far as I know, has gotten stuck defining any extensions. Lots of good ideas talked about but nothing ever decided. I'm thankful that `jgm` has pushed forward and implemented various extension specs, such as definition lists and dollar math syntax. But nothing is ever codified by the community. 🤷
I feel like this `!` is quite hard to explain to users tho. It isn’t very obvious that `` ```! js `` is asking a tool to render/evaluate that code, `` ```js render ``, `` ```js eval ``, or `` ```js evaluate `` does seem like that, to me.
It's not too different than trying to explain that putting a `!` in front of a link pulls an image.
There’s also the thing where this breaks every tool that deals with markdown that exists. `` ```! js `` will result in `<code class="!">` everywhere that doesn’t get updated. It’s going to take years for that to change on GH and many other places.
You're right, those that don't upgrade their parser to the latest CommonMark spec won't support the additional syntax. So if you try to use the newer syntax on an older implementation, it does seem to render the code block (at least here on GH), but does not syntax highlight it. Less than optimal, but not completely broken.
I’m not going to block `` ```js ! ``, but I feel like letters are nicer looking / easier to explain.
There was quite a bit of feedback on the admonition implementation thread where some folks were upset with using English words for the syntax. That would be one reason to favor using punctuation.
Note that `cmark-gfm` provides an option, `--full-info-string`, that "Include remainder of code block info string in a separate attribute."
It doesn't seem like they have that turned on in the comments, but it might be turned on in file rendering - their rendering is different between the two cases. GitLab does have that enabled, so
```ruby something
x = 1
```
yields
x = 1
I had forgotten, but GitLab does support a certain syntax for this
```json:table
{
    "items" : [
      {"a": "11", "b": "22", "c": "33"}
    ]
}
```
The language string becomes `json:table`, and we detect that and render an embedded table with the json, https://docs.gitlab.com/ee/user/markdown.html#json
Of course, like any renderer, if you don't know what the language is, the code block is still rendered. Not saying this is better than maybe ```` ```json table ````
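As a rough sketch of that detect-and-fall-back behaviour (not GitLab's actual code: `render_json_table` and `render_code_block` are hypothetical helpers, and the table layout is an assumption), a renderer could look at the info string and only build a table when the JSON parses:

```python
import html
import json


def render_code_block(info: str, source: str) -> str:
    """Ordinary fenced-code output: first info word becomes a class."""
    words = info.split()
    cls = f' class="{html.escape("language-" + words[0])}"' if words else ""
    return f"<pre><code{cls}>{html.escape(source)}</code></pre>"


def render_json_table(info: str, source: str) -> str:
    """Render a json:table block as a table, else as a code block."""
    if info.strip() != "json:table":
        return render_code_block(info, source)
    try:
        items = json.loads(source)["items"]
        columns = list(items[0].keys())  # assumption: first item names the columns
        header = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
        rows = "".join(
            "<tr>"
            + "".join(f"<td>{html.escape(str(item[c]))}</td>" for c in columns)
            + "</tr>"
            for item in items
        )
    except (ValueError, KeyError, IndexError, TypeError, AttributeError):
        # Invalid or unexpected JSON: the block is still shown as code.
        return render_code_block(info, source)
    return f"<table><tr>{header}</tr>{rows}</table>"
```

Note that strict JSON parsers reject a trailing comma, so a block that contains one would take the code-block path in this sketch.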
I also wonder if it wouldn't be better to adopt jgm's attribute syntax and leverage that for providing these extra abilities, https://github.com/jgm/commonmark-hs/blob/master/commonmark-extensions/test/attributes.md
IMHO, I like the idea of the "Generative code" both block and inlines. Nevertheless, I would prefer `$2 + 2${asciimath}` for tagging which language is used for the generation of the element. Moreover, I would also like to consider `$2 + 2${python}` as generated content, even though in this case the content is interpolated as "4" and it is plain text. This could also pave the road to do things like
$${python}
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('_mpl-gallery')
# make data
np.random.seed(1)
x = 4 + np.random.normal(0, 1.5, 200)
# plot:
fig, ax = plt.subplots()
ax.ecdf(x)
plt.show()
$$
as more people get used to the idea that code between dollar signs returns something that is not the code, but generated content.
This proposed change defines "display blocks" and "display spans" that are meant to process their contents for rendering in some way, rather than being displayed as raw text.
Motivation
Math support is a discussion that seems to come up quite frequently; pandoc has an extension, StackExchange supports math expressions via MathJax, and both GitHub and GitLab support them too. GitHub also added support for Mermaid diagrams somewhat recently. This illustrates the need for a CommonMark specification to guide further implementations.
This PR defines "display blocks" that follow the same definition rules as code blocks but are intended to render their content into a display form, rather than a verbatim representation. By default these should process the data as TeX and output MathML, but the info string can be used to change the renderer to something like `asciidoc`, `mermaid`, `graphviz`, or `svg`.