codemirror / dev

Development repository for the CodeMirror editor project
https://codemirror.net/

Markdown highlighter #158

Closed RudeySH closed 3 years ago

RudeySH commented 4 years ago

The documentation only mentions the JavaScript, CSS and HTML highlighters. Is there no markdown highlighter yet?

leeoniya commented 4 years ago

@jdbruxelles this repo is for the next version of Codemirror (v6), which is a complete rewrite of v5. those docs are for v5.

jdbruxelles commented 4 years ago

Sorry, I just noticed and deleted my answer. But you were too fast :-D

curran commented 4 years ago

Markdown support would need to come from Lezer. Currently the Lezer Grammars do not include Markdown. I have created a new issue in Lezer: Markdown Grammar.

marijnh commented 4 years ago

I doubt Lezer is suitable for parsing Markdown. For languages like this, which are not really expressible as context-free grammars, we'll need to use the stream-syntax package. Porting the old CodeMirror 5 markdown mode to that shouldn't be too hard, though that code is a bit of a mess too (which isn't surprising, given how tricky parsing Markdown is).

curran commented 4 years ago

That's fascinating! I had no idea that Markdown is not really expressible as a context-free grammar. An interesting related read: https://roopc.net/posts/2014/markdown-cfg/ .

RudeySH commented 4 years ago

Would it be possible to support something like highlight.js?

craftzdog commented 4 years ago

I wonder whether remark-parse would be suitable as a Markdown parser for CodeMirror 6, since it is flexible and extensible. Cc: @wooorm

andy0130tw commented 4 years ago

Seems that someone had written a CommonMark grammar in tree-sitter. I also use it to power a CodeMirror 6 editor as a PoC (without incremental parsing). But the tree-sitter code is arcane, so I started my version here by following his ideas, but in Lezer.

marijnh commented 4 years ago

The idea of trying to parse CommonMark with an LR parser is very scary... but if it works, that would be really helpful.

andy0130tw commented 4 years ago

Without having reviewed it completely, I guess the intent is to (ab)use the custom scanner to tokenize nearly everything. Most CommonMark implementations I have seen use at least two passes: first to identify block structures, then to parse the text within them. Markdown editors, to my knowledge, often exploit this to re-parse and render only the affected blocks in the second pass. Doing so in a single pass, or even encapsulated in a single scanner function, is ... rather astonishing.

My plan is using a nested grammar to parse inline text. But the efficiency of LR parsing is likely to be lost.
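For readers unfamiliar with the two-pass idea, here is a minimal sketch of it. All names (`parseBlocks`, `parseInline`, `inlineForEdit`) are hypothetical and the rules are far simpler than CommonMark: blocks are just runs of non-blank lines, and the only inline construct is `*emphasis*`.

```typescript
// Hypothetical two-pass sketch; not CommonMark-compliant and not any library's API.
type Block = { kind: "heading" | "paragraph"; from: number; to: number; text: string };
type Inline = { kind: "text" | "emphasis"; text: string };

// Pass 1: split the document into blocks separated by blank lines.
function parseBlocks(doc: string): Block[] {
  const blocks: Block[] = [];
  let pos = 0;    // offset of the current line in the document
  let start = 0;  // offset where the pending block began
  let buf: string[] = [];
  const flush = (end: number): void => {
    if (buf.length === 0) return;
    const text = buf.join("\n");
    const kind = /^#{1,6} /.test(text) ? "heading" : "paragraph";
    blocks.push({ kind, from: start, to: end, text });
    buf = [];
  };
  for (const line of doc.split("\n")) {
    if (line.trim() === "") {
      flush(pos - 1);
    } else {
      if (buf.length === 0) start = pos;
      buf.push(line);
    }
    pos += line.length + 1;
  }
  flush(pos - 1);
  return blocks;
}

// Pass 2: parse inline content of a single block (only *emphasis* here).
function parseInline(text: string): Inline[] {
  const out: Inline[] = [];
  const re = /\*([^*\n]+)\*/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    if (m.index > last) out.push({ kind: "text", text: text.slice(last, m.index) });
    out.push({ kind: "emphasis", text: m[1] ?? "" });
    last = m.index + m[0].length;
  }
  if (last < text.length) out.push({ kind: "text", text: text.slice(last) });
  return out;
}

// Run the inline pass only for the block that contains an edit position.
function inlineForEdit(doc: string, editPos: number): Inline[] {
  const block = parseBlocks(doc).find(b => editPos >= b.from && editPos <= b.to);
  return block ? parseInline(block.text) : [];
}
```

Note that this sketch still re-runs the block pass over the whole document; only the inline pass is restricted to the edited block.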

tjquillan commented 4 years ago

Admittedly not very knowledgeable on this but could this be of help: https://github.com/remarkjs/remark/tree/master/packages/remark-parse?

cben commented 4 years ago

Real markdown implementations tend to (1) have many extensions and (2) be processed (preview/export) in the same web app.

Therefore, it'd be exciting if we could wire one of the major implementations e.g. markdown-it directly into CodeMirror :pray: But I don't know if any of the major parser libraries has an incremental interface, plus AST->source position mapping...

There is also a complication that during editing you have incomplete constructs, and want different error handling from final parsing. E.g. the spec says foo *bar is not italic, it's just a literal *; but during editing you'd want it highlighted as italic, since it's likely to be closed later. Or maybe you want that only as long as the cursor is in the same line/paragraph. So ideally you need a parser that can output "this is either literal *bar OR italic bar, I'm not sure yet."
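To make the "not sure yet" output concrete, here is a rough sketch of a line tokenizer that emits a separate `maybe-emphasis` kind for an unclosed `*`. The `Span` type and `tokenizeLine` are hypothetical, and the matching rule (first `*` opens, next `*` closes) ignores the real CommonMark delimiter rules.

```typescript
// Hypothetical sketch: an unclosed "*" is reported as tentative, not as literal text.
type Span = { from: number; to: number; kind: "text" | "emphasis" | "maybe-emphasis" };

function tokenizeLine(line: string): Span[] {
  const spans: Span[] = [];
  const pushText = (from: number, to: number): void => {
    if (to > from) spans.push({ from, to, kind: "text" });
  };
  let pos = 0;
  while (pos < line.length) {
    const open = line.indexOf("*", pos);
    if (open < 0) {
      pushText(pos, line.length);
      break;
    }
    pushText(pos, open);
    const close = line.indexOf("*", open + 1);
    if (close < 0) {
      // No closing delimiter yet: the user may still be typing it.
      spans.push({ from: open, to: line.length, kind: "maybe-emphasis" });
      break;
    }
    spans.push({ from: open, to: close + 1, kind: "emphasis" });
    pos = close + 1;
  }
  return spans;
}
```

A highlighter could then style `maybe-emphasis` like italic while the cursor is on that line and fall back to plain text otherwise; that policy is left out here.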

I wonder if anybody has experimented with horizontally flipped PEG grammars, so that Packrat/Pika memoization can apply to left prefixes? [Confession: I hadn't studied Lezer yet, nor read the Pika paper.] EDIT: but see caveats in https://marijnhaverbeke.nl/blog/lezer.html that anything memoizing state per input character would be inherently inefficient.

cben commented 4 years ago

About 2-pass block structure, then inline structure: this also fits with increasing use of fenced blocks to embed different syntaxes inside markdown (LaTeX, mermaid diagrams, etc.).

Does codemirror 6 have an official approach to "multiplex"/compose multiple parsers in one editor? Ideally with external parser determining the boundaries of the inner one?
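As an illustration of what "outer parser determines the boundaries" could mean, here is a small sketch in which an outer scanner finds fenced regions and hands each one to an inner tokenizer chosen by the info string. Everything here (`findFencedRegions`, `innerParsers`, `parseNested`) is made up for the example and is not a CodeMirror or Lezer interface.

```typescript
// Hypothetical sketch of outer/inner parser composition via fenced blocks.
const FENCE = "`".repeat(3);

type Region = { lang: string; from: number; to: number };

// Outer pass: locate the content range of each closed fenced block.
function findFencedRegions(doc: string): Region[] {
  const regions: Region[] = [];
  let pos = 0;
  let open: { lang: string; from: number } | null = null;
  for (const line of doc.split("\n")) {
    if (line.startsWith(FENCE)) {
      if (open === null) {
        // Content starts on the line after the opening fence.
        open = { lang: line.slice(FENCE.length).trim(), from: pos + line.length + 1 };
      } else {
        // Content ends where the closing fence line starts.
        regions.push({ lang: open.lang, from: open.from, to: pos });
        open = null;
      }
    }
    pos += line.length + 1;
  }
  return regions;
}

// Inner "parsers": trivial stand-ins keyed by info string.
const innerParsers: Record<string, (src: string) => string[]> = {
  js: src => src.split(/\s+/).filter(t => t.length > 0),
};

function parseNested(doc: string): { region: Region; tokens: string[] }[] {
  return findFencedRegions(doc).map(region => {
    const inner = innerParsers[region.lang];
    const tokens = inner ? inner(doc.slice(region.from, region.to)) : [];
    return { region, tokens };
  });
}
```

An unclosed fence yields no region in this sketch; an editor would more likely treat it as running to the end of the document.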

marijnh commented 4 years ago

Lezer has a concept of nested grammars, but since it probably can't reasonably handle Markdown, that solution doesn't apply here.

curran commented 4 years ago

I'm curious what might be the best way to approach adding Markdown syntax highlighting?

Might it be possible to port the solution from CM5?

marijnh commented 4 years ago

Yes, I think that's the most realistic approach for the moment.

That code—the CM5 Markdown mode—is really bad though, so if anyone has the time (and the patience for reading the CommonMark spec), it would be worthwhile to clean it up or even rewrite it.

curran commented 4 years ago

Looks like pretty solid tests https://github.com/codemirror/CodeMirror/blob/master/mode/markdown/test.js

Perhaps those tests can be preserved and the internals can be re-done based on those tests.

I'm curious to dig a little deeper, though, what is it that seems bad about the CM5 implementation? Asking so we can avoid that kind of badness in the CM6 port :)

wooorm commented 4 years ago

I was pinged earlier in this thread and am currently working on a new markdown parser. Markdown is complex, as noted earlier on. I don’t know the internals of CodeMirror. But micromark might be of interest for someone working on this.

marijnh commented 4 years ago

> what is it that seems bad about the CM5 implementation?

It's grown organically from an initial simplistic implementation through patches from various contributors, without anyone really taking responsibility for it (I didn't write the initial version, only fixed some bugs). As such, it's inconsistent and messy and I'm pretty sure there is code in there that isn't even doing anything anymore.

marijnh commented 4 years ago

micromark's stream functionality does look interesting. The interface isn't documented enough for me to be sure, but it might be possible to build a streaming CodeMirror mode on top of that, if we add a way to copy a stream state (for reuse of a partial parse when content in the middle of the document changes).
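A rough sketch of what "copying a stream state" buys you, using a toy line tokenizer rather than micromark (whose interface is not assumed here). `State`, `copyState`, `tokenLine` and `LineCache` are all hypothetical names: a snapshot of the state is stored before every line, so after an edit parsing can resume from the nearest snapshot instead of from the top.

```typescript
// Hypothetical sketch of resuming a stateful line tokenizer from a copied state.
const FENCE = "`".repeat(3);

type State = { inFence: boolean };
const copyState = (s: State): State => ({ ...s });

// Advances `state` over one line and returns a style name for that line.
function tokenLine(line: string, state: State): string {
  if (line.startsWith(FENCE)) {
    state.inFence = !state.inFence;
    return "fence";
  }
  return state.inFence ? "code" : "text";
}

class LineCache {
  // states[i] is a snapshot of the state before line i; the last entry is the end state.
  private states: State[] = [];
  styles: string[] = [];
  linesParsed = 0; // only here to make the saved work visible

  // Re-tokenize `lines` starting at `fromLine`, reusing everything before it.
  parse(lines: string[], fromLine = 0): void {
    const start = Math.max(0, Math.min(fromLine, lines.length, this.states.length - 1));
    const saved = this.states[start];
    const state: State = saved ? copyState(saved) : { inFence: false };
    this.states.length = start;
    this.styles.length = start;
    for (let i = start; i < lines.length; i++) {
      this.states.push(copyState(state));
      this.styles.push(tokenLine(lines[i] ?? "", state));
      this.linesParsed++;
    }
    this.states.push(copyState(state));
  }
}
```

This version always re-tokenizes to the end of the document. A less naive one could stop as soon as the state after a line equals the old snapshot at that position.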

curran commented 4 years ago

That would be amazing to build a mode where micromark is a dependency, rather than re-doing all the hard parser work.

wooorm commented 4 years ago

API isn’t documented enough indeed. Working on extensions and integrating it into existing stuff first right now, so that it can be properly tested.

Proper streaming of markdown is impossible, because the last line of a document ([x]: y) can seriously impact how the first line compiles ([x]). But the micromark streaming interface at least does the block parsing when stuff is passed, and inline when complete, so there is some benefit.

The big problem with markdown is that those references aren’t just two paths, a) [x] or b) <a>...</a>: there are more potential paths (such as for [[x]](https://example.com)) where parsing is completely different. I’ve been thinking of adding an optional more “stable” version of link references to micromark, which will help streaming and hence might help CodeMirror too.
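The dependency on later lines can be shown with a tiny sketch. It is not micromark code, and both function names are hypothetical; label matching is reduced to lowercasing, which is simpler than what CommonMark specifies. The point is only that `[x]` cannot be classified until the whole document has been scanned for definitions.

```typescript
// Hypothetical sketch: whether "[x]" is a link depends on a definition anywhere in the document.

// Collect labels defined by lines such as "[x]: y" (heavily simplified).
function collectDefinitions(doc: string): Set<string> {
  const defs = new Set<string>();
  for (const line of doc.split("\n")) {
    const m = /^\[([^\]]+)\]:\s*\S+/.exec(line);
    if (m) defs.add((m[1] ?? "").toLowerCase());
  }
  return defs;
}

// Classify shortcut references "[x]" (not followed by ":", "(" or "[").
function classifyShortcuts(doc: string): { label: string; isLink: boolean }[] {
  const defs = collectDefinitions(doc);
  const out: { label: string; isLink: boolean }[] = [];
  const re = /\[([^\]]+)\](?![:(\[])/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(doc)) !== null) {
    const label = m[1] ?? "";
    out.push({ label, isLink: defs.has(label.toLowerCase()) });
  }
  return out;
}
```

With this, `classifyShortcuts("[x]\n\n[x]: y")` reports `[x]` as a link, while the same first line without the trailing definition is reported as plain text, so a streaming tokenizer that has only seen the first line cannot know which it is.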

marijnh commented 4 years ago

> Proper streaming of markdown is impossible, because the last line of a document ([x]: y) can seriously impact how the first line compiles ([x]). But the micromark streaming interface at least does the block parsing when stuff is passed, and inline when complete, so there is some benefit.

Ah, that doesn't sound promising. For the editor mode, we don't need rendering, just (accurate) tokenizing, and we do need either full streaming or incremental re-parsing.

wooorm commented 4 years ago

The tokenizing is accurate: every character is accounted for and labeled. Full streaming while being fully compliant with CommonMark is impossible. But streaming with 99% CommonMark compliance, using the link idea I mentioned just now, is possible if you need tokens instead of an output buffer (e.g., HTML).

I don’t know enough about incremental reparsing and how that works here to advise on that, but that may just be possible. Everything is implemented in micromark as if it was streaming. I’m assuming it’s good enough to find a block or so before the point where someone is editing, and parse a bit from there?

andy0130tw commented 4 years ago

As for incremental parsing, I found ToastMark promising; it is built upon CommonMark's reference implementation and already targets use in a code editor. I doubt it can be easily wrapped up as a CodeMirror syntax service though.

curran commented 3 years ago

Building an MDX parser would be an interesting challenge. https://mdxjs.com/

# Hello, *world*!

Below is an example of JSX embedded in Markdown. <br /> **Try and change
the background color!**

<div style={{ padding: '20px', backgroundColor: 'tomato' }}>
  <h3>This is JSX</h3>
</div>

Possible to compose a JSX and Markdown parser into one?

wooorm commented 3 years ago

MDX and micromark are maintained by the same folk, so there would be no challenge at all ;)

linonetwo commented 3 years ago

It sounds pretty hard to write an LR grammar for this. Would it be even harder to write a TiddlyWiki wikitext parser, which includes HTML, wikitext markup and macros?

I want to create a WYSIWYG editor for it.

marijnh commented 3 years ago

(I'm currently experimenting with writing a Markdown parser as a hand-written incremental parser that emits Lezer trees. This is an attempt to allow nesting of other parsers, for HTML and code blocks, while still being incremental throughout.)

GeorgeNance commented 3 years ago

@marijnh Any updates on this? 😄

marijnh commented 3 years ago

Yes, I've written it, and it seems to work well, but getting nested parsing across parser systems to work requires some updates to the lezer interface, and I'm still working those out before I put out a new release of the various affected packages. I really really hope to have that done later this week, but it's not always easy to know in advance how quickly that kind of system design work will fall into place.

marijnh commented 3 years ago

I just released 0.15.0 with a first version of a lang-markdown package.

marijnh commented 3 years ago

Closing this—feel free to continue the conversation or open new issues for specific problems in the parser.

marijnh commented 3 years ago

Possibly relevant for people who subscribed to this: the markdown parser now supports extensions to add custom syntax.

RonaldTreur commented 3 years ago

Awesome news @marijnh, much appreciated! Looking forward to diving into this!

> Possibly relevant for people who subscribed to this: the markdown parser now supports extensions to add custom syntax.

craftzdog commented 3 years ago

Incredible!!

Kamahl19 commented 5 months ago

@marijnh Would it be possible to support MDX via markdown parser extension?