ariabuckles / simple-markdown

JavaScript markdown parsing, made simple
MIT License
516 stars 102 forks

Return start/end positions on AST nodes #88

Open jamiebuilds opened 4 years ago

jamiebuilds commented 4 years ago

It would be really nice if AST nodes had start/end positions for every node:

parse("Hello **World**")
[
  {
    type: "text",
    content: "Hello ",
    loc: { start: 0, end: 6 },
  },
  {
    type: "strong",
    content: [
      {
        type: "text",
        content: "World",
        loc: { start: 8, end: 13 },
      }
    ],
    loc: { start: 6, end: 15 },
  },
]

Right now it's impossible to reconstruct this data from just the AST.

ariabuckles commented 4 years ago

Hey @jamiebuilds! I follow you on Twitter! (also interviewed to join the Discord product infra team ~6 months ago but sadly got a really good offer from another place that felt like a slightly better fit. Sad to miss the opportunity to work with you though!)

This has been something I've been interested in for a while (it's always tough to carve out time to work on this, so I've got drafts of thinking about this from quite a while back).

There's a bit of a challenge here because the rules don't by default return where inside them their content starts. As far as I'm aware there's not a great API for finding out, from a regular expression, where a capturing group starts in the source string:

> /^\*\*((?:\\[\s\S]|[^\\])+?)\*\*(?!\*)/.exec("**World**")
[
  '**World**',
  'World',
  index: 0,
  input: '**World**',
  groups: undefined
]
> 

(It looks like the groups property for named groups doesn't capture that either.)
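For reference, newer JavaScript engines support the `d` (hasIndices) flag from ES2022, which adds an `indices` array to the match result. The following is a minimal sketch, assuming a runtime that supports that flag (for example Node 16 or later). Whether simple-markdown could rely on it for all of its supported environments is a separate question.

```javascript
// Same strong-emphasis pattern as above, with the `d` flag added so the
// match result carries [start, end) offsets for every capturing group.
const strong = /^\*\*((?:\\[\s\S]|[^\\])+?)\*\*(?!\*)/d;

const match = strong.exec("**World**");

// indices[0] covers the whole match, indices[1] the first capturing group.
const [matchStart, matchEnd] = match.indices[0];
const [contentStart, contentEnd] = match.indices[1];

console.log({ matchStart, matchEnd, contentStart, contentEnd });
// → { matchStart: 0, matchEnd: 9, contentStart: 2, contentEnd: 7 }
```

The offsets are relative to the string passed to `exec`, so a parser that slices the source as it goes would still need to add its own running position.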

We could definitely add the ability to specify the index where the start of the result is outside of the regex, and maybe use some indexOf hacks for matcher functions that don't return that information.

My guess is that we could add another function to a rule, like location: (capture: Capture) => { start: number, end: number }. Then, in the parser, we check whether a state flag is turned on and, if so, call location or fall back to a default based on indexOf and .length (or return null?).
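A rough sketch of that idea follows. Everything named here is hypothetical: `location` is not an existing simple-markdown rule field, and `trackLocations` and `contentLocation` are made-up names used only for illustration.

```javascript
// Hypothetical rule shape: `location` reports where the inner content sits,
// relative to the start of the matched text.
const strongRule = {
  match: (source) => /^\*\*((?:\\[\s\S]|[^\\])+?)\*\*(?!\*)/.exec(source),
  location: (capture) => ({ start: 2, end: 2 + capture[1].length }),
};

// Hypothetical parser helper: only does work when the state flag is on.
function contentLocation(rule, capture, state) {
  if (!state.trackLocations) return null;

  // Prefer the rule's own answer when it provides one.
  if (typeof rule.location === "function") return rule.location(capture);

  // Fallback heuristic: find the first capture inside the full match.
  const content = capture[1];
  if (content == null) return null;
  const start = capture[0].indexOf(content);
  return start === -1 ? null : { start, end: start + content.length };
}

const capture = strongRule.match("**World** and more");
const state = { trackLocations: true };

const fromRule = contentLocation(strongRule, capture, state);
const fromFallback = contentLocation({}, capture, state);
const disabled = contentLocation(strongRule, capture, { trackLocations: false });

console.log(fromRule, fromFallback, disabled);
// → { start: 2, end: 7 } { start: 2, end: 7 } null
```

The indexOf fallback picks the first occurrence of the captured text, so it can report the wrong position whenever that text also appears earlier in the match. That is one reason a rule-provided `location` would be the more reliable path.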

Of course, doing that in a way that doesn't increase the bundle size a bunch is tricky, but I've been needing to do some optimizations there for a while anyways.

tbh I'm not coping super well with COVID and the world right now. I'm managing about enough daily energy to do work and acquire food, and that's... basically it. If you're interested in trying to prototype something for this I'd be happy to help with guidance/code reviews. Otherwise I'll try to prioritize this soon, but probably won't be able to get to it for a while, as it's a pretty significant task.

jamiebuilds commented 4 years ago

Hey @ariabuckles!

I've experimented with some of this. I built a small editor on top of Slate that uses some of the same code structures as simple-markdown.

[Screenshot of the editor: Screen Shot 2020-06-18 at 11.24.30 AM]

In that editor, I apply all the styles as column start-end Ranges, as a sort of decorator.

I ended up using the same sort of parse method as SM, but for the same reason you mentioned about capture groups not getting indexes, I added an offset param to help track the source positions.

export let underlineRule: Rule = {
    kind: "underline",
    match: /^__((?:\\[\s\S]|[^\\])+?)__(?!_)/,
    parse: (match, parse) => {
        parse(match[1], 2) // 2nd param is "offset": where match[1] starts within the match (after the leading "__")
    },
}

You can see the code here: https://gist.github.com/jamiebuilds/b653a755b17d5e39ae5f545e251bf08f

In that code, I also found some ways to rely on recursion to reduce the amount of code needed (going back to the parse() method).
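The offset idea can be sketched independently of the gist. The following is a simplified illustration, not the gist's code and not simple-markdown's API: the rule shape, the `rules` list, and the `base` parameter are all assumptions, and the text rule ignores backslash escapes.

```javascript
// Hypothetical rules, loosely modeled on the snippets in this thread.
const rules = [
  {
    type: "strong",
    match: /^\*\*((?:\\[\s\S]|[^\\])+?)\*\*(?!\*)/,
    // The inner content starts 2 characters into the match (after "**").
    parse: (match, parse) => ({ content: parse(match[1], 2) }),
  },
  {
    type: "text",
    // Consume at least one character, stopping before the next "**".
    match: /^[\s\S]+?(?=\*\*|$)/,
    parse: (match) => ({ content: match[0] }),
  },
];

// `base` is the absolute position of `source` within the original input.
function parse(source, base = 0) {
  const nodes = [];
  let pos = 0;
  while (pos < source.length) {
    const rest = source.slice(pos);
    let rule = null;
    let match = null;
    for (const candidate of rules) {
      match = candidate.match.exec(rest);
      if (match) {
        rule = candidate;
        break;
      }
    }
    if (!match) throw new Error("no rule matched at " + (base + pos));

    const start = base + pos;
    // Rules pass an offset relative to their own match; recursion turns it
    // into an absolute position for the nested nodes.
    const nested = (inner, offset) => parse(inner, start + offset);
    nodes.push({
      type: rule.type,
      ...rule.parse(match, nested),
      loc: { start, end: start + match[0].length },
    });
    pos += match[0].length;
  }
  return nodes;
}

const ast = parse("Hello **World**");
console.log(JSON.stringify(ast));
```

For this input the sketch yields a text node at 0 to 6, a strong node at 6 to 15, and a nested text node at 8 to 13, which are the positions proposed at the top of the issue. It relies on each nested string being a contiguous slice of the source, so rules that rewrite their content before recursing would need extra bookkeeping.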

Also, if you do ever want to revisit Discord, we can totally do that. It's been a fantastic place to work for me, and I'm happy to help get you started with that if you'd ever like to.

Please take care of yourself, open source can wait.