TheCedarPrince / NoteMate.jl

Tools for working with your own knowledge base
MIT License

Using `Pandoc.jl` for AST modifications #33

Open kdheepak opened 12 months ago

kdheepak commented 12 months ago

Awesome job with the project! I really like having my markdown files in Pandoc-compatible syntax.

Shameless plug, but if you use Pandoc.jl you can:

  1. avoid having to maintain a Markdown parser
  2. get support for any file format that Pandoc supports (rst, org, etc)
  3. write Pandoc extensions or filters (or transformations) in Julia

For completeness' sake, there's also https://github.com/JuliaDocs/MarkdownAST.jl, which is what Documenter.jl uses. It would be nice to integrate MarkdownAST.jl and Pandoc.jl.

This will allow users to experiment with Franklin, Quarto and Documenter for different purposes seamlessly.

TheCedarPrince commented 12 months ago

Thanks for the kind words, @kdheepak!

I quite like the looks of Pandoc.jl but wasn’t familiar with it. I should say, the Markdown parser for NoteMate isn’t so much a “full” parser in the sense of, say, MarkdownAST. Instead, it strictly parses Markdown notes that follow the Open Knowledge Model.

My question back to you: where do you think this could be integrated into the package? Here’s the file that contains the Markdown parsing right now: https://github.com/TheCedarPrince/NoteMate.jl/blob/main/src/markdown/parser.jl

Separately, I’ve been really wanting to use @mortenpi’s MarkdownAST package; I think it looks fantastic! KD, where do you think that Pandoc fits alongside MarkdownAST?

P.S. Morten, this is the package that I’ve been talking about for months to you in intermittent messages! I’d be curious if you have any thoughts here if you have time. Thanks!

kdheepak commented 12 months ago

I think there are a few ways you can use Pandoc.jl.

One way is that you can use it to get structured data from the Markdown file:

[screenshot]

So you could use the above code instead of regex like this:

https://github.com/TheCedarPrince/NoteMate.jl/blob/bfd1737e980edccb0c5f0ce6eb028fce62bfb53d/src/markdown/parser.jl#L22-L25

(I'm planning to add a `walk` function to traverse the Pandoc AST so it'll be easier to extract specific types, so the code in the screenshot above is just an example.)

You can do something similar for links, metadata etc.

Another thing that Pandoc.jl allows is converting from one file format to another. I see you have another issue about going back to the source. You could potentially use this package to write back out to a NoteMate specific format.
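
As a rough illustration of the format-conversion idea, here is a minimal Julia sketch that only builds a `pandoc` command line. It assumes the `pandoc` executable is on the PATH and does not use Pandoc.jl's own API; `pandoc_cmd` and the file names are hypothetical, not something from either package.

```julia
# Hypothetical helper: build (but do not run) a pandoc invocation.
# --from, --to and --output are standard pandoc CLI options.
function pandoc_cmd(input::AbstractString; from::AbstractString="markdown",
                    to::AbstractString="json", output=nothing)
    cmd = `pandoc --from $from --to $to $input`
    # Interpolating a Cmd into another Cmd splices in its arguments.
    return output === nothing ? cmd : `$cmd --output $output`
end

# Markdown -> Pandoc JSON on stdout.
to_json = pandoc_cmd("note.md")

# Markdown -> reStructuredText written to a file.
to_rst = pandoc_cmd("note.md"; to="rst", output="note.rst")

# To actually run a conversion (requires pandoc to be installed):
# json_text = read(to_json, String)
```

Building the `Cmd` separately from running it keeps the sketch usable on machines without pandoc installed.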

> Separately, I’ve been really wanting to use @mortenpi’s MarkdownAST package; I think it looks fantastic! KD, where do you think that Pandoc fits alongside MarkdownAST?

I'm sure you are aware of this, but just so we are on the same page: pandoc can convert any input file (Markdown, reStructuredText, Org mode files, etc.) to JSON. Here's a screenshot of what that looks like:

[screenshot]

Pandoc.jl is a representation of Pandoc's JSON schema in Julia structs.
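
To make the JSON shape concrete, the sketch below hand-writes a tiny document in the `t`/`c` layout that `pandoc -t json` emits and collects its citation ids with a plain recursive walk. The document contents and the `collect_nodes!` helper are made up for illustration; Pandoc.jl wraps the same schema in typed structs, and its API is not shown here.

```julia
# Hand-written stand-in for `pandoc -t json` output (illustrative values only).
cite = Dict{String,Any}(
    "t" => "Cite",
    "c" => Any[
        # First element: the list of citations attached to this node.
        Any[Dict{String,Any}(
            "citationId" => "doe2020",
            "citationPrefix" => Any[],
            "citationSuffix" => Any[],
            "citationMode" => Dict{String,Any}("t" => "NormalCitation"),
            "citationNoteNum" => 1,
            "citationHash" => 0,
        )],
        # Second element: the inline text as written in the source.
        Any[Dict{String,Any}("t" => "Str", "c" => "[@doe2020]")],
    ],
)

doc = Dict{String,Any}(
    "pandoc-api-version" => Any[1, 23, 1],
    "meta" => Dict{String,Any}(),
    "blocks" => Any[
        Dict{String,Any}("t" => "Para", "c" => Any[
            Dict{String,Any}("t" => "Str", "c" => "See"),
            Dict{String,Any}("t" => "Space"),
            cite,
        ]),
    ],
)

# Hypothetical helper: recursively gather every node whose tag equals `tag`.
function collect_nodes!(found, x, tag)
    if x isa AbstractDict
        get(x, "t", nothing) == tag && push!(found, x)
        for v in values(x)
            collect_nodes!(found, v, tag)
        end
    elseif x isa AbstractVector
        for v in x
            collect_nodes!(found, v, tag)
        end
    end
    return found
end

cites = collect_nodes!(Any[], doc, "Cite")
ids = [c["citationId"] for node in cites for c in node["c"][1]]
```

With real input, `doc` would come from parsing pandoc's JSON output with a JSON package instead of being written by hand.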

My understanding is that MarkdownAST.jl is for dealing with Markdown files in general, and was built to make it easier to parse Documenter.jl Markdown files or Julia Markdown documentation; I'm sure @mortenpi will be able to add more. I only learnt about MarkdownAST.jl during JuliaCon, so I haven't looked into it much.

There are a lot of overlapping structs between Pandoc.jl and MarkdownAST.jl. Ideally, we'd be able to convert Pandoc JSON to MarkdownAST, and then we'd be able to use pandoc files for Documenter.jl or anywhere else that MarkdownAST is used. I haven't explored this enough. I also don't know enough about the different kinds of structs required to represent Markdown files or other files, but Pandoc's JSON schema has been around for almost a decade, so I tend to use it for any Markdown-related parsing.

mortenpi commented 12 months ago

> There are a lot of overlapping structs between Pandoc.jl and MarkdownAST.jl. Ideally, we'd be able to convert Pandoc JSON to MarkdownAST, and then we'd be able to use pandoc files for Documenter.jl or anywhere else that MarkdownAST is used.

This seems like the correct abstraction here. MarkdownAST is about representing documents as Julia data structures, working with them on a data structure level, and communicating them between different packages. So it needs different parsers like Markdown or CommonMark or Pandoc to actually read (and write) the text files.

For NoteMate, I also agree that going through an AST (Pandoc or MarkdownAST), rather than maintaining custom regexes etc., should make the code simpler and hopefully easier to maintain. It does mean, though, that the parser needs to support whatever you're trying to find in the Markdown files (and I am not sure to what extent it does).

@kdheepak Sorry, I have not had time to dive into Pandoc.jl yet. But do you think it could maybe take advantage of the MarkdownAST Node type as the basis of the AST? Pandoc seems to have more types of nodes, so it would need its own elements, but I assume the basic AST tree is essentially the same?

> (I'm planning to add a `walk` function to traverse the Pandoc AST so it'll be easier to extract specific types, so the code in the screenshot above is just an example.)

Would an AbstractTrees interface, instead of a custom walk function, work well enough here? With MarkdownAST, the citation finding example might look something like this:

```julia
import MarkdownAST, AbstractTrees
md = ... # MarkdownAST tree from some parser
for node in AbstractTrees.Leaves(md)
    # Pandoc.Cite here I assume is a custom <: MarkdownAST.AbstractInline defined by Pandoc.jl,
    # extending the MarkdownAST element set. I also assume that it can't have child nodes.
    node.element isa Pandoc.Cite || continue
    do_stuff_on_citations!(node)
end
```

kdheepak commented 12 months ago

> But do you think it could maybe take advantage of the MarkdownAST Node type as the basis of the AST? Pandoc seems to have more types of nodes, so it would need its own elements, but I assume the basic AST tree is essentially the same?

I didn't think about that before but I think that could work! I'll look into it!

> Would an AbstractTrees interface, instead of a custom walk function, work well enough here?

Neat idea! I'm going to do that instead! Thanks for the suggestion.
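
For reference, a minimal sketch of what opting into AbstractTrees can look like for a custom node type. `ToyNode` is a hypothetical stand-in, not the actual Pandoc.jl or MarkdownAST type; the only thing the sketch relies on is a method for `AbstractTrees.children`.

```julia
import AbstractTrees

# Hypothetical stand-in for an AST node; not a Pandoc.jl or MarkdownAST type.
struct ToyNode
    kind::Symbol
    text::String
    children::Vector{ToyNode}
end
# Convenience constructor for nodes without children.
ToyNode(kind::Symbol, text::AbstractString="") = ToyNode(kind, text, ToyNode[])

# Defining `children` is what lets the generic tree iterators traverse the type.
AbstractTrees.children(node::ToyNode) = node.children

tree = ToyNode(:document, "", [
    ToyNode(:para, "", [
        ToyNode(:str, "See"),
        ToyNode(:cite, "doe2020"),
    ]),
    ToyNode(:para, "", [
        ToyNode(:cite, "roe2021"),
    ]),
])

# Pre-order traversal visits every node, so no custom `walk` is needed.
citations = [n.text for n in AbstractTrees.PreOrderDFS(tree) if n.kind === :cite]

# `Leaves` restricts the traversal to nodes without children.
leaf_kinds = [n.kind for n in AbstractTrees.Leaves(tree)]
```

The same loops would work unchanged on any other type that defines `AbstractTrees.children`, which is the appeal over a package-specific walk function.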

TheCedarPrince commented 12 months ago

> I think there are a few ways you can use Pandoc.jl.
>
> One way is that you can use it to get structured data from the Markdown file:
>
> [screenshot]
>
> So you could use the above code instead of regex like this:
>
> https://github.com/TheCedarPrince/NoteMate.jl/blob/bfd1737e980edccb0c5f0ce6eb028fce62bfb53d/src/markdown/parser.jl#L22-L25
>
> (I'm planning to add a `walk` function to traverse the Pandoc AST so it'll be easier to extract specific types, so the code in the screenshot above is just an example.)
>
> You can do something similar for links, metadata etc.
>
> Another thing that Pandoc.jl allows is converting from one file format to another. I see you have another issue about going back to the source. You could potentially use this package to write back out to a NoteMate specific format.

Oh wow, this is indeed so much nicer. I could probably refactor away much of the package's Markdown parser. I see how this connects to your following note:

> get support for any file format that Pandoc supports (rst, org, etc)

I'd love to get support for all those file types right off the bat! Plus, bidirectional conversions would be awesome, à la issue #8.


> For NoteMate, I also agree that going through an AST (Pandoc or MarkdownAST), rather than maintaining custom regexes etc., should make the code simpler and hopefully easier to maintain. It does mean, though, that the parser needs to support whatever you're trying to find in the Markdown files (and I am not sure to what extent it does).

Thanks for commenting on this, @mortenpi! I agree: I can't replace everything with either Pandoc.jl or MarkdownAST.jl, but I can worry less about the low-level extraction in NoteMate by using these tools. I'll probably try out Pandoc.jl first as a replacement.

As a proof of concept, I am curious whether we could use Pandoc.jl and MarkdownAST.jl together within NoteMate.jl to explore compositions between the two packages. Markdown has been my priority with NoteMate so far, and since I am going to try replacing the package's "basement" functionality anyway, I could try out compositions along the way. What do you think, @kdheepak and @mortenpi?

kdheepak commented 12 months ago

If I had to guess, packages like NoteMate.jl can get away with just using Pandoc.jl, and Pandoc.jl will have to integrate with MarkdownAST.jl the way @mortenpi mentioned above.

mortenpi commented 12 months ago

> and Pandoc.jl will have to integrate with MarkdownAST.jl the way @mortenpi mentioned above.

A very first draft here could be a simple set of conversion functions from Pandoc.jl's AST to MarkdownAST, something like: https://github.com/MichaelHatherly/CommonMark.jl/pull/56
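
A hedged sketch of what such conversion functions might look like, working from Pandoc's raw JSON layout (as plain `Dict`s) rather than from Pandoc.jl's structs, whose field names are not assumed here. `to_markdownast` and `with_children` are hypothetical names, the input document is made up, and only a handful of node types are handled.

```julia
import MarkdownAST

# Hypothetical helper: wrap `element` in a Node and attach converted children.
function with_children(element, parts)
    node = MarkdownAST.Node(element)
    for part in parts
        push!(node.children, to_markdownast(part))
    end
    return node
end

# Hypothetical converter for a small subset of Pandoc's JSON node types.
function to_markdownast(x::AbstractDict)
    t = x["t"]
    if t == "Str"
        return MarkdownAST.Node(MarkdownAST.Text(x["c"]))
    elseif t == "Space"
        return MarkdownAST.Node(MarkdownAST.Text(" "))
    elseif t == "Emph"
        return with_children(MarkdownAST.Emph(), x["c"])
    elseif t == "Strong"
        return with_children(MarkdownAST.Strong(), x["c"])
    elseif t == "Para"
        return with_children(MarkdownAST.Paragraph(), x["c"])
    elseif t == "Header"
        # Pandoc stores headers as [level, attributes, inlines].
        level, _attr, inlines = x["c"]
        return with_children(MarkdownAST.Heading(level), inlines)
    else
        throw(ArgumentError("unsupported Pandoc node type: $t"))
    end
end

# Hand-written blocks in Pandoc's JSON layout (illustrative only).
blocks = Any[
    Dict{String,Any}("t" => "Header", "c" => Any[
        1, Any["title", Any[], Any[]],
        Any[Dict{String,Any}("t" => "Str", "c" => "Title")],
    ]),
    Dict{String,Any}("t" => "Para", "c" => Any[
        Dict{String,Any}("t" => "Str", "c" => "Hello"),
        Dict{String,Any}("t" => "Space"),
        Dict{String,Any}("t" => "Emph", "c" => Any[
            Dict{String,Any}("t" => "Str", "c" => "world"),
        ]),
    ]),
]

root = with_children(MarkdownAST.Document(), blocks)
```

Pandoc-only nodes such as `Cite` have no MarkdownAST counterpart in this sketch, which is where custom elements, as discussed earlier in the thread, would come in.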