Abandon or Modify ElementTree?

waylan commented 9 years ago

The short version:

As part of version 3.0 (see #391) should Python-Markdown perhaps abandon ElementTree for a different document object like Docutils' node tree or use a modified ElementTree for internally representing the Parsed HTML document?

Any and all feedback is welcome.

The long version:

Starting in Python-Markdown version 2.0, internally parsed documents have been represented as ElementTree objects. While this mostly works, there are a few irritations. ElementTree (hereinafter ET) was designed for XML, not HTML and therefore a few of its design choices are less than ideal when working with HTML.

For example, by design, XML does not generally have text and child nodes interspersed like HTML does. While ET provides text and tail attributes on each element, it is not as easy to work with as it would be if the text was contained in child "TextNodes" (much like JavaScript's DOM). Additionally, ET nodes have no knowledge of their parent(s), which can be a problem in certain HTML specific situations (some elements cannot contain other elements as children or grandchildren or great-grandchildren...).
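To make the text/tail irritation concrete, here is a small sketch using only the standard library. It builds `<p>Hello <em>big</em> world</p>` by hand; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# Build <p>Hello <em>big</em> world</p> without any text nodes.
p = ET.Element("p")
p.text = "Hello "            # text before the first child lives on the parent
em = ET.SubElement(p, "em")
em.text = "big"              # text inside <em>
em.tail = " world"           # text *after* </em> is stored on <em>, not on <p>

serialized = ET.tostring(p, encoding="unicode")

# A stock Element keeps no reference to its parent.
has_parent = hasattr(em, "parent")
```

Note that the trailing " world" belongs to `<p>` in the document but is stored on the `<em>` element, and that `em` has no way to find `p` on its own.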

I see two possible workarounds to this: Modify ET or use a different type of object.

Modifying ElementTree

We already have a modified serializer which gives us better HTML output (it is actually a modified HTML serializer from ET) and we already import ET and document that all extensions should import ET from Markdown. Therefore, if we were to change anything (via subclasses, etc) those changes would propagate throughout all extensions without too much change.

In fact, some time ago, I played around with the idea of making ET nodes aware of their parents. While it worked, I quickly abandoned it as I realized that it would not work for cElementTree. However, on further consideration, we don't really need cElementTree (most of the benefits are in a faster XML parser which we don't use).
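A minimal sketch of what a parent-aware node might look like, assuming a plain subclass of the standard library `Element`. The class name `ParentAwareElement` and the `ancestors` helper are hypothetical and are not part of ElementTree or Python-Markdown. Only `append`, `insert` and `remove` are covered here; `extend`, slice assignment and `SubElement` would bypass the bookkeeping.

```python
import xml.etree.ElementTree as ET


class ParentAwareElement(ET.Element):
    """Hypothetical Element subclass that remembers its parent."""

    def __init__(self, tag, attrib=None, **extra):
        super().__init__(tag, attrib or {}, **extra)
        self.parent = None

    def append(self, child):
        super().append(child)
        child.parent = self

    def insert(self, index, child):
        super().insert(index, child)
        child.parent = self

    def remove(self, child):
        super().remove(child)
        child.parent = None

    def ancestors(self):
        """Yield the parent, grandparent, ... up to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


root = ParentAwareElement("div")
link = ParentAwareElement("a")
em = ParentAwareElement("em")
root.append(link)
link.append(em)

# An HTML-specific check such as "am I already inside an <a>?" becomes easy.
inside_link = any(node.tag == "a" for node in em.ancestors())
```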

Interestingly, in Python 3.3 cElementTree is deprecated. What actually happens is that ET defines the Python implementation and then at the bottom of the module, it tries to import the C implementation, which upon success, overrides the Python objects of the same name. What is interesting about this is that the Python implementation of the Element class (ET's node object) is preserved as _Element_Py for external code which needs access to it (as explained in the comments).

I envision a modified ET lib to basically subclass the Python Element object to enforce knowledge of parents for all nodes. Then a TextNode would be created which works essentially like Comments work now:

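from xml.etree.ElementTree import Element

def TextElement(text=None):
    # Like ET's Comment factory: the factory function itself serves as the tag,
    # so a serializer can recognize text nodes by checking the tag.
    element = Element(TextElement)
    element.text = text
    return element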

The serializer would then be updated to properly output TextElements. In fact, at some point, the serializer might even be able to lose knowledge of the text and tail attributes on regular nodes. However, that last bit could wait for all extensions to adopt the new stuff.

In addition to TextElement we could also have RawTextElement and AtomicTextElement. Both would be ignored by the parser (no additional parsing would take place). However, a RawTextElement would be given special treatment by the serializer in that no escaping would take place (raw HTML could be stored inline in the document rather than in a separate store with placeholders in the document), whereas an AtomicTextElement would be serialized like a regular TextElement.
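As a rough sketch of how the three text node types might be told apart by a serializer, assuming the same factory-as-tag trick that ET uses for Comments. The `serialize` function is a toy written for this illustration (it ignores attributes and void elements) and is not Python-Markdown's serializer.

```python
from html import escape
from xml.etree.ElementTree import Element


def _text_factory(name):
    """Create a Comment-style factory whose function object is the tag."""
    def factory(text=None):
        element = Element(factory)
        element.text = text
        return element
    factory.__name__ = name
    return factory


TextElement = _text_factory("TextElement")
AtomicTextElement = _text_factory("AtomicTextElement")  # parser would skip it
RawTextElement = _text_factory("RawTextElement")        # parser would skip it


def serialize(node):
    """Toy serializer: escape normal and atomic text, pass raw text through."""
    if node.tag in (TextElement, AtomicTextElement):
        return escape(node.text or "", quote=False)
    if node.tag is RawTextElement:
        return node.text or ""
    inner = "".join(serialize(child) for child in node)
    return "<{0}>{1}</{0}>".format(node.tag, inner)


p = Element("p")
p.append(TextElement("1 < 2, "))
p.append(AtomicTextElement("a_b_c & "))
p.append(RawTextElement("<b>raw</b>"))
html = serialize(p)
```

Because each sub-string is its own child node, the atomic piece keeps its identity even though it sits between two other pieces of text in the same paragraph.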

The advantage of an AtomicTextElement (over the existing AtomicString) is that a single node could have multiple child text nodes. Today, each node only gets one text attribute. Therefore, when an AtomicString is concatenated with an existing text string, we lose the 'atomic' quality of the sub-string. However, with this change each sub-string can reside in its own separate text node and maintain the 'atomic' quality when necessary.

Using Docutils

Rather than creating our own one-off hacked version of ET, we could instead use an already existing library which gives us all of the same features (and more). Today, the only widely supported and stable library I'm aware of is Docutils' Document Tree. While the Document Tree is described as an XML representation of a document, Docutils provides a Python API to work with the Document Tree which is very similar to the modified ET API I described above (known parents, TextElement, FixedTextElement...). Unfortunately that API is not documented, although the source code is easy enough to follow.

Until recently, I was of the assumption that to implement something that used Docutils, one would need to define a bunch of directives (etc) which more-or-less modify the ReST parser. However, take a look at the Overview of the Docutils Architecture. A parser simply needs to create a node tree. In fact, the base Parser class is only a few simple methods. The entire directives thing is in a separate directory under the ReST Parser only. Theoretically, one could subclass the base Parser class, and build a node tree using whatever parsing method desired and Docutils wouldn't care.

For that matter, Python-Markdown would not have to replicate Docutils' "Parser" API. We could just use the node tree internally. As a plus, this would give us access to all of the built-in and third party Docutils writers (serializers). In other words, we would get all of Docutils' output formats for free.

Additionally, Docutils' node tree also provides for various meta-data to be stored within the node tree. For example, each node can contain the line and column at which its contents were found in the original source document. This provides an easy way for a spellchecker to run against the parser and report misspelled words in the document without first converting it to HTML, among other uses which do not require serialized output.

No, this would not make Python-Markdown suddenly able to be supported by Sphinx. Sphinx is mostly a collection of custom directives built on top of the ReST parser. ReST directives do not make sense in Markdown. However, we could convert Markdown to ReST as many other third party parsers convert various formats to ReST via a ReST writer. There is also at least one third party writer which outputs Markdown from a node tree. By adopting Docutils node tree, Python-Markdown could become part of an ecosystem for converting between all sorts of various document formats (an expandable competitor to Pandoc?).

The downsides to using Docutils are that we are then relying on a third party library (up till now, Python-Markdown has not) and all extensions would absolutely be forced to change to support the new version. It is also possible that we wouldn't be able to use the available HTML writer as the default because of some inherent differences with Markdown and ReST (ReST is much more verbose and we might need to hack the node tree or the writer to get the writer to output correct HTML from a Markdown perspective -- I have not investigated this).

As it stands now, there are various small changes required of extensions between version 2 and 3, but I expect that most extensions would be able to support both without much effort. If we went with Docutils, that would no longer be the case.

Or, maybe this whole thing is a bad idea and we should just continue to use ET as-is.

Any and all feedback is welcome.

mitya57 commented 9 years ago

No, this would not make Python-Markdown suddenly able to be supported by Sphinx. Sphinx is mostly a collection of custom directives built on top of the ReST parser.

Why not? The main feature of Sphinx is not directives, but beautiful themeable output (HTML, LaTeX, etc). I think that if we start using Docutils document tree, it will be possible to add Markdown support to Sphinx.

mitya57 commented 9 years ago

By the way, here is an attempt by someone else to write a Markdown parser for Docutils.

waylan commented 9 years ago

By the way, here is an attempt by someone else to write a Markdown parser for Docutils.

Yes, I was aware of that. However, it does not offer the kind of extension API we do and it is based on a PEG Parser [^1]. With our extension API, support for Docutils could really be something.

Regarding Sphinx, if it turns out to work, great, but that is not at all a goal of mine at this time.

[^1]: While PEG parsers are great, there are a few nuances of the Markdown syntax that they don't handle well and I don't consider them a good fit. Note that the best-known PEG parser for Markdown was written by John MacFarlane (which all others seem to be based on). He has since written other Markdown parsers, including CommonMark (which he also wrote the spec for), and he has chosen not to use a PEG parser for any of them.

facelessuser commented 9 years ago

I guess that is the real question: do the extra Docutils possibilities sway you enough to use it? It sounds like you will have to do some customizing whether you use Docutils or ET, but you would have more control over, say, ET. But I suspect most users would say go Docutils. If you tell them you have an option that gives more potential features and one that does not, I suspect most will pick the former :smile:.

waylan commented 9 years ago

I did some work mapping the Markdown Elements (as they are output in HTML) to Docutils Node Elements. I have made no effort to map anything used by extensions. This is solely the base syntax. This is what I have come up with before doing any testing:

| MD Element | Docutils Element |
| --- | --- |
| `<p>` | paragraph |
| `<h1-6>` | nested section with child title |
| `<blockquote>` | block_quote |
| `<ol>` | enumerated_list (interesting attrs: enumtype, start) |
| `<ul>` | bullet_list |
| `<li>` | list_item |
| `<pre>` | literal_block |
| `<hr>` | transition |
| `<br>` | line_block with child line elements (`<br>` is implied between line elements) |
| `<a>` | reference |
| `<em>` | emphasis |
| `<strong>` | strong |
| `<code>` | literal |
| `<img>` | image |
| References | target |
| Raw HTML | raw |

A few things to note:

Headers (<h1-6>) are weird. In Docutils, you have sections and titles. The first child of a section is the title. Then all non-header elements following the header are children of that section. If a lower-level header occurs, then it is also a child of the first section and any elements after it are children of that section. You only break out of a section when you have a higher level header. You cannot skip header levels either. This does not translate well to standard Markdown. Interestingly, a few popular third-party extensions have replicated this nested section behavior. Apparently some people find it more useful when styling their docs.

Hard line breaks (<br>) are odd as well. The entire block must be contained in a line_block element (which would presumably be a child of a paragraph). The line_block element then contains child line elements. The line break is implied between each line element.

A target element holds data defined anywhere within the document (like link or image references in Markdown). A later transform would then copy the URL stored in the target to the corresponding reference or image element.

Raw HTML should go in a raw element. However, I need to confirm that the HTML writer will not wrap the raw text with any HTML tags.

I also need to confirm that literal_block and literal elements get wrapped with the proper HTML tags (<pre><code> and <code> respectively) by the HTML writer.

Images are interesting as Docutils allows an image to either be an inline element or a block level element, which is supported by the HTML spec. However, in Markdown, images only ever appear as inline elements. Shouldn't be a problem. Just need to be careful.

Everything else should be pretty straightforward.

All Docutils elements can be imported from docutils.nodes.<elementname>.
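The target-and-transform step noted above (reference definitions stored in one place, URLs copied onto the matching links later) can be sketched with plain dictionaries. This is not the Docutils API; the key names `refname` and `refuri` are only loosely borrowed from it, and the flat node list is a simplification.

```python
def resolve_references(nodes):
    """Copy each target's URL onto the references/images that name it.

    `nodes` is a flat list of dicts standing in for a document tree.
    Target nodes are dropped from the result once they have been applied.
    """
    targets = {n["name"]: n["refuri"] for n in nodes if n["type"] == "target"}
    resolved = []
    for node in nodes:
        if node["type"] == "target":
            continue
        if node["type"] in ("reference", "image") and "refname" in node:
            node = dict(node, refuri=targets.get(node["refname"]))
        resolved.append(node)
    return resolved


document = [
    {"type": "reference", "refname": "home", "text": "my site"},
    {"type": "image", "refname": "logo", "alt": "Logo"},
    {"type": "target", "name": "home", "refuri": "http://example.com/"},
    {"type": "target", "name": "logo", "refuri": "http://example.com/logo.png"},
]
result = resolve_references(document)
```

The definitions may appear anywhere in the document, including after the links that use them, which is why the copying has to happen in a separate pass.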

waylan commented 9 years ago

Given the situation with headers (<h1-6>) and Docutils sections, I'm thinking that Docutils is a non-starter. Not only is the internal structure a bunch of nested sections, but the HTML output mirrors that structure with a bunch of nested divs. This is not valid Markdown output. I could create my own element subclasses which emulate Markdown's headers, but then Docutils writers won't work any more, which is the main benefit of using Docutils.

I should note that because each header generates a nested section, Docutils does not allow header levels to be skipped. In fact, it will crash hard on inconsistently nested levels. Markdown should never crash hard on bad input.

Another issue is that there is no hard rule for which characters in the ReST syntax represent which level. It is simply assumed that they appear in the order they are found (the inconsistency comes when you step back up, then down again -- it is assumed that you use the same pattern going back down). Therefore, the first header is always a level 1 header (<h1>). However, in Markdown the levels are explicit in the syntax. If a user starts with ### Header, then that first header in the document must be level 3 (<h3>). Docutils has no mechanism for retaining that info, or at least, the writers have no mechanism for accounting for it.

Given the above, I don't think I'll be pursuing the use of Docutils at this time. However, I think reviewing it has been beneficial. It has made it clearer how I want to modify/subclass ElementTree.

A benefit of using ElementTree is that the changes can be made more incrementally, running tests as I go, which feels much less intimidating.

facelessuser commented 9 years ago

Yeah, I was going to comment that when I saw the wacky header tag stuff that I was thinking it was looking less desirable. But seeing that it can actually break things makes it quite a bit worse considering Python Markdown's goals.

waylan commented 9 years ago

Over the past week I've been slowly putting together an altered ElementTree lib which has ended up re-implementing almost everything (only ElementTree's XPath support is reused, and we don't really use that anyway). It is hardly close to done, but it occurred to me that I was more-or-less re-implementing Beautiful Soup's document object.

The one weird thing about Beautiful Soup is that you can't create a document unless you parse something. So to create an empty document which elements can be added to, you need to parse an empty string. I assume parsing an empty string is not too much of a performance hit (I should probably confirm that), so it's not too big of a deal, just weird.

Once you get past that, the API is very extensive and easy to work with. It is specifically designed for working with HTML and even gives more control than anything I would have custom built myself. Text is represented as child nodes alongside child element nodes. Every node in the document tree knows about its parents, siblings, children, etc. Methods are provided on each node to insert, insert_before, insert_after, append, and the list goes on...

I should note that I would be using version 4 of Beautiful Soup (which breaks compatibility with the more popular version 3, but is necessary to get Python 3 support). Unlike earlier versions, Beautiful Soup 4 is not an HTML Parser. It simply wraps existing third party parsers (lxml, html5lib, and Python's HTMLParser) and provides an easy-to-use API for accessing and manipulating the parsed document. Python-Markdown's use case is manipulating/building an HTML document, so the goals align fairly well.

As the home page states: "Beautiful Soup is licensed under the MIT license, so you can also download the tarball, drop the bs4/ directory into almost any Python application (or into your library path) and start using it immediately." That was not the case with earlier versions. However, there is the issue that the 2to3 tool needs to be run for Python 3, so just copying it into the Markdown lib doesn't make much sense. But it can be listed as a dependency and get installed automatically by the setup script as long as an internet connection is available.

An added plus is that Beautiful Soup comes with its own serializer which is built specifically for HTML (with pretty-printing built in). Although we would lose the ability to distinguish between HTML and XHTML, the only difference in Markdown's syntax is <br> versus <br />. As both are valid in HTML5, I don't think that really matters anymore. The reference implementation (markdown.pl) outputs XHTML and Beautiful Soup outputs <br /> for HTML output, so that seems like a reasonable compromise. We had added the HTML output format before HTML5 was even a thing so users could use Markdown's output in HTML4 documents. If it is a problem, we might be able to use a custom output formatter to address the issue (the formatter keyword will accept a callable in addition to the string names of the built-in formatters).

Any thoughts?

mitya57 commented 9 years ago

Can we change the BeautifulSoup API (i.e. submit a pull request) to make it easier to create an empty document? So that we can simplify our code, at least in the future?

waylan commented 9 years ago

Can we change the BeautifulSoup API (i.e. submit a pull request) to make it easier to create an empty document?

I looked into that (although I have not submitted or requested anything upstream). As I understand it, the code assumes that any BeautifulSoup document was created by a parser, and that assumption is interwoven throughout the code. For example, the serializer checks which parser was used for various branches in its behavior (eg: output <br /> or <br> depending on which parser was used).

In fact, if you create a fragment and try to serialize it, it will crash hard. It needs to be contained in a "document" object, which is a special Node which holds reference to the parser, among other things. When serializing a child, it looks up the tree to the document root for various data to determine how it should behave. I tried creating a subclass of that document root class which statically sets all the moving parts on the document root, but I am still getting weird errors I can't seem to figure out.

waylan commented 9 years ago

Apparently, I didn't make a copy of my attempt to override the default BeautifulSoup document root class (with one that skips the parsing step and sets defaults). However, I did do this. As I explain in the comments:

Just maybe, the way to parse HTML within a Markdown document is to run the document through an HTML Parser first. Some parsers, like the HTMLParser included in the Python Standard lib will properly parse the plain text not wrapped in HTML tags as plain text and simply return it unaltered. The problem is with Markdown's autolinks (<foo@bar.com> and <http://example.com>).

However, as of Python 2.7.3 and 3.2.2, the HTMLParser can now handle invalid HTML without crashing hard. Below is a subclass of Beautiful Soup's HTML Parser which accounts for those autolinks and passes them through as text. The crazy idea is that those text nodes could then be parsed by Markdown and the Markdown Parser would not need to reimplement a lousy HTML parser with regex.

Perhaps it is just a crazy idea, but it might actually work. If so, the first step in parsing a document would be to pass it to the HTML parser. All of the non-HTML parts would simply be text nodes. Then, loop through those text nodes and convert them to the appropriate block level nodes, then inline nodes, etc.
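A rough sketch of that first step, using the standard library's `HTMLParser` directly rather than the Beautiful Soup subclass mentioned above. The `EventCollector` class is hypothetical, and the sample input deliberately contains no autolinks, since handling those is exactly the part that needs extra work.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record tags and text in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Markdown source that is not inside a tag arrives here unaltered.
        self.events.append(("text", data))


source = "Some *Markdown* text\n\n<div>raw html</div>\n\nMore text"
collector = EventCollector()
collector.feed(source)
collector.close()  # flush any text still buffered by the parser

text = "".join(data for kind, data in collector.events if kind == "text")
tags = [(kind, data) for kind, data in collector.events if kind != "text"]
```

The text events outside the `<div>` are the pieces a Markdown block parser would then be run over.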

If we went that way, the no-need-for-a-HTML-parser would no longer exist. What do you think? A bad idea? or brilliant?

waylan commented 9 years ago

If anyone is interested, my (failed) attempt to create a BeautifulSoup document root subclass is here.

mitya57 commented 9 years ago

Perhaps it is just a crazy idea, but it might actually work. If so, the first step in parsing a document would be to pass it to the HTML parser. All of the non-HTML parts would simply be text nodes. Then, loop through those text nodes and convert them to the appropriate block level nodes, then inline nodes, etc.

This looks like a giant hack, doesn't it?

waylan commented 9 years ago

This looks like a giant hack, doesn't it?

Yeah, you're right. Moving on...

mitya57 commented 9 years ago

Well, it's up to you to make the decision. I didn't even look at the code, just read your summary, so my opinion shouldn't really matter…

lehmannro commented 9 years ago

I am not sure I understood the rationale for dismissing Docutils so quickly. It seems to me skipping header levels is pretty easy, provided that you put in the right level of sections. I have published a Gist to that effect.

The advantages you get are immense — you get a lot of interesting writers (which you also don't have to maintain, so you have that going for you) from both the Docutils and Sphinx ecosystems, the very intriguing Transform API allowing you to modify doctrees after the fact (adding figure numbering, translations, whatever), and a clean contract that this tree format was indeed made for documents, not XML.

waylan commented 9 years ago

@lehmannro the issue is that the HTML output is not valid according to the Markdown rules. According to the Markdown rules, ## Header must result in <h2>Header</h2>. If we use Docutils, we can't guarantee that that is what we will get. Additionally, because Docutils assumes each header starts a new section, it wraps each section in a div. Again, according to the Markdown rules, those divs break the rules. Yes, I understand that without any css rules, the divs have no effect on how the browser displays the HTML, but that is beside the point.

In fact, if you look at the test frameworks for the various Markdown implementations (including the reference implementation), they all contain a bunch of Markdown files with matching HTML files. The tests are run by passing the Markdown file through the parser and comparing the output to the HTML file. Even one character of difference results in a test failure. The Markdown syntax is expected to produce very specific HTML. Any significant variation from that specific HTML is an error. Markdown is very much tied to HTML. See here for why this matters.

That is very different from the approach taken by Docutils. While parts of Docutils closely mirror HTML, that is coincidental. AFAICT, from the get-go, Docutils was designed for representing a document structure regardless of the output format. Therefore, even the ReST to HTML conversion is not always the most obvious. It simply does not give us the option to output the HTML that Markdown users expect/require.

While I agree that a Markdown to Docutils tool would be very useful, it does not serve the Markdown community at large very well. As we are the leading (most downloaded from PyPI at least) implementation of Markdown in Python, unfortunately I don't think we can adopt the use of Docutils for the reasons explained above. That said, a less-mainstream Markdown implementation which supports Docutils and is upfront about the fact that it does not output the expected HTML certainly has a place in the world. As mentioned previously, such an implementation already exists and nothing is stopping anyone from creating others.

lehmannro commented 9 years ago

I'm not sure the document you linked and your statement are consistent. Look at any of the examples under "What are some examples of interesting divergences between implementations?" (eg. ATX headers with escapes): they ALL have different output. From the description of the document, its purpose is to “promote discussion of how *and whether* certain vague aspects of the markdown spec should be clarified.” I don't think it's trying to publicly shame parsers which do not adhere to the standard (which, and please correct me if I'm wrong, is very loose as illustrated by the document.)

While I cannot stop you from implementing your favorite Markdown parser in any way you want, a statement such as “If we use Docutils, we can't guarantee that that is what we will get.” is simply FUD. There are strict (and, I would claim, stricter than in the Markdown spec) guarantees as to what Docutils does and doesn't produce. The first section's title is always a <h1>, the second gets <h2> and so on.

I don't have any experience with the “Markdown test frameworks” but if they test something that's not in the spec they are plain and simple wrong. (That being said, if stripping extraneous <div>s is a requirement for you, you could either look at the new HTML5 writer or at Transforms. I think dismissing Docutils after a —as it seems— very shallow investigation is unfair and misinforming others who would like to learn from the “leading implementation of Markdown in Python.”)

waylan commented 9 years ago

The first section's title is always a <h1>, the second gets <h2> and so on.

There is the problem. If I start a Markdown document with ### Header, then the first header must be an <h3>. This is not possible with Docutils. Docutils forces the first title to be an <h1>. And there is no way to retain that info and alter the behavior of the HTML serializer/writer (if I'm wrong, please enlighten me). This in itself is a sufficient reason to reject Docutils for Markdown.

Regarding my linking to the Babelmark2 FAQ, the point is that implementations should not differ in their output. Yes, unfortunately many do. However, we should not be creating more differences which would only make matters worse. Adopting Docutils would do just that.

waylan commented 9 years ago

if stripping extraneous <div>s is a requirement for you, you could either look at the new HTML5 writer or at Transforms.

Sorry, I missed this point the first time, so I'll address it now. AFAICT, Transforms run on the Docutils document object, which does not yet contain any <div>s (the divs are added by the writer). Therefore, Transforms would have no effect. The other option would be to create our own writer. While this is possible, I'm not that motivated to do so. Besides, given my last comment, how would the custom writer know which level to assign to each section's title as that data is not contained in the document object and is not dependent on the order or nesting of the sections in the document?

To make this work with Markdown, the way I would want to do it would be to not use Docutils' sections, but then how would I represent headers? According to the Docutils document spec, a title must be the first child of a section and the title's level must be determined by the nested level of the section. Therefore, the spec for the Docutils document object is in direct opposition to the Markdown rules.

As I can't use titles without sections and sections don't work for Markdown, I considered creating my own node subclass which would represent a heading (<h1-6>). Then, the Markdown parser would insert these objects into the document tree at parse time and the proper level (according to the Markdown rules) could be preserved. However, none of the existing writers would be able to use such a document and the benefits of using Docutils would be lost.

I'm open to alternate solutions here. But personally, I'm not seeing them.

lehmannro commented 9 years ago

Did you have a look at https://gist.github.com/lehmannro/2d2127b7c839282a673d which I linked earlier? It produces a <h2> without any <h1> just fine. Sure, you would have to do some bookkeeping for how many more sections you need to nest to get the user's requested header level, but that's far from impossible.

waylan commented 9 years ago

Did you have a look at https://gist.github.com/lehmannro/2d2127b7c839282a673d which I linked earlier? It produces a <h2> without any <h1> just fine. Sure, you would have to do some bookkeeping for how many more sections you need to nest to get the user's requested header level, but that's far from impossible.

In my opinion, the work (and headaches) to create and maintain code which implements that hack far outweigh the benefits. Give me an example that does not use sections (Markdown has no concept of sections), and perhaps I'll reconsider. In other words, Docutils is not an HTML document object, which is what I need.

My personal opinion is that given the very close mapping between Markdown and HTML, the best way to get from Markdown to Docutils is to do Markdown => HTML => Docutils. An HTML to Docutils tool can exist separately from Markdown and serve a much wider audience but also provide a decent way to get from Markdown to Docutils.

In fact, whatever HTML document object library Markdown uses could also have a [document object] to Docutils tool which would eliminate the need to first serialize the Markdown document and then parse the HTML into another document object. Think ElementTree2Docutils or BeautifulSoup2Docutils. Personally, I'm surprised those tools don't exist already. They could work great for converting Markdown to all of Docutils supported output formats and would serve a much broader audience as well. In fact, BeautifulSoup2Docutils would be immensely useful. You could use it to parse HTML using your choice of any of the decent Python HTML Parsers (as per BeautifulSoup's API) and could output to any of Docutils supported output formats. At that point, any markup language's lack of explicit support for Docutils would only be an optimization issue (skipping the HTML serialization and subsequent parsing would obviously be an optimization -- but offers no additional advantages that I can see).

waylan commented 9 years ago

To be clear, this is a valid Markdown document:

##### Level 5

# Level 1

### Level 3

## Level 2

#### Level 4

# Level 1

###### Level 6

I could keep going, but the point should be obvious. Keeping track of nested section levels would be a real headache when building a Docutils document. Unless someone can point me to a way to not use Docutils sections, consider the subject closed for discussion.
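To put numbers on that, here is a small sketch which assumes, as described earlier in this thread, that a writer derives the heading level purely from how deeply a section is nested. The `nesting_depths` function is hypothetical and only models that assumption; it does not call Docutils, and it does not include the padding with extra empty sections that the Gist approach relies on.

```python
def nesting_depths(levels):
    """Return the section nesting depth each header would end up at.

    A header closes every open section of the same or a deeper level,
    then opens a new section inside whatever remains open.
    """
    open_sections = []
    depths = []
    for level in levels:
        while open_sections and open_sections[-1] >= level:
            open_sections.pop()
        open_sections.append(level)
        depths.append(len(open_sections))
    return depths


# Header levels from the Markdown document above, in order.
requested = [5, 1, 3, 2, 4, 1, 6]
derived = nesting_depths(requested)

# Pairs of (level the author wrote, level implied by nesting alone).
mismatches = [(want, got) for want, got in zip(requested, derived) if want != got]
```

Under that assumption four of the seven headers come out at the wrong level, so the explicit level would have to be tracked separately for every header.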

waylan commented 9 years ago

For completeness, I just stumbled on this project: AdvancedHTMLParser. It looks interesting, but its history is limited and I know nothing of its stability. The interesting part is the AdvancedTag object, which is both a node and self printing (using innerHTML and outerHTML). The lib more-or-less mirrors the JavaScript DOM, which may or may not be a good idea.

waylan commented 8 years ago

Just a quick update. My work on an HTML Node toolkit has stabilized. I'd do a release, except that I haven't actually used it for any real work yet. In any event, it's ready to use.

However, I'm not sure I want to use it in Markdown. Now I'm thinking a simpler node structure would be preferred. Perhaps only the nodes represented in the Markdown text. For example, Markdown only has list items, but no parent ul or ol is actually represented in the document (they are only implied). So perhaps the node tree should reflect that. Each list item node could retain which type it is (alphanumeric (and value), dash, asterisk, plus...), but have no parent list node. Then, when rendering (or perhaps in some intermediary transform) the specifics could be worked out (parent node, list type, item value, etc). That should give much more control regarding the various ways that people prefer to have lists rendered and doesn't actually require any modification of the parser; only the rendering (or transform) step would need to be modified.
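A minimal sketch of that idea, assuming list items are stored as plain dictionaries that remember their marker and have no parent list node. The node layout and the `render_lists` function are hypothetical; the grouping rule (consecutive items of the same kind form one list) is just one possible rendering policy, and no HTML escaping is done.

```python
from itertools import groupby

# Parsed list items with no parent <ul>/<ol> node, only their own marker.
items = [
    {"type": "list_item", "marker": "*", "text": "apples"},
    {"type": "list_item", "marker": "*", "text": "pears"},
    {"type": "list_item", "marker": "1.", "text": "first"},
    {"type": "list_item", "marker": "2.", "text": "second"},
]


def list_tag(item):
    """Decide at render time which parent list an item implies."""
    return "ol" if item["marker"][0].isdigit() else "ul"


def render_lists(nodes):
    """Wrap each run of same-kind items in the list element it implies."""
    parts = []
    for tag, group in groupby(nodes, key=list_tag):
        body = "".join("<li>{}</li>".format(node["text"]) for node in group)
        parts.append("<{0}>{1}</{0}>".format(tag, body))
    return "\n".join(parts)


html = render_lists(items)
```

Changing how lists are rendered (for example, splitting on a change from `*` to `-`) would only mean swapping `list_tag`, leaving the parser untouched.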

ghost commented 8 years ago

What about using a JSON-like structure? Would it be faster to process? It would probably also make it easier to create writers for new output formats. I think pandoc does something like this, at least looking at this page of their docs.

waylan commented 8 years ago

@andya9 I'm thinking of something very similar to that. It would be more performant to use native Python objects, but yes, something very JSON-like. Perhaps a string representation of the document tree would even be in JSON.
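For illustration only, a JSON-like tree built from native Python objects might look something like the following. The `Node` class, its field names and the position format are all assumptions made for this sketch, not a design that has been settled on.

```python
import json


class Node:
    """Hypothetical AST node: a type, optional value/position, and children."""

    def __init__(self, type, children=None, value=None, position=None):
        self.type = type
        self.children = list(children or [])
        self.value = value
        self.position = position  # e.g. {"line": 1, "column": 1}

    def to_dict(self):
        data = {"type": self.type}
        if self.value is not None:
            data["value"] = self.value
        if self.position is not None:
            data["position"] = self.position
        if self.children:
            data["children"] = [child.to_dict() for child in self.children]
        return data

    def __str__(self):
        # The string representation of the tree is simply JSON.
        return json.dumps(self.to_dict(), indent=2)


doc = Node("document", children=[
    Node("paragraph", position={"line": 1, "column": 1}, children=[
        Node("text", value="Hello "),
        Node("emphasis", children=[Node("text", value="world")]),
    ]),
])
as_json = str(doc)
```

Working with `Node` objects avoids repeated dictionary lookups during parsing, while `str(doc)` still gives a JSON dump that other tools could consume.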

ghost commented 8 years ago

I’m glad to hear that! :smiley:

waylan commented 8 years ago

Also thanks for the link to Pandoc's documentation. I have looked but never found the definition of their internal document structure before. Apparently it was more recently broken out into a separate package and they have a complete definition of the structure. Could be helpful.

ghost commented 8 years ago

I’m really glad I could be useful!

ghost commented 8 years ago

There’s also remarkjs allowing json output

waylan commented 8 years ago

remarkjs is very cool. See mdast for a definition of the Markdown Abstract Syntax Tree which it uses.

ghost commented 8 years ago

Cool indeed! Thank you for the link, I’ll take a good look

waylan commented 8 years ago

Here is a simple JSON based AST in Python I threw together.

ghost commented 8 years ago

Beautiful! It should make the code much simpler, am I right? We don’t need an ordereddict anymore, if we encode line+column’s start and end inside each element object... If I understand correctly, now the parser will just need to get Content + Position (start, end; this is the trickiest probably) + Type, your Node class will use them to build a dict object, and we can immediately use it to produce our output (in any format we want!)?

ghost commented 8 years ago

I’m not skilled enough to deal with the preprocessing part, but I’d be glad to help with transforming the ast into output (dict object > html element + extensions)

daniele-niero commented 6 years ago

Just wanted to say that I support the decision to leave out Docutils. Personally I like simple, straightforward tools that stick to the specific problem at hand, even if this means fewer features. I've found Docutils too convoluted, and that is probably the reason why I hate Sphinx. Just my personal point of view.

waylan commented 6 years ago

We have decided to defer implementing this until a future release. Therefore I am removing this from the 3.0 milestone.

The reason is that our most important asset is the rich collection of extensions (both first and third party). Making a change of this magnitude would require every third party extension to undergo a complete refactor. At this time the costs outweigh the gains. However, we will re-evaluate again in the future.

mthuurne commented 5 years ago

When considering a different AST, it would be useful to revisit #215. I think it's very useful for an application to be able to get access to the AST, but the fact that currently essential post-processing happens after serialization prevents that.

Is it necessary for the post-processing to happen on the serialized output or could it, in theory, be done on the AST instead? I guess that with ElementTree as the AST, post-processing on the AST would mean you'd have to parse inline HTML and insert that into the tree as well. This would mean that HTML errors would have to be handled (escalated or repaired) instead of forwarded to the output. In my opinion that's a positive, because catching errors early is generally a good thing, but it might not be the Markdown way.

But if you'd go for a different AST implementation, you could have inline HTML as raw HTML nodes inside the AST and still post-process those textually, while post-processing all generated HTML in the AST instead.

Python-Markdown / markdown

Abandon or Modify ElementTree? #420