Use DOMDocument for html parsing?

donquixote commented 10 years ago

Disclaimer: This question is mostly out of selfish curiosity.

Background: I was trying to get the best out of a number of Markdown-related Drupal modules, and ended up writing my own version of the {.className#id} mechanic. Then I went around looking at different Markdown implementations in PHP, and found this one, which is probably the best currently available (but not used in the Drupal markdown module yet).

Looking at the main Markdown class, I have three observations:

It is a clear improvement over older implementations.
The class is quite big (though smaller than other things I've seen). For my taste, further splitting it up would be a good idea. But I did not look close enough to decide if this is possible.
A lot of code is there to deal with HTML parsing. The class would be a lot smaller without that.

So, here is the idea I had. Probably you already thought about this, and decided that it won't work.

Parse the text with DOMDocument.
Walk through the DOM tree and look for text nodes. Process the text nodes for markdown syntax.
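
A minimal sketch of these two steps, under stated assumptions: it is written in Python with `xml.dom.minidom` standing in for PHP's DOMDocument, it only accepts well-formed markup, and the `render_inline` and `render` helpers with their two toy rules (`**strong**`, `*em*`) are hypothetical and not taken from this library.

```python
import re
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr


def render_inline(text):
    """Apply two toy inline rules to the content of a single text node."""
    text = escape(text)  # escape first so the tags added below survive
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


def render(node):
    """Walk the DOM tree; only text nodes are processed as Markdown."""
    if node.nodeType == node.TEXT_NODE:
        return render_inline(node.data)
    inner = "".join(render(child) for child in node.childNodes)
    if node.nodeType == node.ELEMENT_NODE:
        # Existing HTML elements are passed through unchanged.
        attrs = "".join(
            f" {name}={quoteattr(value)}"
            for name, value in node.attributes.items()
        )
        return f"<{node.tagName}{attrs}>{inner}</{node.tagName}>"
    return inner  # document node: just return the rendered children


source = '<p>Some *emphasis* and <span class="note">a **strong** word</span>.</p>'
html = render(minidom.parseString(source))
print(html)
```

Because each text node is handled in isolation, this sketch cannot recognize a construct that starts in one text node and ends in another.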

Some Markdown constructs need to look at more than one text node, e.g. bullet lists where one of the list items contains an HTML span. However, all bullets of a bullet list can be expected to be on the same level of the DOM tree.

Maybe some special handling needs to be done for HTML characters that are backslash-escaped.

I do not claim that this is going to be easier. In fact, I have no idea. I am only asking if there is a good reason not to do this. I also don't know yet what I will do with this information. Maybe some day I will contribute.

samdark commented 10 years ago

While it may save a significant amount of code, DOMDocument has proven to be neither very stable nor very fast.

cebe commented 10 years ago

The class is quite big (though smaller than other things I've seen). For my taste, further splitting it up would be a good idea. But I did not look close enough to decide if this is possible.

I plan to use traits to split out some parts that are reused in other flavors. I have only done this for table support so far. In general it all belongs to the Markdown flavor, so it has to be in the class. Splitting it up is only useful if there is a real benefit.

About DOMTree:

Are you suggesting to turn everything into HTML and use the DOM tree to parse sub-elements? Or to use DOM only for the HTML parts in the Markdown? The first does not work, as this library can also be used to transform Markdown into other languages, LaTeX for example. See https://github.com/cebe/markdown-latex

Using DOM to improve native HTML support could be an option. HTML support is not really good right now. I will check that out.

cebe commented 6 years ago

Since this issue was created, the library has got an abstract syntax tree. It does not use DOM, because the output is not only HTML. Nothing to do here as far as I can see.

cebe / markdown

Use DOMDocument for html parsing? #52