Python-Markdown / markdown

A Python implementation of John Gruber’s Markdown with Extension support.
https://python-markdown.github.io/
BSD 3-Clause "New" or "Revised" License

Extension md_in_html does not recognize tags with hyphens #1246

Open igordsm opened 2 years ago

igordsm commented 2 years ago

Web components are custom HTML components that are required to have - in their names. This breaks the current HTML handling because tags with such names are not recognized. IMHO they should be treated the same as <div> ("block" elements, if I'm not mistaken).

The following was tested in current main with the extension md_in_html active.

input:

<a-b>

asdf

</a-b>

output:

<p><a-b></p>
<p>asdf</p>
<p></a-b></p>

expected:

<a-b>
<p>asdf</p>
</a-b>

I went through the code and might know how to add this, but I would like the maintainers' input before proceeding.

waylan commented 2 years ago

> Web components are custom HTML components that are required to have - in their names.

Can you point us to a spec for this?

igordsm commented 2 years ago

The resource I use the most is MDN: https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_custom_elements

The actual specification of valid names is at https://html.spec.whatwg.org/#valid-custom-element-name
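To make the naming rule concrete, here is a rough sketch of a name check. It is an ASCII-only simplification of the grammar as I understand it from the spec linked above (the real grammar also allows many non-ASCII characters, and the spec may have changed since), and the function name is made up for illustration.

```python
import re

# Names the spec reserves even though they contain a hyphen
# (listed here as I understand the spec; treat as an assumption).
RESERVED_NAMES = frozenset({
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
})

# ASCII-only approximation: a lowercase letter first, at least one hyphen,
# and no uppercase letters anywhere.
_ASCII_NAME = re.compile(r"[a-z][a-z0-9._-]*-[a-z0-9._-]*")


def looks_like_custom_element_name(name):
    """Return True if `name` passes the simplified custom-element check."""
    return bool(_ASCII_NAME.fullmatch(name)) and name not in RESERVED_NAMES


for candidate in ("a-b", "div", "font-face", "A-b"):
    print(candidate, looks_like_custom_element_name(candidate))
```

Only a-b passes here: div has no hyphen, font-face is reserved, and A-b contains an uppercase letter.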

waylan commented 2 years ago

Thank you for the links. There are two things I need to mention here.

First of all, the way Python-Markdown handles raw HTML is to define a list of known block-level tags. Any content within those block-level tags gets special treatment. Anything outside those known block-level tags is just treated as regular Markdown content, including inline raw HTML elements, which explains the behavior of the sample provided above.
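As a toy model of that lookup (not the library's actual parser; the tag list is abbreviated and the function name is hypothetical), the decision boils down to checking the tag name against a known list:

```python
import re

# Abbreviated, hypothetical list; the real library keeps a much longer one.
KNOWN_BLOCK_TAGS = ["div", "p", "table", "blockquote", "pre"]

# Capture the tag name of an opening tag at the start of a line.
_OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9-]*)(?=[\s/>])")


def starts_raw_block(line, block_tags):
    """Return True if `line` opens with a tag found in `block_tags`."""
    match = _OPEN_TAG.match(line.lstrip())
    return match is not None and match.group(1).lower() in block_tags


print(starts_raw_block("<div>", KNOWN_BLOCK_TAGS))  # True: known block tag
print(starts_raw_block("<a-b>", KNOWN_BLOCK_TAGS))  # False: treated as inline
print(starts_raw_block("<a-b>", KNOWN_BLOCK_TAGS + ["a-b"]))  # True once listed
```

A tag that is not on the list falls through to normal paragraph handling, which is why <a-b> ends up wrapped in <p> in the sample.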

Second, I will note that to use custom elements, the HTML spec requires you to register the elements with the browser first. Without registration, the browser would have no knowledge of how to handle them. In fact, a custom element could be an inline element or a block-level element.

Given the above, I think that the logical way to support custom elements in Python-Markdown is to require the user to "register" the elements. That is, if you have a custom element which should be treated as a block-level element, you need to inform the Markdown class about it. This would probably make a good candidate for a third-party extension (an extension to register custom elements), although you can do this without an extension as demonstrated below.

>>> import markdown
>>> src = '''
... <a-b>
...
... asdf
...
... </a-b>
... '''
>>> md = markdown.Markdown()
>>> md.block_level_elements.append('a-b')
>>> md.convert(src)
'<a-b>\n\nasdf\n\n</a-b>'

That said, this does not currently work correctly with the md_in_html extension. Specifically, the extension fails to allow Markdown parsing within the element.

>>> md = markdown.Markdown(extensions=['md_in_html'])
>>> md.block_level_elements.append('a-b')
>>> md.convert(src)
'<a-b>\n\nasdf\n\n</a-b>'

This would appear to be because the extension compiles its lists of element types when the class instance is created and therefore does not see the changes made to the Markdown class later (see relevant code here). Ideally, the extension would build its lists of element types after all extensions are loaded. I'm open to a PR which makes this change only. However, I do not see any need to add explicit support for custom elements specifically.
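To illustrate the timing problem in isolation, here is a minimal sketch in plain Python. All of the class names are hypothetical stand-ins and none of this is the extension's actual code; it only contrasts copying the list at construction time with reading it at lookup time.

```python
class Registry:
    """Hypothetical stand-in for the Markdown instance."""

    def __init__(self):
        self.block_level_elements = ["div", "p"]


class EagerHandler:
    """Copies the tag list once, when the handler is created."""

    def __init__(self, registry):
        self.block_tags = set(registry.block_level_elements)

    def is_block(self, tag):
        return tag in self.block_tags


class LazyHandler:
    """Reads the registry's current tag list on every lookup."""

    def __init__(self, registry):
        self.registry = registry

    def is_block(self, tag):
        return tag in self.registry.block_level_elements


registry = Registry()
eager = EagerHandler(registry)
lazy = LazyHandler(registry)

# The user registers the custom element after the handlers already exist.
registry.block_level_elements.append("a-b")

print(eager.is_block("a-b"))  # False: the early copy never sees the change
print(lazy.is_block("a-b"))   # True: the lookup reads the live list
```

Deferring the copy until after all extensions are loaded would presumably behave like the lazy variant for elements registered during setup.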

igordsm commented 2 years ago

Thanks for the detailed feedback @waylan. I'll try to make a PR with the changes you outlined above this week.