Python-Markdown / markdown

A Python implementation of John Gruber’s Markdown with Extension support.
https://python-markdown.github.io/
BSD 3-Clause "New" or "Revised" License
3.74k stars 858 forks source link

attr_list breaks if the value contains braces, requires HTML-encoding them #1390

Closed iBug closed 6 months ago

iBug commented 11 months ago

This issue originated from squidfunk/mkdocs-material#6177 where I was exploring the "override copied text" feature.

The following code block breaks Markdown parsing:

``` { .c data-copy="int main() { return 0; }" }
Try copying me for some C code

It doesn't quite make sense to me as everything is already inside double quotes.

Instead, HTML-encoding the braces (and you can add newlines) solves the problem:

````markdown
``` { .c data-copy="int main() {
  return 0;
&#125" }
Try copying me for some C code


Using HTML-encoding is not intuitive to most users. Maybe this could get changed to Python escapes?
waylan commented 11 months ago

Interesting issue. When the feature was implemented, we did not foresee any sort of formatted text being passed in as a value, which explains why you need to jump through hoops to make it work. For example, we might expect a class name or a value for an HTML title attribute. In either case, there would be no markup. Although, I suppose a title attribute could conceivably contain curly braces. I'll have to think about this.

Regarding the specific feature of mkdocs-material; I find it very strange that the feature even exists. I can't imagine any scenario where anyone would want that specific feature. Of course, use cases should not be limited by my imagination (or lack thereof), but it does give me pause. I will certainly not be adding specific support for newlines or other advanced formatting. The only thing I would consider here at all is to maybe allow curly braces within quoted values.

The tricky thing about this is that the initial regex does not parse the keys and values. It simply grabs the entire string (between the curly braces) and then passes that on to a separate key/value parser. We currently grab that string with \{(?P<attrs>[^\}\n]*)\} which does not allow any closing curly braces within the string, We would need to change that portion of the regex to be more greedy and match to the last closing curly brace on the line (maybe \{(?P<attrs>[^\n]*)\} or similar).

I'm curious how Superfences handles this scenario. Any input from @facelessuser or @squidfunk is welcome.

squidfunk commented 11 months ago

Thanks for the evaluation, @waylan.

Regarding the specific feature of mkdocs-material; I find it very strange that the feature even exists. I can't imagine any scenario where anyone would want that specific feature. Of course, use cases should not be limited by my imagination (or lack thereof), but it does give me pause. I will certainly not be adding specific support for newlines or other advanced formatting. The only thing I would consider here at all is to maybe allow curly braces within quoted values.

A user wanted to explicitly set the text for the clipboard button, which we implemented in https://github.com/squidfunk/mkdocs-material/issues/6086. We consider this to be highly experimental and are considering removing this functionality again, if it turns out to be complicated for users to use. If you follow the issue, you can learn that the use case is that some users have comments in their code blocks which they don't want to be copied.

If this turns out to be too involved to implement, or it can only be partially solved (e.g. curly braces are okay, but everything else needs to be escaped) we will remove the feature again, since we don't consider the necessity for escape everything with HTML entities to be very user-friendly. It's too prone to errors.

facelessuser commented 11 months ago

SuperFences uses attr_lists if enabled, to handle parsing code block headers of this sort. I don't personally ever use anything complicated that requires { or } in my attributes, nor have I had a need so far. I am generally indifferent as I don't use the data-copy feature that Mkdocs Material has implemented either.

With that said, I can see how and why someone would desire sane handling. I do think it is possible to craft a solution that could consume strings properly. But I also would probably not care if it was decided { and } handling wasn't fixed. Again, I don't stress the attribute system in this way currently. I guess I can't rule out the possibility that one day I may run into such an issue, but if I did, since HTML escapes can navigate me around the issue, I'd probably be okay for the 1 or 2 times I may ever need it.

waylan commented 11 months ago

@squidfunk thanks for the explanation. I commented on your issue with a suggestion for a different way to solve your problem. Whether you go that route or another is your choice. But the best you can hope for from us here is to allow curly braces in values. All other limitations will remain.

@facelessuser the one question I was hoping you would answer you didn't. The attr_list parser is not the issue, The issue is that fenced_code doesn't pass the curly braces on to attr_list. I was wondering if maybe superfences wasn't so restrictive.

facelessuser commented 11 months ago

@waylan It passes whatever is in the attribute to attr_list. This is no different than what using attr_list on a header does.

If use the following, attr_list fails to parse the attributes and they are not applied.

# Test {data-test="{}"}

When using Superfences, such an attribute would cause parsing to fail as well. If fenced_code handles it special, we do not currently model that behavior.

iBug commented 11 months ago

If I were to fix this, this seems sufficient:

-((\{(?P<attrs>[^\}\n]*)\})|
+((\{(?P<attrs>[^\n]*)\})|

As it is currently written, the only thing that may come after the attr_list on the opening line is hl_lines, which I would not consider a candidate for bringing braces. Removing } from that character class allows greedy matching while staying on the opening line. Any ideas?

waylan commented 11 months ago

@facelessuser the attr_list parser handles curly braces just fine.

>>> from markdown.extensions.attr_list import get_attrs
>>> get_attrs('data-test="{}"')
[('data-test', '{}')]

Like fenced_code, headers fail because the regex which grabs the attribute list is failing, not the attr_list parser itself. And that is why I ask what Superfences behavior is. It's possible that the method you use to identify the list before passing to the list parser works differently and this is a nonissue.

facelessuser commented 11 months ago

@waylan I was comparing the actual processing of the values in Markdown:

import markdown

MD = R"""
## test {data-test="{}"}
"""

print(markdown.markdown(MD, extensions=['attr_list']))
<h2>test {data-test="{}"}</h2>

In general, we were being consistent with how attr_list was handled everywhere else, but yes, we could adjust our own pattern to deviate. If get_attrs handles them fine, we could possibly relax our code patterns.

facelessuser commented 11 months ago

I think as far as SuperFences is concerned, we have a clear starting indicator (```), so we can probably just relax the pattern.

For normal inline patterns, attr_list would need to have a more precise pattern to avoid gobbling a whole line. Again, I don't know if attr_list plans to fix this for other cases, but at least as far as code blocks are concerned, this may be an easy fix for SuperFences.

In the worst case, if we find undesirable examples of it being too greedy, we could craft a more explicit rule for ourselves.

facelessuser commented 11 months ago

The change has been made in SuperFences.

waylan commented 11 months ago

For normal inline patterns, attr_list would need to have a more precise pattern to avoid gobbling a whole line. Again, I don't know if attr_list plans to fix this for other cases, but at least as far as code blocks are concerned, this may be an easy fix

This is what gives me pause. While it is an easy fix for fenced code blocks, it introduces an inconsistency with all other attr_lists. And to adjust the others would be a much more involved process.

facelessuser commented 11 months ago

Yep, it is definitely more involved. While I think possible, and maybe even doable as a single (much more complex) pattern, it is not nearly as simple.

squidfunk commented 11 months ago

FYI, we removed this feature from our documentation. The current implementation is too fragile and will likely lead to many problems when being used by non-technical users. We don't feel comfortable providing support for it at the current stage.

facelessuser commented 11 months ago

Well, regardless, the next version of pymdown-extensions will handle braces better in code block attributes.

waylan commented 11 months ago

@squidfunk thanks for the update. Given the fact that (1) there is no officially supported need for it, (2) the proposed modification would introduce a inconsistency between attr_lists on fenced code blocks compared to all other elements, and (3) Superfences provides a working solution for those who really want it, I'm inclined to not make a change to fenced code blocks at this time.

That said, if someone were to submit a PR which consistently altered the attr_list behavior across all elements to allow curly braces, I would be wiling to give it consideration. Therefore, I'm not going to close this but will give it the [someday-maybe] label.

oprypin commented 10 months ago

That said, if someone were to submit a PR which consistently altered the attr_list behavior across all elements to allow curly braces, I would be wiling to give it consideration.

I believe it satisfies this requirement