Python-Markdown / markdown

A Python implementation of John Gruber’s Markdown with Extension support.
https://python-markdown.github.io/
BSD 3-Clause "New" or "Revised" License
3.74k stars 858 forks source link

Add a new option of CodeHilite to assign lang class when Pygments is in use #1255

Closed ccwang002 closed 2 years ago

ccwang002 commented 2 years ago

This change adds a new optionpygments_add_lang_class to the CodeHilite extension. When Pygments and this option are enabled, the language of the code block is assigned to the generated <code> tag. The option respects options lang and lang_prefix like the existing behavior. If the language of the code block is guessed by Pygments, the name of the guessed lexer is used as the language name (technically the first alias of the lexer).

In short, with this option enabled, CodeHilite's generated output will always be HTML5 compliant with or without Pygments.

The generated HTML output of the following code block:

```python
 print('hello world')

will become:

``````html
<pre><code class="language-python">...
</pre></code>

Regarding the implementation, this option utilizes the new behavior of pygments_formatter that can take in a custom Pygments formatter introduced in PR #1187. We create a custom HTML formatter class that takes in the language and the prefix. User can still override to use their custom formatter when they specify pygments_formatter.

I am open to any suggestions and feedbacks. Thank you for making this package!

waylan commented 2 years ago

Obviously this still needs some work to get all test passing. However, I thought I would comment on the general idea. Personally, I find this intriguing. I have never spent much time exploring what can be done with custom formatters. That said, I have a few concerns.

I'm not crazy about the name pygments_add_lang_class. But I don't have any better suggestions. That said, the option seems redundant. In fact the same thing can be accomplished simply by passing in the custom formatter to pygments_formatter. Although, in text based configs (command line etc.) passing in Python objects is more difficult (although not impossible), so this does offer an easy solution in those scenarios.

Which brings me to my final concern. I wonder if this would be better as a third party addon. All that is really needed is for the user to have the definition of the custom formatter. Then they can pass than in to pygments_formatter. In fact, the custom formatter is simple enough that we could include it in the documentation as an example of how a user could alter Pygments' behavior. Users could then copy the class from the documentation into their own code and modify it as they see fit.

In short, with this option enabled, CodeHilite's generated output will always be HTML5 compliant with or without Pygments.

The above comment concerns me a little. Let's consider the relevant section of the spec:

There is no formal way to indicate the language of computer code being marked up. Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element.

First of all, it is clear that there is "no formal" specification for how to define a language. Second, it is suggested that one "can use the class attribute." However, that is a suggestion, not a requirement. And then the example of using the prefixed class is an example of one possible way that could be done. I agree with the spec writers and think that the suggestions are good ones. However, this is not required by the spec and we don't want to suggest that it is. Therefore, our output is compliant with or without this change. To suggest otherwise is misleading.

Finally, the suggested reason for identifying the language is so that syntax highlighters can properly work on the code. However, in our case, the syntax highlighter has already done its work. Therefore, there is no functional need for the language to be identified. The language only serves as meta-data at this point. I'm not saying that it serves no value, but it is certainly less important to include.

Given all of the above, I'm leaning more and more to my documentation idea. Thoughts anyone?

facelessuser commented 2 years ago

Personally, I don't know why the class needs to be attached to the code element. You don't need to prep the elements for a JS highlighter as it is already highlighted. At most, you are using it to do some personal styling. I would just use some simple regex to attach a lang-name class to the div once the results are returned, and for styling, target the code block with something like .codehilite lang-name code. The very latest Pygments wraps all their code blocks with <div> now, even the tabled, line number format. After that, you can use the language class for styling a specific language.

I maintain an alternative to codehilite and fenced_code myself, and that is the approach I took. Have I used custom HTML formatters for things? Sure, but doing this can require you to have to update breakage if Pygments changes the formats (which they did on the very latest Pygments). I'm not saying that is a deal-breaker, just something to be aware of.

Using an HTML formatted is certainly an option, and if we find this approach attractive to implement this feature, especially if we deem that such a class must be attached to <code> (though I don't see why), I imagine it is a bit easier to do it with a custom HTML formatter.

facelessuser commented 2 years ago

I will say that I'm +1 on adding the language class as I've come to learn people like to use it for different reasons. I'm +0 on using a custom HTML format class 🤷🏻.

ccwang002 commented 2 years ago

Thanks for your suggestions and feedbacks! I agree there might be better option names. Happy to change it.

I also agree that the custom formatter can comes as a code recipe on the documentation. That said, currently in the CodeHilite extenstion, the formatter doesn't have access to the language information. If we decide not to include a custom formatter and get rid of the option, we should at least consider passing lang and maybe lang_prefix to the formatter like the following:

https://github.com/Python-Markdown/markdown/blob/7d4ecad0b17c97b1b009deeb5600f767afbc008f/markdown/extensions/codehilite.py#L188-L191

Regarding the HTML5 compliance, I saw this wording on the documentation for the non-pygments generation, so I decided to copy it. I will tone it down later.

The language class doesn't need to be at the <code> tag, it can be at the enclosing <div> tag or anywhere users desire. It's just a common practice (as suggested by the spec) to have the <pre><code class="language-xxx"> structure. The new Pygments certainly broke the tests here (got bit by it), I think the pre>code part is quite stable. We can probably update the tests to just check the outer parts that's controlled by our options.

Indeed, this option is about the language identification. My main problem is that there is no way I am aware of that indicates the original language in the generated code blocks, using the official extensions. Personally I'd love to just stick to the official ones because I won't be able to maintain my own in the long run.

Besides syntax highlighting, having the language identification in the generated code blocks opens up some interactive functionalities. One example I can think of right away is have the console outputs toggleable on the official Python documentation:

Console output on Console output off
image image

To sum up, we have a few approaches if we want to provide customization:

  1. Figure out some way to expose the language information to the Pygment formatter (no new option added)
  2. Additionally, create a recipe on the documentation for some example usage
  3. Additionally, actually ship a custom formatter along with the extension (new option added)
  4. Entirely different solution (not sure exactly how)
facelessuser commented 2 years ago

Yes, I do agree there is utility in using a lang-name class on Pygments code blocks. I recently added optional support for this myself in Pymdown Extensions, so I am okay with the idea behind it. I just appended it to the <div> after getting the results from Pygments. Again, I'm not against using a custom HTML formatted per se, so I have no comments specifically on that portion.

waylan commented 2 years ago

I also agree that the custom formatter can comes as a code recipe on the documentation. That said, currently in the CodeHilite extenstion, the formatter doesn't have access to the language information. If we decide not to include a custom formatter and get rid of the option, we should at least consider passing lang and maybe lang_prefix to the formatter

Right. That would be reasonable.

I will need to think on what I consider the best way forward here.

waylan commented 2 years ago

Pygments will accept any keyword arguments passed to it and will simply ignore any it does not recognize without error. Therefore, we can easily pass the language (and prefix) to all formatters, even the Pygments built-in ones. Therefore, I think the correct way forward would be to:

  1. Include the language in the arguments passed to any formatter. This should probably be a single string which combines the prefix and the language.
  2. Document how to use a custom formatter. In the example provided in the documentation, use something similar to the subclass included in this PR.

And that's it. We could insert the language on the div, but I don't see a need for this. Those users who really want that can use a custom formatter and then they can set the language wherever they want it.

I would close this PR, but I'll leave it open as the "issue" to be closed when a new PR is merged that implements the above.

ccwang002 commented 2 years ago

👍 sounds good to me! I can make another PR with the suggested implementation.