Python-Markdown / markdown

A Python implementation of John Gruber’s Markdown with Extension support.
https://python-markdown.github.io/
BSD 3-Clause "New" or "Revised" License

InlineProcessor is not detecting matches or adds some additional text into a match #1299

Closed naitsok closed 1 year ago

naitsok commented 1 year ago

Hi all! Thank you for the hard work developing Python-Markdown!

I need help figuring out a few problems that I encountered while writing my InlineProcessor to extend Python-Markdown. I have studied the documentation and tutorial, but could not figure out why my code does not work. I use Python 3.10 and Python-Markdown 3.4.1.

Description of the problem: I want to display Markdown with LaTeX equations typeset by MathJax in the browser after the Markdown has been processed. Thus, the extension I want to write must detect the LaTeX equations and not perform any Markdown processing inside them. I have the following test string to be processed with Python-Markdown:

>>> s
'The reference for a super wise [quote](#quote-1). And referencing [figure](#figure-1).\r\nSome great inline equation \\\\[ \\frac{1}{2} \\\\ \\frac{1}{2} \\\\].\r\n\r\n$$ Block\\, \\\\ Math $$\r\n\r\nA reference to an equation \\eqref{eqn:sample}. $\\frac{1}{2} \\\\ \\frac{1}{2}$.\r\n\r\n\\begin{align}\r\nA & = \\int_0^\\infty \\frac{x^3}{e^x-1}\\,dx \\\\\r\n& = \\frac{\\pi^4}{15}\r\n\\label{eqn:sample}\r\n\\end{align}\r\n\r\n```latex\r\n\\begin{equation}\r\n\\omega_p\r\n\\end{equation}\r\n```\r\n\\begin{bmatrix}\r\nA & B \\\\\r\nC & D\r\n\\label{eqn:sample2}\r\n\\end{bmatrix}\r\nmore text more and more text writing.\r\n\r\n```python\r\ndef field(self):\r\n        field_kwargs = {"widget": MyMarkdownTextarea(attrs={"rows": self.rows})}\r\n        field_kwargs.update(self.field_options)\r\n        return forms.CharField(**field_kwargs)\r\n```'

In this string I want to detect equations inside \\( ... \\) and \\[ ... \\] (note the double backslashes before the brackets), as well as inside the \begin{...} ... \end{...} blocks.

Now here are the two problems that I have:

Problem 1

A simple InlineProcessor, similar to the built-in SimpleTextInlineProcessor:

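import html
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor

RE_SQUARE = r'()\\\\\[+([^$\n]+?)\\\\\]+'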

class MathJaxBracketProcessor(InlineProcessor):
    def handleMatch(self, m, data):
        text = html.escape('\\begin{' + m.group(1) + '}'+ m.group(2).strip() + '\\end{' + m.group(1) + '}\n')
        return text, m.start(0), m.end(0)

class MathJaxExtension(Extension):
    def extendMarkdown(self, md):
        md.inlinePatterns.register(MathJaxBracketProcessor(RE_SQUARE), 'inline-eq', 173)

def makeExtension(configs={}):
    return MathJaxExtension(**configs)

The problem is that it never ends up with a match. On the other hand, if I use just the Python interpreter I get:

>>> re.findall(r'()\\\\\[+([^$\n]+?)\\\\\]+', s)
[('', ' \\frac{1}{2} \\\\ \\frac{1}{2} ')]

Everything works! The regular expression is correct and finds the entry correctly.

Why is MathJaxBracketProcessor not detecting the substring?

Problem 2

Problem 2 is even stranger. Now I have a processor to detect the \begin{...} ... \end{...} pattern:

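import html
from markdown.inlinepatterns import InlineProcessor

RE_BEGIN = r'\\begin{(.*?)}([\s\S]*?)\\end{\1}'

# InlineProcessor (not the legacy Pattern) receives (m, data) and returns (node, start, end)
class MathJaxBeginPattern(InlineProcessor):
    def handleMatch(self, m, data):
        text = html.escape('\\begin{' + m.group(1) + '}'+ m.group(2).strip() + '\\end{' + m.group(1) + '}\n')
        return text, m.start(0), m.end(0)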

class MathJaxExtension(Extension):
    def extendMarkdown(self, md):
        md.inlinePatterns.register(MathJaxBeginPattern(RE_BEGIN), 'begin', 172)

def makeExtension(configs={}):
    return MathJaxExtension(**configs)

This processor finds a match, but for the first \begin{align}...\end{align} block, instead of the expected match

>>> m.group(2) 
'\r\nA & = \\int_0^\\infty \\frac{x^3}{e^x-1}\\,dx \\\\\r\n& = \\frac{\\pi^4}{15}\r\n\\label{eqn:sample}\r\n'

it gets

>>> m.group(2)
'\nA & = \\int_0^\\infty \\frac{[x^3]*(y_3)}{e^x-1}\\,dx \x02klzzwxh:0007\x03\n& = \\frac{\\pi^4}{15}\n\\label{eqn:sample}\n'

with the additional strange symbols \x02klzzwxh:0007\x03 instead of just \\\\. When I run the markdown() function without MathJaxBeginPattern, Markdown does not produce such symbols and converts everything correctly. Why is that?

Here is the full Python code to reproduce the second issue:

import markdown

RE_BEGIN = r'\\begin{(.*?)}([\s\S]*?)\\end{\1}'
RE_INLINE = r'()\\\\\[+([^$\n]+?)\\\\\]+'

s =  'The reference for a super wise [quote](#quote-1). And referencing [figure](#figure-1).\r\nSome great inline equation \\\\[ \\frac{1}{2} \\\\ \\frac{1}{2} \\\\].\r\n\r\n$$ Block\\, \\\\ Math $$\r\n\r\nA reference to an equation \\eqref{eqn:sample}. $\\frac{1}{2} \\\\ \\frac{1}{2}$.\r\n\r\n\\begin{align}\r\nA & = \\int_0^\\infty \\frac{x^3}{e^x-1}\\,dx \\\\\r\n& = \\frac{\\pi^4}{15}\r\n\\label{eqn:sample}\r\n\\end{align}\r\n\r\n```latex\r\n\\begin{equation}\r\n\\omega_p\r\n\\end{equation}\r\n```\r\n\\begin{bmatrix}\r\nA & B \\\\\r\nC & D\r\n\\label{eqn:sample2}\r\n\\end{bmatrix}\r\nmore text more and more text writing.\r\n\r\n```python\r\ndef field(self):\r\n        field_kwargs = {"widget": MyMarkdownTextarea(attrs={"rows": self.rows})}\r\n        field_kwargs.update(self.field_options)\r\n        return forms.CharField(**field_kwargs)\r\n```'

class InlineMathJaxProcessor(markdown.inlinepatterns.InlineProcessor):
    def handleMatch(self, m, data):
        print(m.group(2))
        text = '\\begin{' + m.group(1) + '}'+ m.group(2).strip() + '\\end{' + m.group(1) + '}\n'
        return text, m.start(0), m.end(0)

class MyExtension(markdown.extensions.Extension):
    def extendMarkdown(self, md):
        md.inlinePatterns.register(InlineMathJaxProcessor(RE_BEGIN), 'eqn', 171)

print(markdown.markdown(s, extensions=[MyExtension()]))

Output:

<p>The reference for a super wise <a href="#quote-1">quote</a>. And referencing <a href="#figure-1">figure</a>.
Some great inline equation \[ \frac{1}{2} \ \frac{1}{2} \].</p>
<p>$$ Block\, \ Math $$</p>
<p>A reference to an equation \eqref{eqn:sample}. $\frac{1}{2} \ \frac{1}{2}$.</p>
<p>\begin{align}A &amp; = \int_0^\infty \frac{x^3}{e^x-1}\,dx klzzwxh:0007
&amp; = \frac{\pi^4}{15}
\label{eqn:sample}\end{align}
</p>
<p><code>latex
\begin{equation}
\omega_p
\end{equation}</code>
\begin{bmatrix}A &amp; B klzzwxh:0010
C &amp; D
\label{eqn:sample2}\end{bmatrix}

more text more and more text writing.</p>
<p><code>python
def field(self):
        field_kwargs = {"widget": MyMarkdownTextarea(attrs={"rows": self.rows})}
        field_kwargs.update(self.field_options)
        return forms.CharField(**field_kwargs)</code></p>

Can anyone explain to me why I get such results? I could not figure out much from the docs and source code. Basically, my InlineProcessor just repeats the built-in SimpleTextInlineProcessor but works incorrectly, either ignoring matches or adding strange symbols. When I used markedjs with extensions using the same regular expressions, everything worked just fine. Any help is very much appreciated.

facelessuser commented 1 year ago

I believe this problem has been solved multiple times: https://github.com/Python-Markdown/markdown/wiki/Third-Party-Extensions#math--latex. I myself support such a plugin: https://facelessuser.github.io/pymdown-extensions/extensions/arithmatex/. Now, that isn't to say you have to use any of these and you are free to write your own regardless, but I wanted to point this out in case you were simply not aware.

It is difficult for me to even begin to debug what you've posted because there is so much context I do not have. Just for starters:

There are probably a number of other questions as well.

With all of that said, most likely what is happening is that some other extension is touching the content before yours. \x02klzzwxh:0007\x03 is most likely a code block grabbing content or some other escaping before your plugin ever gets a hold of it. You may try figuring out an earlier time to run your plugin.

While you can certainly go down this road for learning purposes, I would personally recommend you use one of the already available solutions that have already solved this problem.

waylan commented 1 year ago

First of all, I notice that your sample string that you are attempting to match contains blank lines within it. Markdown breaks up the text into paragraphs first, then runs inline processors on the contents of each paragraph. Therefore, the entire string would never be all contained within a single paragraph and your match would fail. If you want to match a block that spans multiple blank lines, then you need to use a block processor or possibly a preprocessor.
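The effect of paragraph splitting can be sketched with the standard library alone. This is a simplified model that assumes blocks are separated by blank lines; it is not Python-Markdown's actual block parser, and the sample string is made up for illustration.

```python
import re

RE_BEGIN = r'\\begin{(.*?)}([\s\S]*?)\\end{\1}'

# Hypothetical sample: the align environment contains a blank line.
text = 'Intro.\n\n\\begin{align}\nA &= 1 \\\\\n\nB &= 2\n\\end{align}\n\nOutro.'

# Simplified stand-in for block parsing: split the text on blank lines.
paragraphs = re.split(r'\n\s*\n', text)

# The pattern matches when it sees the whole document at once...
whole_match = re.search(RE_BEGIN, text) is not None

# ...but no single paragraph holds both \begin{align} and \end{align}.
per_paragraph = [re.search(RE_BEGIN, p) is not None for p in paragraphs]

print(whole_match)
print(per_paragraph)
```

Under this model, an inline pattern only ever sees one entry of `paragraphs` at a time, which is why a match that spans a blank line needs a block processor or a preprocessor instead.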

The \x02klzzwxh:0007\x03 strings are placeholders (see util.py#L50-L53) for previously parsed inline markup, which your extension blocks from being swapped back out for the actual characters later in the process. This suggests that you need to run your inline processor earlier in the process to prevent other inline processors from matching the text first.
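A toy model may help show why a later pattern stops matching once an earlier step has stashed part of the text. The `stash_escapes` helper below is hypothetical and only imitates the placeholder format seen in the output above; the set of escapable characters is reduced to the backslash and square brackets for illustration, and none of this is the library's real implementation.

```python
import re

STX, ETX = '\x02', '\x03'  # control characters that delimit a placeholder
RE_SQUARE = r'()\\\\\[+([^$\n]+?)\\\\\]+'

def stash_escapes(text):
    """Toy model: swap each backslash escape for a numbered placeholder."""
    stash = []

    def repl(m):
        stash.append(m.group(1))
        return f'{STX}klzzwxh:{len(stash) - 1:04d}{ETX}'

    # Assumption: only a backslash followed by '\', '[' or ']' is treated as an escape.
    return re.sub(r'\\([\\\[\]])', repl, text), stash

raw = 'Equation \\\\[ x \\\\ y \\\\].'
processed, stash = stash_escapes(raw)

# The pattern finds the equation in the raw text...
found_raw = re.search(RE_SQUARE, raw) is not None
# ...but not after the double backslashes were replaced by placeholders.
found_processed = re.search(RE_SQUARE, processed) is not None

print(found_raw, found_processed)
print(repr(processed))
```

In this sketch the double backslashes are already gone by the time `RE_SQUARE` runs, which mirrors both symptoms in the question: no match for `\\[ ... \\]`, and placeholders showing up inside the `\begin{...}` match.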

I'm curious whether you have tried one of the existing math-related extensions. There are a few of them that work quite well. See a full list in the wiki.

naitsok commented 1 year ago

Hi,

Thank you very much for the replies. It seems that I indeed did not search enough to find these plugins. However, I still believe something strange is happening. I use only the default Markdown 3.4.1 and Python 3.10. So there are only the default Markdown and my extension, nothing else.

Here is a minimal working example that reproduces the issue. It can be copied directly into an empty Python file and run:

import markdown

RE_BEGIN = r'\\begin{(.*?)}([\s\S]*?)\\end{\1}'

s = 'Some text. [link](#quote-1). \r\n\\begin{align}\r\nA & = \\frac{1}{2} \\\\\r\n& = \\frac{1}{2}\r\n\\label{eqn:sample}\r\n\\end{align}. More Text.'

class InlineMathJaxProcessor(markdown.inlinepatterns.InlineProcessor):
    def handleMatch(self, m, data):
        text = '\\begin{' + m.group(1) + '}'+ m.group(2).strip() + '\\end{' + m.group(1) + '}\n'
        return text, m.start(0), m.end(0)

class MyExtension(markdown.extensions.Extension):
    def extendMarkdown(self, md):
        ext = InlineMathJaxProcessor(RE_BEGIN)
        md.inlinePatterns.register(ext, 'eqn', 171)

print(markdown.markdown(s, extensions=[MyExtension()]))

And it gives

<p>Some text. <a href="#quote-1">link</a>.
\begin{align}A &amp; = \frac{1}{2} klzzwxh:0000
&amp; = \frac{1}{2}
\label{eqn:sample}\end{align}
. More Text.</p>

Even if I set the priority to, say, 55 instead of 171, it still gives the same result. But I do understand now that the multiline equation is most likely what causes this problem. Thank you for your time and comments; I will look through the math extensions to find the one most suitable for me.

facelessuser commented 1 year ago

> Even if I set the priority to, say, 55 instead of 171, it still gives the same result. But I do understand now that the multiline equation is most likely what causes this problem. Thank you for your time and comments; I will look through the math extensions to find the one most suitable for me.

You are assuming 55 runs earlier. Try something like 189.

naitsok commented 1 year ago

Yes, my mistake; I probably did not pay enough attention when reading the docs. With 189 my processor works well. Thank you! Now I just need to figure out how to keep the contents of this processor from going through other processors.

facelessuser commented 1 year ago

Documentation: https://python-markdown.github.io/extensions/api/#registries. Emphasis mine

When registering an item, a “name” and a “priority” must be provided. All items are automatically sorted by the value of the “priority” parameter such that the item with the highest value will be processed first. The “name” is used to remove (deregister) and get items.
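The ordering rule quoted above can be illustrated with a plain sort. The built-in names and priorities listed here (`backtick` 190, `escape` 180, `reference` 170) are assumptions recalled from the library's default inline patterns and should be checked against the installed version; `run_order` is a hypothetical helper, not part of the Registry API.

```python
# Assumed subset of the default inline patterns as (name, priority) pairs.
registered = [('backtick', 190), ('escape', 180), ('reference', 170)]

def run_order(extra_name, extra_priority):
    """Return pattern names in processing order: highest priority first."""
    items = registered + [(extra_name, extra_priority)]
    ordered = sorted(items, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]

order_55 = run_order('eqn', 55)
order_189 = run_order('eqn', 189)

print(order_55)   # ['backtick', 'escape', 'reference', 'eqn']
print(order_189)  # ['backtick', 'eqn', 'escape', 'reference']
```

With priority 55 the custom pattern runs after the escape handling, so it sees placeholders; with 189 it runs before it and sees the raw backslashes.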

> Now I just need to figure out how to keep the contents of this processor from going through other processors.

Again, most of this is already solved in existing plugins. But you would most likely wrap your content in a span or div (depending on whether it is inline or block) and then use the AtomicString API to prevent Markdown from doing further work on it.
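A minimal sketch of that suggestion, assuming the `markdown` package is installed: the match is wrapped in a `span` whose text is a `markdown.util.AtomicString`, and the processor is registered at priority 189 as discussed above. The class names and the `math` CSS class are made up for illustration, and existing math extensions handle many more cases than this.

```python
import xml.etree.ElementTree as etree

import markdown
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor
from markdown.util import AtomicString

RE_BEGIN = r'\\begin{(.*?)}([\s\S]*?)\\end{\1}'

class AtomicMathProcessor(InlineProcessor):
    """Hypothetical processor: wrap the matched environment in an atomic span."""

    def handleMatch(self, m, data):
        el = etree.Element('span')
        el.set('class', 'math')
        # AtomicString tells later inline processors to leave this text alone.
        el.text = AtomicString(m.group(0))
        return el, m.start(0), m.end(0)

class AtomicMathExtension(Extension):
    def extendMarkdown(self, md):
        # 189 is the priority that worked earlier in this thread.
        md.inlinePatterns.register(AtomicMathProcessor(RE_BEGIN, md), 'atomic-math', 189)

s = ('Some text. [link](#quote-1). \r\n\\begin{align}\r\nA & = \\frac{1}{2} \\\\\r\n'
     '& = \\frac{1}{2}\r\n\\label{eqn:sample}\r\n\\end{align}. More Text.')

html_out = markdown.markdown(s, extensions=[AtomicMathExtension()])
print(html_out)
```

The serializer still HTML-escapes `&` as `&amp;`, which the browser decodes again before MathJax reads the text, while the backslashes inside the span are left untouched.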

naitsok commented 1 year ago

Thank you all! I think I finally understand how it works and will soon be able to make it work myself. Thank you for the answers, and sorry for bothering you.