Python-Markdown / markdown

A Python implementation of John Gruber’s Markdown with Extension support.
https://python-markdown.github.io/
BSD 3-Clause "New" or "Revised" License
3.8k stars 863 forks source link

SmartyPants is very slow #1103

Open kevinrenskers opened 3 years ago

kevinrenskers commented 3 years ago

When I am parsing 64 Markdown articles with Python-Markdown, it was taking a little bit over 4 seconds, 3.1 of which was from the SmartyPants extension. It's so much slower than any other extension. Not sure if this is known or not, if you consider it a bug, but wanted to bring it to your attention and ask if this is expected.

waylan commented 3 years ago

This does not surprise me. I always objected to Smartypants being a Markdown extension instead of an external post processor. It was never designed to be performant. You have 2 options:

  1. Do the work yourself to improve it and submit a PR.
  2. Use an external post processor which runs on the output of Markdown.
oprypin commented 3 years ago

Can you provide an example site on which it is slow?

kevinrenskers commented 3 years ago

No, I ended up removing Python-Markdown because of this problem. But it should be rather trivial to reproduce: try to parse a substantial (2000 word) Markdown document 60 times, and time it with and without using the SmartyPants extension.

oprypin commented 3 years ago

I have looked into this, and the implementation is indeed fundamentally slow. Some small things can be done, but no real solution in sight.

  1. It takes the whole paragraph, splits it into 3 parts (before, substitution, after), then reconstructs it once per each substitution. https://github.com/Python-Markdown/markdown/blob/12be7c9a30247b1df9a211a4cc90dfb9aa136fae/markdown/treeprocessors.py#L312-L313
  2. After each substitution happens, it goes back to the beginning of the string and tries to apply each of the regexes again. So basically each regex (of which there are around 20, both from Smarty and the core ones) is executed once per substitution. https://github.com/Python-Markdown/markdown/blob/12be7c9a30247b1df9a211a4cc90dfb9aa136fae/markdown/treeprocessors.py#L271-L272

Neither of the two is the clear bigger bottleneck.

This behavior is basically how full extensibility via plugins is ensured, and trying to rework or even just trying to optimize the lookbacks risks subtly breaking any of the downstream plugins.


But all I've said so far applies generally to all InlineProcessors (of which there are very many for various core features). The reason that Smarty is particularly affected by this, then, is two-fold:

  1. Probably just prone to a lot of small replacements? That's minor, though.
  2. Its regular expressions seem to be very numerous and very heavy on the engine.

So, the small things that can be done:

  1. Just plain regex optimization.
  2. Maybe reducing the number of regexes by joining them all into one
    • I have actually tried this and it did not help much, though.
  3. Various micro-optimizations that don't change any logic.
mitya57 commented 3 years ago

I am the original author of this extension, but unfortunately I don’t have enough time to dig into this. So the only thing I can say is that PRs are welcome.

raylutz commented 2 years ago

I find that markdown.markdown (md to html conversion) is bogged down due to embedded html with < or > enclosing it. It seems the algorithm is slow because it does character by character parsing rather than perhaps looking for '<' and '>' in regex and then marking the location in the file so it can be parsed and apparently removed and then reinserted. I am considering working around this issue by inserting the html after the conversion. Not sure if this has anything to do with "SmartyPants", I am speaking about markdown.markdown().