Python-Markdown / markdown

A Python implementation of John Gruber’s Markdown with Extension support.
https://python-markdown.github.io/
BSD 3-Clause "New" or "Revised" License

Speedup line_offset property #1392

Closed (eanorige closed this 10 months ago)

eanorige commented 10 months ago

When using mkdocs, I noticed huge runtimes and tracked them down to `HTMLExtractor.line_offset`, which was taking nearly all of the time. Before this patch, a run of mkdocs could take 500+ seconds; with this patch, it takes ~10 seconds.

Flamegraph before optimization: mkdocs2

Flamegraph after optimization: mkdocs2a

Note: It may be the case that the full cache of start positions of all lines isn't needed and that a single entry is sufficient; I didn't explore this optimization.
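
For reference, a minimal sketch of the caching idea (the helper names here are illustrative, not the actual patch): precompute the start offset of every line once, then answer the property with a list lookup instead of re-scanning the buffer on every call.

```python
def build_line_starts(rawdata: str) -> list[int]:
    """Character offsets at which each line of rawdata begins (illustrative helper)."""
    starts = [0]
    pos = rawdata.find('\n')
    while pos != -1:
        starts.append(pos + 1)               # the next line starts right after the '\n'
        pos = rawdata.find('\n', pos + 1)
    return starts

def line_offset(rawdata: str, lineno: int, starts: list[int]) -> int:
    """Start offset of 1-based line `lineno`; how to treat a lineno past the
    last line is exactly the edge case discussed below."""
    return starts[min(max(lineno, 1), len(starts)) - 1]
```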

eanorige commented 10 months ago

Thanks for triggering CI. Based on its feedback, I have:

  1. added the auto-requested changelog entry.
  2. removed the extra blank line accidentally introduced.
  3. reworded a comment.

In rewording this comment, I realized that there is a slight change in the behavior of this function when `self.lineno` is past the last line: the old implementation would return the position of the last `'\n'` in the input, while this implementation returns the length of the input (i.e. EOF). If the exact backwards-compatible behavior is needed, `lf_pos = len(self.rawdata)` should be changed to `= self.rawdata.rfind('\n')`. Let me know what you think.
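
To make that concrete with a toy buffer (not taken from the parser), here is how the two expressions disagree when `self.lineno` points past the last line of input that has no trailing newline:

```python
rawdata = "first\nsecond"      # no trailing newline

print(len(rawdata))            # 12 -> EOF, what the patched code returns
print(rawdata.rfind('\n'))     # 5  -> the last '\n', what the old code returned
```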

facelessuser commented 10 months ago

@waylan I am not as familiar with the HTML parsing that you implemented; can a buffer in this situation change on you, invalidating the cached offset?

> If the exact backwards-compatible behavior is needed, `lf_pos = len(self.rawdata)` should be changed to `= self.rawdata.rfind('\n')`. Let me know what you think.

Generally, I would say that to have a better chance of getting merged quickly, keeping the backwards-compatible behavior would be better. When something backwards incompatible is introduced, we need to be even more cautious to ensure it doesn't inadvertently break some other behavior that would cost us more time to debug later.

While the speedup is certainly nice, I would definitely want to ensure this hasn't introduced some unexpected parsing behavior that will break 3rd-party plugins in ways that aren't immediately obvious.

facelessuser commented 10 months ago

I guess if the old behavior could somehow be identified as inadvertent, then it would make sense not to retain backwards compatibility.

waylan commented 10 months ago

My memory is very hazy about this. That said, the code that this PR replaces is discussed in #1066 and/or #1068 and was committed in #1069. It appears that we added tests to cover the various edge cases.

> the old implementation would return the position of the last `'\n'` in the input, while this implementation returns the length of the input (i.e. EOF). If the exact backwards-compatible behavior is needed, `lf_pos = len(self.rawdata)` should be changed to `= self.rawdata.rfind('\n')`. Let me know what you think.

Yes, we want the exact same behavior. It is possible that a file may not end with a newline, and that would have been taken into consideration in the existing implementation. Therefore, we would need the method to always return the position of a `\n` exactly. Returning any other position would be an error. Perhaps we need to add a test for that.

waylan commented 10 months ago

> can a buffer in this situation change on you, invalidating the cached offset?

In short, yes. See this comment (and the few that follow) for an explanation. As a reminder, we are using the standard library HTML parser to extract HTML from non-HTML and need to hack it to make it work with any non-HTML which contains angle brackets.

> Therefore, we would need the method to always return the position of a `\n` exactly.

Actually, what I should have said is that the method must always return the beginning of a line. EOF is only the beginning of an (empty) line if the buffer ends with a newline. If it does not end with a newline, then EOF is not the beginning of a line and it would be an error to return that position. In fact, looking at the example given in the comment linked above, it is possible that returning EOF could fail to address the purpose of this hack altogether.
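
For anyone less familiar with why that offset matters, here is an illustrative toy (not the actual extractor): the stdlib parser only reports positions as (line, column) pairs via `getpos()`, so something like `line_offset` is needed to map that back to an absolute index into the raw buffer, and the mapping is only correct if the returned offset really is the start of a line.

```python
from html.parser import HTMLParser

SOURCE = "first line\n<p>second line</p>\n"

class PositionDemo(HTMLParser):
    # Illustrative only: convert the (lineno, column) pair from getpos()
    # into an absolute character index into the original source string.
    def handle_starttag(self, tag, attrs):
        lineno, col = self.getpos()
        line_start = 0
        for _ in range(lineno - 1):                      # skip lineno-1 newlines
            line_start = SOURCE.index('\n', line_start) + 1
        print(f"<{tag}> starts at absolute offset {line_start + col}")

PositionDemo().feed(SOURCE)    # -> <p> starts at absolute offset 11
```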

facelessuser commented 10 months ago

> In short, yes. See https://github.com/Python-Markdown/markdown/issues/1066#issuecomment-726290648 (and the few that follow) for an explanation. As a reminder, we are using the standard library HTML parser to extract HTML from non-HTML and need to hack it to make it work with any non-HTML which contains angle brackets.

This poses a problem then. If the cache can be invalidated without us knowing, then caching is likely a bad idea.

waylan commented 10 months ago

Sorry, I think I misread your question. No, the buffer doesn't change. However, the parser jumps too far ahead and we need to back it up. What is being cached is the last position. The new position we back it up to should always be after the last cached position. And now I'm wondering if I am wrong about what to do if no more newlines are found.

eanorige commented 10 months ago

> Actually, what I should have said is that the method must always return the beginning of a line.

The old behavior is actually a bit more complex than that. Here's a sketch of the current (pre-patch) behavior:

  1. When there are no newlines in the input, or `lineno` is <= 1, return 0.
  2. Else, if `lineno` is a valid line number, return the position of the first character *after* the newline that ends the previous line.
  3. Else, return the position of the last newline.

I suspect that there's a bug in the existing code and that we should return `rfind('\n') + 1` in the last case, to be consistent with the first two.
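
In code form, a paraphrase of those three cases (not the actual source):

```python
import re

def old_line_offset(rawdata: str, lineno: int) -> int:
    """Approximation of the pre-patch behavior sketched above."""
    if lineno > 1 and '\n' in rawdata:
        # Case 2: consume lineno-1 whole lines; the match ends on the first
        # character *after* that many newlines, i.e. the start of line `lineno`.
        m = re.match(r'([^\n]*\n){{{}}}'.format(lineno - 1), rawdata)
        if m:
            return m.end()
        # Case 3: lineno is past the last line; fall back to the last newline.
        # Arguably this should be rfind('\n') + 1 to be consistent with the others.
        return rawdata.rfind('\n')
    # Case 1: no newlines in the input, or lineno <= 1.
    return 0
```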

Edit: removed the caching-invalidation discussion.

waylan commented 10 months ago

> I suspect that there's a bug in the existing code and that we should return `rfind('\n') + 1` in the last case, to be consistent with the first two.

What is curious is why no tests are failing. Unless the character at the +1 position is insignificant whitespace or something. I wondered if maybe the tests weren't triggering that code path, but the coverage report says they are. I guess I need to take a closer look at the tests. One or two might need to be tweaked to be sure it fails if we are off by one character.
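
Purely as an illustration (not an actual test from the suite), a check along these lines would catch an off-by-one, since the character immediately before a true line start must be a newline:

```python
def is_line_start(rawdata: str, offset: int) -> bool:
    # A real line start is either offset 0 or the character right after a '\n';
    # an off-by-one result pointing at the '\n' itself fails this check.
    return offset == 0 or rawdata[offset - 1] == '\n'

assert is_line_start("first\nsecond\n", 6)        # start of "second"
assert not is_line_start("first\nsecond\n", 5)    # the '\n' itself
```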

waylan commented 10 months ago

I just took another look at this, specifically the behavior when no newlines are found. I confirmed that we have no tests for this edge case. In fact, the existing code has a `# pragma: no cover` comment on that code block. If I remove the comment, that line shows up in the coverage report as missing.

I don't recall if I added the comment because I couldn't come up with a scenario which triggered that code path, or because I was focusing on other things and expected to come back to it later. In any event, we should probably try to come up with a test for this. If we do, then that should dictate how it should behave. If not, then I'm not going to be too concerned about it.

waylan commented 10 months ago

So, the concern (how the HTML parser could get into the state where the line number is greater than the total number of lines) is described in this comment. However, among various similar tests, the following test exists:

https://github.com/Python-Markdown/markdown/blob/4f0b91abe1a954a72ab3c99c3e2b880ab36631fa/tests/test_syntax/blocks/test_html_blocks.py#L1119-L1139

I just tried many other variations and I cannot trigger that condition. I expect that it would take a very unusual edge case (with very poorly formed HTML) to ever hit it. Therefore, I'm not going to sweat it.