`markdown_html_finder` returns offsets in bytes, but our Python code indexes strings by Unicode code point. For example, `🏷️ <!-- -->` is 11 characters but 16 bytes in UTF-8. So when a PR body contains non-ASCII characters, the offsets point at the wrong positions and stripping HTML comments doesn't work correctly.

The fix is to keep byte offsets and Unicode offsets separate. We encode the body to bytes in order to apply the offsets from `markdown_html_finder`, then decode each HTML snippet back to a Unicode string in order to parse HTML comments out of it.
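A minimal sketch of the approach, not the actual patch. It assumes `markdown_html_finder` yields `(start, end)` offsets into the UTF-8 encoding of the body; the function name `strip_html_comments`, the `byte_spans` parameter, and the regex are hypothetical and used only for illustration.

```python
import re

# Hypothetical comment matcher; the real code may parse comments differently.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(body, byte_spans):
    """Remove HTML comments from the HTML snippets located by byte offsets.

    `byte_spans` is assumed to be a list of (start, end) offsets into the
    UTF-8 encoding of `body`, sorted and non-overlapping.
    """
    raw = body.encode("utf-8")  # slice in bytes, where the offsets are valid
    pieces = []
    cursor = 0
    for start, end in byte_spans:
        # Text between snippets is passed through untouched.
        pieces.append(raw[cursor:start].decode("utf-8"))
        # Decode the snippet so the comment regex runs on a Unicode string.
        snippet = raw[start:end].decode("utf-8")
        pieces.append(HTML_COMMENT.sub("", snippet))
        cursor = end
    pieces.append(raw[cursor:].decode("utf-8"))
    return "".join(pieces)


# U+1F3F7 plus U+FE0F (the label emoji), a space, then an empty comment.
example = "\U0001F3F7\uFE0F <!-- -->"
print(len(example), len(example.encode("utf-8")))
```

With `body = example + "tail"` and the comment's byte span `(8, 16)`, slicing the string directly with `body[8:16]` gives `"-->tail"` rather than the comment, which is the mismatch described above; slicing the encoded bytes picks out the comment itself.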
Related: #800