chdsbd / kodiak

🔮 A bot to automatically update and merge GitHub PRs
https://kodiakhq.com
GNU Affero General Public License v3.0

fix stripping html to correctly handle byte offsets #805

Closed · chdsbd closed 2 years ago

chdsbd commented 2 years ago

markdown_html_finder returns offsets in bytes, but our Python code indexes strings by Unicode code point. 🏷️<!-- --> is 11 characters, but 16 bytes. So if the PR body contains non-ASCII characters, the byte offsets point at the wrong positions in the string and stripping HTML doesn't work correctly.

The fix is to handle byte offsets and Unicode offsets separately. We must encode the body to bytes before applying the offsets from markdown_html_finder, then decode each extracted HTML snippet back to a Unicode string to parse HTML comments from it.
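A minimal sketch of the approach described above, not the code from this PR. `find_html_byte_spans` is a hypothetical stand-in that returns UTF-8 byte offsets, which is the behaviour attributed to markdown_html_finder here; its real API may differ. `strip_html_comments` is likewise an illustrative name.

```python
import re
from typing import List, Tuple


def find_html_byte_spans(body: str) -> List[Tuple[int, int]]:
    """Hypothetical stand-in for markdown_html_finder.

    Returns (start, end) offsets of HTML comments measured in UTF-8 bytes,
    not in str indices.
    """
    encoded = body.encode("utf-8")
    return [m.span() for m in re.finditer(rb"<!--.*?-->", encoded, flags=re.DOTALL)]


def strip_html_comments(body: str) -> str:
    """Remove HTML comments from body using byte offsets safely."""
    # Work in bytes so the byte offsets line up with what we slice.
    encoded = body.encode("utf-8")
    out = bytearray()
    last = 0
    for start, end in find_html_byte_spans(body):
        # Keep everything between the previous span and this one.
        out += encoded[last:start]
        # Decode the snippet back to str before parsing comments out of it.
        snippet = encoded[start:end].decode("utf-8")
        cleaned = re.sub(r"<!--.*?-->", "", snippet, flags=re.DOTALL)
        out += cleaned.encode("utf-8")
        last = end
    # Keep the tail after the final span.
    out += encoded[last:]
    return out.decode("utf-8")


# Example body: an emoji (two code points, seven UTF-8 bytes) before a comment.
body = "\U0001F3F7\uFE0F <!-- hidden --> label"
print(strip_html_comments(body))
```

Slicing `body` directly with the byte offsets would cut at the wrong place, because every multi-byte character before the comment shifts the byte offset further past the matching string index.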

related #800