carpedm20 / emoji

emoji terminal output for Python

Malformed zero width joiner (`\u200d`) causes `IndexError` #263

Closed miped closed 1 year ago

miped commented 1 year ago

We use the emoji package to strip emoji from data that we sync into our system. We just hit an interesting case where the string `"\u200dMichael"` causes an `IndexError` in the following code:

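```python
import emoji


def strip_emoji(val: str) -> str:
    """Strip Unicode emoji from a string."""
    return emoji.replace_emoji(val, replace="")


# Reproduction from this report: raises IndexError in the affected version.
strip_emoji("\u200dMichael")
```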

This string is obviously malformed, as a ZWJ (zero width joiner, `\u200d`) should not stand on its own. But `emoji` should probably not raise an exception because of it. It comes down to this specific line: https://github.com/carpedm20/emoji/blob/4e1299f0e6e7135f0a338db71c71798e0a43c4d6/emoji/tokenizer.py#L206
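As a possible stopgap on our side, something like the following could be applied to the input before it reaches `strip_emoji`. It is only a sketch: `strip_dangling_zwj` is a hypothetical helper name, and it assumes the error is triggered by a ZWJ at the edge of the string, as in the example above. I have not checked whether other malformed inputs fail the same way.

```python
ZWJ = "\u200d"  # zero width joiner


def strip_dangling_zwj(val: str) -> str:
    """Remove ZWJ characters from the start and end of a string.

    Hypothetical helper: a ZWJ at either edge cannot join two
    characters, so dropping it should not break a valid emoji sequence.
    ZWJs inside the string are left untouched.
    """
    return val.strip(ZWJ)


cleaned = strip_dangling_zwj("\u200dMichael")
print(repr(cleaned))
```

The result can then be passed to `strip_emoji` as before. ZWJs between other characters are kept on purpose, since they are part of valid joined emoji sequences.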

Please let me know if you need any other information.

cvzi commented 1 year ago

Thanks for the detailed report!

Downgrading to the previous version of the package will probably prevent the error from happening.

I will create a patch. I should have written test cases to avoid this 😣