0-length `MessageEntity` is no longer valid but MD/HTML parser still produce them

Rongronggg9 commented 2 years ago

Formerly, Telegram simply ignored 0-length MessageEntity, but several days before, Telegram has silently changed its server-side behavior to throw an RPCError 400: ENTITY_BOUNDS_INVALID. However, the Markdown and HTML parsers of Telethon still produce 0-length MessageEntity if the user input does have empty tags.

A typical use case (https://github.com/Rongronggg9/RSS-to-Telegram-Bot/blob/15aef44b6e9b870ec9411398cb83ccb8f38acf68/src/command/administration.py#L17-L34) that can reproduce this issue (value can be an empty string, which is intended):

parseKeyValuePair = re.compile(r'^/\S+\s+([^\s=]+)(?:\s*=\s*|\s+)?(.+)?$')

@command_gatekeeper(only_manager=True)
async def cmd_set_option(event: Union[events.NewMessage.Event, Message], *_, lang: Optional[str] = None, **__):
    kv = parseKeyValuePair.match(event.raw_text)
    if not kv:  # return options info
        options = db.EffectiveOptions.options
        msg = (
                f'<b>{i18n[lang]["current_options"]}</b>\n\n'
                + '\n'.join(f'<code>{key}</code> = <code>{value}</code> '
                            f'({i18n[lang]["option_value_type"]}: <code>{type(value).__name__}</code>)'
                            for key, value in options.items())
                + '\n\n' + i18n[lang]['cmd_set_option_usage_prompt_html']
        )
        await event.respond(msg, parse_mode='html')
        return
    key, value = kv.groups()
    ...

I consider there are 3 solutions:

simply drop 0-length MessageEntity (or even all invalid ones) before returning the parse result.
like 1, but make a warning.
keep the current behavior but update the document and post an announcement on https://t.me/TelethonUpdates.

I'd rather prefer 1 because:

The HTTP bot API allows empty tags.
The HTTP bot API even allows 0-length entities.
Empty tags in HTML are totally valid (not sure for Markdown).
Not to introduce breaking changes (since Telegram formerly allowed them).
f-string, str.format(), % operator, or other templating methods are popular. They usually produce empty tags which are intended (at least in my use case, there are several templates that produce intended empty tags, one of which has been shown before).
The current documentation does not mention that it can't accept empty tags.
The Markdown/HTML parsers provided by Telethon seem to simulate the behavior of the ones provided by HTTP bot API.

Obviously, this is not a bug of Telethon, but I wish Telethon could "fix" it urgently since the change has a wide influence but the use case is common.

I consider there are 2 possible "fixes":

prevent the Markdown and HTML parsers from creating 0-length entities
let helpers.strip_text() to delete them

I'd rather prefer 2 because:

downstream projects can use it even if they create entities themselves
preventing the Markdown parser from creating 0-length entities requires many modifications

I will submit a PR to the v1.24 branch to fix it according to https://github.com/LonamiWebs/Telethon/issues/3880#issuecomment-1190811979.

I am not sure if I should add ENTITY_BOUNDS_INVALID as a new exception since v1 should not receive new features that is not necessary according to the same context.

Lonami commented 2 years ago

Thanks for the detailed report. I agree your choices are the most sensible (and its implementation will have no visible changes to end-users, other than "it works like before").

I am not sure if I should add ENTITY_BOUNDS_INVALID as a new exception

Yes, the errors.csv aims to document all seen errors (you could still trigger this one using raw API). It's even used by other people, so it's good we keep it up to date.

Rongronggg9 commented 2 years ago

Yes, the errors.csv aims to document all seen errors (you could still trigger this one using raw API). It's even used by other people, so it's good we keep it up to date.

Sure! I will create another PR to add this. Which branch should it point to then? v1.24 or master? Or both by creating 2 PRs?

Lonami commented 2 years ago

No, v1.24 only is fine for now.

Lonami commented 2 years ago

Ah we actually fixed this before but I didn't close the issue right? That'd explain why I couldn't reproduce it. Anyway the above commit should work even for user-provided parsers so I'm closing this now. Thanks!

Rongronggg9 commented 2 years ago

Yes, the issue has been fixed. And I also failed to remember to close it, lol. I backported my PR for some days and reverted the backport commit after upgrading to v1.25. I've never seen this bug from then on.

LonamiWebs / Telethon

0-length `MessageEntity` is no longer valid but MD/HTML parser still produce them #3884