Some converting issues - Githubissues

SlamperBOOM commented 5 months ago

Hello!

First of all, thank you for your formatter. It was really hard to find what I wanted, and your solution was the only one that works perfectly for me :). I found 4 issues while using your implementation of the formatter and want to share them with you. Sadly, I have fixed only one of them.

1. HTML reserved symbols inside code blocks.

For some reason, Telegram doesn't ignore reserved symbols inside the \<code> tag. For example:

<code>
#include <iostream>
...some code here
</code>

Telegram returns a bad request error because of the unknown tag \<iostream>.

The fix is easy and works perfectly. I made a simple function that should be run before all replacements:

def convert_html_chars(text: str):
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text

And that's how we use it:

def telegram_format(text: str) -> str:
    """
    Converts markdown in the provided text to HTML supported by Telegram.
    """
    # Step 0: convert html reserved symbols
    text = convert_html_chars(text)

    # Step 1: Extract and convert code blocks first
    output, code_blocks = extract_and_convert_code_blocks(text)
   ...your code goes further

I took this replacement from the official Telegram API docs.
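For comparison, Python's standard library performs the same three replacements. This is a minimal sketch; the wrapper name `escape_for_telegram` is hypothetical and not part of the formatter:

```python
import html


def escape_for_telegram(text: str) -> str:
    """Escape the HTML reserved symbols &, < and > (hypothetical helper)."""
    # quote=False leaves " and ' untouched, matching convert_html_chars above
    return html.escape(text, quote=False)


print(escape_for_telegram("#include <iostream>"))
```

Like `convert_html_chars`, this has to run before any Markdown-to-HTML replacement, otherwise the tags the formatter generates would be escaped as well.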

2. Incorrect replacements

I found an issue with AI-generated text like this:

...so you should replace variable `found_text` with viable value and remove variable `error_text` ...

The formatter converts this text to:

...so you should replace variable <code>found<i>text</code> with viable value and remove variable <code>error</i>text</code>

And of course, Telegram rejects this message with the error "Can't find closing tag \</i>". My suggestion is to look for a pair of markers only inside the tag where the first marker was found, i.e. when we find "_" or something else inside another Markdown element like " ** " or " ``` ". If we can't find a pair, just don't replace it.

3. More incorrect replacements
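One way to keep "_" inside inline code from being treated as italic is to set the code spans aside before the other replacements run and put them back afterwards. The sketch below is only an illustration of that idea; `protect_inline_code`, `restore_inline_code` and the placeholder format are hypothetical names, not functions from the formatter:

```python
import html
import re

# Inline code: text between single backticks on one line
INLINE_CODE = re.compile(r"`([^`\n]+)`")


def protect_inline_code(text: str) -> tuple[str, list[str]]:
    """Replace each inline code span with a placeholder that contains no Markdown markers."""
    spans: list[str] = []

    def stash(match: re.Match) -> str:
        # Store the finished <code> element; escape its content for Telegram HTML
        spans.append(f"<code>{html.escape(match.group(1), quote=False)}</code>")
        return f"\x00CODE{len(spans) - 1}\x00"

    return INLINE_CODE.sub(stash, text), spans


def restore_inline_code(text: str, spans: list[str]) -> str:
    """Put the stored <code> elements back in place of their placeholders."""
    for index, span in enumerate(spans):
        text = text.replace(f"\x00CODE{index}\x00", span)
    return text


protected, spans = protect_inline_code("replace `found_text` and remove `error_text`")
# ... bold/italic/underline replacements would run on `protected` here ...
print(restore_inline_code(protected, spans))
```

Because the placeholders contain no underscores or asterisks, the italic step has nothing to match inside them.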

I found an issue with text like this:

... you should use variables model_name and context_name to ...

The formatter converts this text to:

... you should use variables model<i>name and context</i>name to ...

IMO, "_" symbols inside a word (i.e. surrounded by text characters, not by whitespace) should not be replaced. So my suggestion is to replace "_" with \<i> only when there is whitespace in front of it, and with \</i> only when there is whitespace after it. And of course, replace these symbols only when we can find a pair.
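The rule described here can be approximated with regex lookarounds. The following is a sketch under assumptions, not the formatter's code: `convert_italic` is a hypothetical name, and it presumes that "__" (underline) has already been handled by an earlier step:

```python
import re

# Opening "_" must not follow a word character; closing "_" must not precede one.
# The text in between must not start or end with whitespace.
ITALIC = re.compile(r"(?<!\w)_(?!\s)(.+?)(?<!\s)_(?!\w)")


def convert_italic(text: str) -> str:
    """Convert _text_ to <i>text</i>, leaving underscores inside words alone."""
    return ITALIC.sub(r"<i>\1</i>", text)


print(convert_italic("you should use variables model_name and context_name to"))
print(convert_italic("this is _really_ important"))
```

Since the pattern only matches complete pairs, a lone "_" is left as it is, which also covers the "only when we can find a pair" condition.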

4. List with '*'

I found a small issue where the model makes a bulleted list with '*' symbols and the formatter thinks there is a pair of italic markers. My fix is simple, but I don't think it is a good solution:

    # Process Markdown formatting tags (bold, underline, italic, strikethrough)
    # and convert them to their respective HTML tags
    output = split_by_tag(output, "**", "b")
    output = split_by_tag(output, "__", "u")
    output = split_by_tag(output, "_", "i")
    # output = split_by_tag(output, "*", "i")  # the fix: this call is disabled
    output = split_by_tag(output, "~~", "s")
    output = re.sub(r"\[(.*?)\]\((.*?)\)", r'<a href="\2">\1</a>', output)  # Links
    output = re.sub(r"^\s*[\-\*] (.+)", r"• \1", output, flags=re.MULTILINE)  # Lists

Maybe a proper solution is to change the order of the replacement calls.
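A reordering along those lines could look like the sketch below: list markers are converted first, so the "*" italic step never sees them. The function names are hypothetical, the bullet pattern is adapted from the list regex above (with `[ \t]*` instead of `\s*` so that it stays on one line), and the italic pattern is an assumption rather than the formatter's `split_by_tag`:

```python
import re


def convert_bullets(text: str) -> str:
    """Turn leading '-' or '*' list markers into bullet characters."""
    return re.sub(r"^[ \t]*[-*] (.+)", r"• \1", text, flags=re.MULTILINE)


def convert_star_italic(text: str) -> str:
    """Convert *text* to <i>text</i>; the content must not start or end with whitespace."""
    return re.sub(r"\*(?!\s)(.+?)(?<!\s)\*", r"<i>\1</i>", text)


def format_lists_then_italic(text: str) -> str:
    # Order matters: bullets first, italic second
    return convert_star_italic(convert_bullets(text))


print(format_lists_then_italic("* first item\n* second item with *emphasis*"))
```

Note that "**" (bold) would still need to be handled before the single "*" step, as in the original call order.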

Small suggestion

Sometimes the AI produces answers longer than Telegram's maximum message length of 4096 characters, and we must divide the answer into parts smaller than that. In that case, an opening tag and its closing tag may end up in different parts. My suggestion is to make a split_msg function that properly divides the message into parts, adding closing tags at the end of the first part and opening tags at the beginning of the second part. Of course, we could first divide the message into parts and then format them, but then we lose some of the text formatting, which is why that isn't a proper solution.

Feel free to ask me questions if anything in my text is unclear :)
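A rough sketch of such a splitter follows. It is an illustration only: `split_html_message` is a hypothetical name, the input is assumed to be well-formed, already-escaped HTML with properly nested tags, the limit is applied to the raw markup length, and an entity such as `&lt;` may still be cut in half at a chunk boundary:

```python
import re

TAG = re.compile(r"(<[^>]+>)")


def split_html_message(html_text: str, limit: int = 4096) -> list[str]:
    """Split HTML into chunks of at most `limit` characters, closing open tags
    at the end of each chunk and reopening them at the start of the next."""
    chunks: list[str] = []
    open_tags: list[tuple[str, str]] = []  # stack of (tag name, full opening tag)
    current = ""

    def closers() -> str:
        return "".join(f"</{name}>" for name, _ in reversed(open_tags))

    def flush() -> None:
        nonlocal current
        chunks.append(current + closers())
        current = "".join(tag for _, tag in open_tags)  # reopen tags

    for token in TAG.split(html_text):
        if not token:
            continue
        if TAG.fullmatch(token):
            if token.startswith("</"):
                # Room for this closing tag was already reserved via closers()
                current += token
                if open_tags:
                    open_tags.pop()
            else:
                name = re.match(r"<\s*([a-zA-Z0-9-]+)", token).group(1)
                needed = len(token) + len(f"</{name}>") + len(closers())
                if len(current) + needed > limit:
                    flush()
                current += token
                open_tags.append((name, token))
            continue
        while token:  # plain text: fill the chunk, then start a new one
            room = limit - len(current) - len(closers())
            if room <= 0:
                if current == "".join(tag for _, tag in open_tags):
                    raise ValueError("limit is too small for the open tags")
                flush()
                continue
            current, token = current + token[:room], token[room:]
            if token:
                flush()
    if current:
        chunks.append(current)
    return chunks


print(split_html_message("<b>" + "a" * 10 + "</b>", limit=10))
```

Splitting at a paragraph or sentence boundary instead of at an arbitrary character would read better; that refinement is left out here to keep the sketch short.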

Latand commented 5 months ago

Thank you for your feedback and for highlighting the issue. It's important to clarify that Telegram's handling of text within <code> and <pre> tags is designed to interpret the content as preformatted text.

This means that within these tags, HTML reserved symbols are not treated the same way as they are outside these tags, allowing for the direct inclusion of characters like &, <, and > without the need for converting them to HTML entities (&amp;, &lt;, and &gt;). Your implementation to convert these symbols to their HTML character references is thoughtful.

However, in the context of Telegram's formatting, this conversion within <code> and <pre> blocks might not be necessary. If you're encountering errors with these symbols within <code> tags, could you please share a screenshot of the issue?

In my experience, such problems typically arise only if the tags are not properly closed, leading to parsing issues.

SlamperBOOM commented 5 months ago

Hello!

Here is a screenshot with debug logs showing the Telegram error that occurs:

"Model answer:" is what came directly from the model, unformatted. "Writing answer:" is what I got from your formatter without my fix. As you can see, the code block is surrounded by <pre><code> tags and there is an <iostream> inside of it. However, Telegram considers <iostream> a tag and throws an exception because it doesn't know that tag.

And that's what i got with my fix:

I replaced < and > with their codes and Telegram successfully processed this message. Maybe there is an issue with aiogram3, which I use to run my bot, but I don't think so.

I also want to ask you about the three other issues. Of course you're not obliged to fix them, but maybe you have some thoughts. I want to hear them anyway) If you want to see my code, I can send you a link to my repo.

Thank you in advance

Latand commented 4 months ago

You're correct, I see now! Thank you for pointing that out!

tri6odin commented 3 months ago

Remove unneeded links to vector storage like 【4:0†source】:

output = re.sub(r'【[^】]+】', '', text)

Latand commented 3 months ago

@SlamperBOOM @tri6odin

Could you please create pull requests with your suggested changes? Additionally, please include tests and ensure that all tests pass successfully. Thank you!

Latand / formatter-chatgpt-telegram

Some converting issues #1
