RapidRabbit-11485 / PNGTuber-GPT

This is a custom C# action for Streamer.bot and Speaker.bot to add a GPT-based PNGTuber to your stream!
MIT License

RemoveEmojis method incorrectly removes UTF-8 characters #22

Open RapidRabbit-11485 opened 8 months ago

RapidRabbit-11485 commented 8 months ago

Some users running the tool in languages other than English have reported that non-ASCII characters (accented letters, for example) are being stripped from responses unintentionally. Further investigation, with help from AI analysis, traced the problem to the RemoveEmojis() method. This method exists to strip emojis from the response, because Speaker.bot cannot pronounce them properly. The regex pattern it uses removes every non-ASCII character, which casts far too wide a net.

AI recommended the following code change:

private string RemoveEmojis(string text)
{
    // ... (logging code remains the same)

    // Regular expression pattern to match emoji characters specifically
    string emojiPattern = @"[\p{Emoji}]";  // Use Unicode property for emoji characters

    // Log the updated regex pattern
    LogToFile($"Using regex pattern to remove emojis: {emojiPattern}", "DEBUG");

    // Replace only emoji characters with empty string
    string sanitizedText = Regex.Replace(text, emojiPattern, "");

    // ... (rest of the code remains the same)
}

Explanation of changes:

  1. Updated regex pattern:
    • The original pattern [^\u0000-\u007F]+ matched any character outside the ASCII range, which inadvertently removed non-ASCII characters like ã and ç.
    • The new pattern [\p{Emoji}] specifically targets emoji characters using the Unicode "Emoji" property.
  2. Corrected logging comment:
    • The comment for nonAsciiPattern was updated to reflect the corrected pattern name.
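
To make the two patterns above concrete, here is a minimal sketch, not taken from the PNGTuber-GPT source. The class and method names (PatternDemo, StripNonAscii, IsPatternSupported) are hypothetical. The first helper applies the original pattern so its effect on accented text can be seen; the second simply asks the .NET regex engine whether it accepts a given pattern, which is worth checking for [\p{Emoji}] because, as far as I know, .NET only recognizes Unicode general categories and named blocks in \p{...}.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Applies the ORIGINAL pattern from the issue: everything outside
    // U+0000..U+007F is removed, emoji and accented letters alike.
    public static string StripNonAscii(string text)
    {
        return Regex.Replace(text, @"[^\u0000-\u007F]+", "");
    }

    // Returns true if the .NET regex engine can parse the pattern.
    // An unrecognized \p{...} name makes the Regex constructor throw
    // ArgumentException (or a subclass of it).
    public static bool IsPatternSupported(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

With this sketch, PatternDemo.StripNonAscii("ação 😀") returns "ao ", which matches the behavior users reported, and PatternDemo.IsPatternSupported(@"[\p{Emoji}]") can be used to confirm whether a given runtime accepts the proposed pattern before relying on it.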

Key points:

I am suggesting a modification to the RemoveEmojis() method that would allow for the removal of emojis without interfering with other character sets.

RapidRabbit-11485 commented 8 months ago

Dangers of trusting AI: the recommended code did not work. However, I believe we did arrive at a regex pattern that does: emojiPattern = @"[\uD83C-\uDBFF\uDC00-\uDFFF]". I'm still testing, but it does at least remove the emojis. Supposedly this pattern is more narrowly scoped to just the emoji range.
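
A note for anyone testing this: .NET strings are UTF-16, so the pattern above matches individual surrogate code units (high surrogates U+D83C to U+DBFF and every low surrogate), not an emoji range as such. That removes everything from U+1F000 upward, takes only the low half of characters in U+10000 to U+1EFFF (leaving a lone high surrogate behind), and does not touch emoji in the Basic Multilingual Plane such as ☀ (U+2600). Below is a hedged sketch of a narrower alternative that matches whole surrogate pairs. The class name EmojiStripper and the chosen ranges are my assumptions, the ranges are deliberately non-exhaustive, and the logging from the real method is omitted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmojiStripper
{
    // Assumed, non-exhaustive emoji ranges (illustrative only):
    //  - whole surrogate pairs for U+1F000..U+1FBFF (most pictographic emoji,
    //    flags, skin-tone modifiers)
    //  - U+2600..U+27BF (Miscellaneous Symbols and Dingbats)
    //  - U+FE0F (variation selector-16) and U+200D (zero-width joiner),
    //    which would otherwise be left behind as invisible debris
    private static readonly Regex EmojiPattern = new Regex(
        @"[\uD83C-\uD83E][\uDC00-\uDFFF]" +
        @"|[\u2600-\u27BF]" +
        @"|[\uFE0F\u200D]");

    public static string RemoveEmojis(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }
        // Only characters in the ranges above are removed; accented
        // letters, CJK text, and other scripts pass through untouched.
        return EmojiPattern.Replace(text, string.Empty);
    }
}
```

Under these assumptions, EmojiStripper.RemoveEmojis("Olá, ação! 😀") returns "Olá, ação! ". The trade-off is that any emoji outside the listed ranges will still reach Speaker.bot, so the ranges would need extending as gaps turn up in testing.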