KittehOrg / KittehIRCClientLib

An IRC client library in Java
https://kitteh.dev/kicl/
MIT License

Default cutter does not account for Unicode characters encoded as multiple bytes #307

Closed zachbr closed 11 months ago

zachbr commented 1 year ago

Expected behavior

I expect strings containing Unicode characters that are encoded as multiple bytes to pass through KICL intact and display correctly on IRC servers and their connected clients.

Actual behavior

Messages lose their last few characters when they include zero-width spaces or other multi-byte Unicode characters (Unicode emoji may be a more relevant example here). This appears to be caused by IRC servers silently truncating messages that are too long.

Stacktrace

There is no stacktrace.

Analysis

Per the IRC RFC:

   IRC messages are always lines of characters terminated with a CR-LF
   (Carriage Return - Line Feed) pair, and these messages SHALL NOT
   exceed 512 characters in length, counting all characters including
   the trailing CR-LF. Thus, there are 510 characters maximum allowed
   for the command and its parameters.  There is no provision for
   continuation of message lines.  See section 6 for more details about
   current implementations.

This RFC is from the good ol' days of plain C, so 512 characters really means 512 bytes. Despite that, most IRC networks now support the UTF-8 encoding, and many languages (like Java) no longer treat characters as plain ASCII. Because the Cutter class splits messages into words and limits each piece to a maximum of 512 characters, counted as Java chars (UTF-16 code units), it fails to handle cases where those characters end up encoded as multiple bytes.
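As a small illustration of the mismatch (not code from KICL; the class and method names here are made up for the example), the following compares String.length() with the UTF-8 encoded size for the same "<test>" prefix used in the reproduction, with and without zero-width spaces:

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        char zwsp = 0x200B; // zero-width space, 3 bytes in UTF-8
        String plain = "<test>";
        String withZwsp = "<" + zwsp + "t" + zwsp + "e" + zwsp + "s" + zwsp + "t" + zwsp + ">";

        // length() counts UTF-16 code units, not encoded bytes
        System.out.println(plain.length() + " chars, " + utf8Length(plain) + " bytes");
        System.out.println(withZwsp.length() + " chars, " + utf8Length(withZwsp) + " bytes");
    }
}
```

The plain string is 6 chars and 6 bytes; the zero-width-space variant is 11 chars but 21 bytes, so any limit checked with length() undercounts what goes on the wire.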

IRC servers, even though almost all of them now support UTF-8, still enforce the message length limit in bytes. Most servers enforce it by truncating the message to the allowed limit. On IRC servers that do not silently truncate, an error message can be observed. The error below was observed from the Ergo IRC server (formerly Oragono):
[I] 417 KiclTestBot :Line too long to be relayed without truncation
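For context, the byte budget left for the message text can be estimated by subtracting the CR-LF, the prefix the server prepends when relaying, and the command and target from the 512-byte limit. The sketch below is only an estimate under those assumptions: the class name, method name, and hostmask are hypothetical, and this is not necessarily how KICL computes its own limit.

```java
import java.nio.charset.StandardCharsets;

public class PayloadBudget {

    static final int MAX_LINE_BYTES = 512; // RFC limit, including the trailing CR-LF

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Bytes available for the trailing text of a relayed line of the form
     * ":hostmask COMMAND target :text\r\n".
     */
    static int payloadBudget(String hostmask, String command, String target) {
        String prefix = ":" + hostmask + " ";       // added by the server when relaying
        String head = command + " " + target + " :";
        return MAX_LINE_BYTES - 2 - utf8Length(prefix) - utf8Length(head);
    }

    public static void main(String[] args) {
        // Hypothetical hostmask; the real one depends on the network.
        System.out.println(payloadBudget("nick!user@host", "PRIVMSG", "#CHANNEL"));
    }
}
```

With the hypothetical hostmask above this yields 476 bytes; a longer hostmask or channel name shrinks the budget accordingly.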

Sure enough, adding some println calls at the end of the message cutter shows the first element encoding to more bytes than the remaining budget allows:

List element at index 0 is of length 448 encoded as 452 bytes. Max bytes remaining for message is 449.
List element at index 1 is of length 442 encoded as 442 bytes. Max bytes remaining for message is 449.
List element at index 2 is of length 186 encoded as 186 bytes. Max bytes remaining for message is 449.

The fix for this would be making the message cutter operate on encoded bytes rather than characters. I have done a rudimentary test and that appears to fix this, although it's a bit less nice to look at.
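A minimal sketch of what cutting on encoded bytes could look like is shown here. It is not the change that was tested or eventually committed; ByteAwareCutter and its methods are hypothetical names, and UTF-8 is assumed as the wire encoding. Words are packed greedily by byte count, and a single word that is too long on its own is hard-split on code point boundaries so no character is cut in half.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteAwareCutter {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Splits on spaces so every piece encodes to at most maxBytes UTF-8 bytes. */
    static List<String> split(String message, int maxBytes) {
        if (maxBytes < 4) {
            throw new IllegalArgumentException("maxBytes must fit any code point (>= 4)");
        }
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentBytes = 0;
        for (String word : message.split(" ")) {
            int wordBytes = utf8Length(word);
            int sep = current.length() == 0 ? 0 : 1; // space before the word, if any
            if (currentBytes + sep + wordBytes <= maxBytes) {
                if (sep == 1) current.append(' ');
                current.append(word);
                currentBytes += sep + wordBytes;
                continue;
            }
            // Word does not fit on the current piece: flush and start a new one.
            if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
                currentBytes = 0;
            }
            // Append code point by code point, flushing whenever the piece is full.
            for (int i = 0; i < word.length(); ) {
                int cp = word.codePointAt(i);
                int cpBytes = utf8Length(new String(Character.toChars(cp)));
                if (currentBytes + cpBytes > maxBytes) {
                    out.add(current.toString());
                    current.setLength(0);
                    currentBytes = 0;
                }
                current.appendCodePoint(cp);
                currentBytes += cpBytes;
                i += Character.charCount(cp);
            }
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }
}
```

Calling ByteAwareCutter.split(text, budget) with the byte budget for the line returns pieces that each stay within that budget once encoded, regardless of how many chars they contain.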

To Reproduce

Using the code below, notice that the message sent in IRC drops characters from the first message based on whether there are zero-width space characters present.

package org.example;

import net.engio.mbassy.listener.Handler;
import org.kitteh.irc.client.library.Client;
import org.kitteh.irc.client.library.event.channel.ChannelMessageEvent;

import java.text.SimpleDateFormat;
import java.util.Date;

public class Main {

    public static void main(String[] args) {
        new Main();
    }

    private Client ircClient;
    private char zwsp = 0x200B;
    private String channelName = "#CHANNEL";
    private String loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Senectus et netus et malesuada fames ac. Laoreet suspendisse interdum consectetur libero id faucibus. Justo nec ultrices dui sapien. Ultricies tristique nulla aliquet enim tortor at auctor urna nunc. Amet mauris commodo quis imperdiet massa tincidunt nunc. Consequat mauris nunc congue nisi. Sed adipiscing diam donec adipiscing. Porta lorem mollis aliquam ut porttitor leo. Sed nisi lacus sed viverra tellus. Nibh mauris cursus mattis molestie a iaculis. Pulvinar elementum integer enim neque volutpat ac tincidunt vitae. Dictumst vestibulum rhoncus est pellentesque elit ullamcorper.";
    private String message = "<test> " + loremIpsum;
    private String messageWithZwsp = "<" + zwsp + "t" + zwsp + "e" + zwsp + "s" + zwsp + "t" + zwsp + "> " + loremIpsum;

    Main() {
        var ircNick = "KiclTestBot";

        SimpleDateFormat sdf = new SimpleDateFormat("mm:ss");
        ircClient = Client.builder()
                .server()
                .host("irc.esper.net")
                .port(6697, Client.Builder.Server.SecurityType.SECURE)
                .then()
                .nick(ircNick)
                .name(ircNick)
                .user(ircNick)
                .listeners()
                .input(line -> System.out.println(sdf.format(new Date()) + ' ' + "[I] " + line))
                .output(line -> System.out.println(sdf.format(new Date()) + ' ' + "[O] " + line))
                .exception(Throwable::printStackTrace)
                .then()
                .buildAndConnect();

        ircClient.setExceptionListener(Throwable::printStackTrace);
        ircClient.getEventManager().registerEventListener(this);

        ircClient.addChannel(channelName);
        ircClient.sendMessage(channelName, "Hello world!");
    }

    @Handler
    public void onMessage(ChannelMessageEvent event) {
        ircClient.sendMessage(channelName, "-- new msg start --");
        ircClient.sendMultiLineMessage(channelName, message);

        ircClient.sendMessage(channelName, "-- new msg w/ zwsp start --");
        ircClient.sendMultiLineMessage(channelName, messageWithZwsp);

        ircClient.sendMessage(channelName, "-- end --");
    }
}

Version information

Additional context

Tested on EsperNet and OFTC.

Output from input/output listeners:

51:05 [O] PRIVMSG #CHANNEL :-- new msg start --
51:06 [O] PRIVMSG #CHANNEL :<test> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Senectus et netus et malesuada fames ac. Laoreet suspendisse interdum consectetur libero id faucibus. Justo nec ultrices dui sapien. Ultricies tristique nulla aliquet enim tortor at auctor urna nunc. Amet mauris commodo quis imperdiet massa tincidunt nunc. Consequat mauris nunc congue nisi. Sed adipiscing diam donec
51:07 [O] PRIVMSG #CHANNEL :adipiscing. Porta lorem mollis aliquam ut porttitor leo. Sed nisi lacus sed viverra tellus. Nibh mauris cursus mattis molestie a iaculis. Pulvinar elementum integer enim neque volutpat ac tincidunt vitae. Dictumst vestibulum rhoncus est pellentesque elit ullamcorper.
51:09 [O] PRIVMSG #CHANNEL :-- new msg w/ zwsp start --
51:10 [O] PRIVMSG #CHANNEL :<​t​e​s​t​> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Senectus et netus et malesuada fames ac. Laoreet suspendisse interdum consectetur libero id faucibus. Justo nec ultrices dui sapien. Ultricies tristique nulla aliquet enim tortor at auctor urna nunc. Amet mauris commodo quis imperdiet massa tincidunt nunc. Consequat mauris nunc congue nisi. Sed adipiscing diam donec
51:11 [O] PRIVMSG #CHANNEL :adipiscing. Porta lorem mollis aliquam ut porttitor leo. Sed nisi lacus sed viverra tellus. Nibh mauris cursus mattis molestie a iaculis. Pulvinar elementum integer enim neque volutpat ac tincidunt vitae. Dictumst vestibulum rhoncus est pellentesque elit ullamcorper.
51:12 [O] PRIVMSG #CHANNEL :-- end --

What actually comes through on IRC: [screenshots: TheLounge v4.4.0, HexChat v2.16.1]

mbax commented 11 months ago

Resolved in https://github.com/KittehOrg/KittehIRCClientLib/commit/c2f886012505394db95d4f58f405812cf4de4124