Comcast / cmb

This project is no longer actively supported. It is made available as read-only. A highly available, horizontally scalable queuing and notification service compatible with AWS SQS and SNS
Apache License 2.0
277 stars 50 forks source link

isValidUnicode() in src/com/comcast/cmb/common/util/Util.java #39

Closed oldshuren closed 9 years ago

oldshuren commented 9 years ago

The function isValidUnicode() should check 4 bytes unicode. I found this bug when we tried to using emoji :)

The comment in that function actually already included the 4 bytes unicode range, I guess the author just forget implement it.

Here is modified isValidUnicode()

public static boolean isValidUnicode(String msg) {
    //[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]
    char[] chs = msg.toCharArray();
    for (int i = 0; i < chs.length; i++) {
        if (chs[i] == '\n' || chs[i] == '\t' || chs[i] == '\r' || (chs[i] >= '\u0020' && chs[i] <= '\uD7FF') || (chs[i] >= '\uE000' && chs[i] <= '\uFFFD')) {
            continue;               
        } else if (i<chs.length) { // check for 4 bytes unicode, 2 char (utf16)
            if ((chs[i] >= '\uD800' && chs[i+1] >= '\uDC00') && (chs[i] <= '\uDBFF' && chs[i+1] <= '\uDFFF')) {
                i++; // skip the next char
                continue;
            }
        } else {
            return false;
        }
    }
    return true;
}
boriwo commented 9 years ago

checked into head

oldshuren commented 9 years ago

Sorry, I think there is a bug in my fix, here the new one.

public static boolean isValidUnicode(String msg) { //[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t] char[] chs = msg.toCharArray(); for (int i = 0; i < chs.length; i++) { if (chs[i] == '\n' || chs[i] == '\t' || chs[i] == '\r' || (chs[i] >= '\u0020' && chs[i] <= '\uD7FF') || (chs[i] >= '\uE000' && chs[i] <= '\uFFFD')) { continue;
} else if (i<chs.length-1) { // check for 4 bytes unicode, 2 char (utf16) if ((chs[i] >= '\uD800' && chs[i+1] >= '\uDC00') && (chs[i] <= '\uDBFF' && chs[i+1] <= '\uDFFF')) { i++; // skip the next char continue; } } else { return false; } } return true; }

oldshuren commented 9 years ago

Better formating,

public static boolean isValidUnicode(String msg) {
    //[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]
    char[] chs = msg.toCharArray();
    for (int i = 0; i < chs.length; i++) {
        if (chs[i] == '\n' || chs[i] == '\t' || chs[i] == '\r' || (chs[i] >= '\u0020' && chs[i] <= '\uD7FF') || (chs[i] >= '\uE000' && chs[i] <= '\uFFFD')) {
            continue;               
        } else if (i<chs.length-1) { // check for 4 bytes unicode, 2 char (utf16)
            if ((chs[i] >= '\uD800' && chs[i+1] >= '\uDC00') && (chs[i] <= '\uDBFF' && chs[i+1] <= '\uDFFF')) {
                i++; // skip the next char
                continue;
            }
        } else {
            return false;
        }
    }
    return true;
}
boriwo commented 9 years ago

Just checked in. For future reference, feel free to submit your bug fixes / code changes as pull requests directly :)

hayarobi commented 8 years ago

It still seems to have problem. The patched code can't detect invalid surrogate pairs or orphaned surrogate characters in the middle of string, and is not conform to unicode standard - every codepoints except invalid surrogate pair is allowed in UTF-16; (JAVA String is the form of UTF-16)

My Code is below. `private static final char HIGH_SURROGATE_MIN = '\uD800'; private static final char HIGH_SURROGATE_MAX = '\uDBFF'; private static final char LOW_SURROGATE_MIN = '\uDC00'; private static final char LOW_SURROGATE_MAX = '\uDFFF';

public static boolean isValidUnicode(String msg) {
    int len = msg.length();
    for (int i = 0; i < len; i++) {
        char ch = msg.charAt(i);
        if (HIGH_SURROGATE_MIN <= ch && ch <= HIGH_SURROGATE_MAX) {
            // high surrogate should be followed by low surrogate.
            if (i + 1 < len) {
                char nextCh = msg.charAt(++i);
                if (LOW_SURROGATE_MIN <= nextCh
                        && nextCh <= LOW_SURROGATE_MAX) {
                    continue;
                }
            }
            // The case of no next character or not low surrogate is treated as malformed UTF-16.
            return false;
        } else if (LOW_SURROGATE_MIN <= ch && ch <= LOW_SURROGATE_MAX) {
            // low surrogate without  high surrogate is also treated as malformed UTF-16.
            return false;
        }
        // Anything else are allowed.
    }
    return true;
}

`

The testcode is below. ` public class UtilTest { public static final String EMOJI_SAMPLE = "😀😁AA😉😯BB";

public static final char HIGH = '\uD83D';
public static final char LOW = '\uDE33';

public static final String INTACT_NORMAL_SURROGATE = "A😀"+HIGH+LOW+"😯";
public static final String WRONG_SINGLE_HIGH_SURROGATE = "A😀"+HIGH+"😯";
public static final String WRONG_SINGLE_LOW_SURROGATE = "A😀"+LOW+"😯";
public static final String WRONG_FLIPPED_SURROGATE = "A😀"+LOW+HIGH+"😯";

/**
 * Test method for {@link net.samsung.smail.bms.common.util.Util#isValidUnicode(java.lang.String)}.
 */
@Test
public void testIsValidUnicode() {
    assertTrue(Util.isValidUnicode(EMOJI_SAMPLE));

    assertTrue(Util.isValidUnicode(INTACT_NORMAL_SURROGATE));
    assertFalse(Util.isValidUnicode(WRONG_SINGLE_HIGH_SURROGATE));
    assertFalse(Util.isValidUnicode(INTACT_NORMAL_SURROGATE+HIGH));

    assertFalse(Util.isValidUnicode(WRONG_SINGLE_LOW_SURROGATE));
    assertFalse(Util.isValidUnicode(INTACT_NORMAL_SURROGATE+LOW));

    assertFalse(Util.isValidUnicode(WRONG_FLIPPED_SURROGATE));
    assertFalse(Util.isValidUnicode(INTACT_NORMAL_SURROGATE+LOW+HIGH));

}

} `