Closed oldshuren closed 9 years ago
checked into head
Sorry, I think there is a bug in my fix, here the new one.
public static boolean isValidUnicode(String msg) {
//[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]
char[] chs = msg.toCharArray();
for (int i = 0; i < chs.length; i++) {
if (chs[i] == '\n' || chs[i] == '\t' || chs[i] == '\r' || (chs[i] >= '\u0020' && chs[i] <= '\uD7FF') || (chs[i] >= '\uE000' && chs[i] <= '\uFFFD')) {
continue;
} else if (i<chs.length-1) { // check for 4 bytes unicode, 2 char (utf16)
if ((chs[i] >= '\uD800' && chs[i+1] >= '\uDC00') && (chs[i] <= '\uDBFF' && chs[i+1] <= '\uDFFF')) {
i++; // skip the next char
continue;
}
} else {
return false;
}
}
return true;
}
Better formating,
public static boolean isValidUnicode(String msg) {
//[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\r\n\t]
char[] chs = msg.toCharArray();
for (int i = 0; i < chs.length; i++) {
if (chs[i] == '\n' || chs[i] == '\t' || chs[i] == '\r' || (chs[i] >= '\u0020' && chs[i] <= '\uD7FF') || (chs[i] >= '\uE000' && chs[i] <= '\uFFFD')) {
continue;
} else if (i<chs.length-1) { // check for 4 bytes unicode, 2 char (utf16)
if ((chs[i] >= '\uD800' && chs[i+1] >= '\uDC00') && (chs[i] <= '\uDBFF' && chs[i+1] <= '\uDFFF')) {
i++; // skip the next char
continue;
}
} else {
return false;
}
}
return true;
}
Just checked in. For future reference, feel free to submit your bug fixes / code changes as pull requests directly :)
It still seems to have problem. The patched code can't detect invalid surrogate pairs or orphaned surrogate characters in the middle of string, and is not conform to unicode standard - every codepoints except invalid surrogate pair is allowed in UTF-16; (JAVA String is the form of UTF-16)
My Code is below. `private static final char HIGH_SURROGATE_MIN = '\uD800'; private static final char HIGH_SURROGATE_MAX = '\uDBFF'; private static final char LOW_SURROGATE_MIN = '\uDC00'; private static final char LOW_SURROGATE_MAX = '\uDFFF';
public static boolean isValidUnicode(String msg) {
int len = msg.length();
for (int i = 0; i < len; i++) {
char ch = msg.charAt(i);
if (HIGH_SURROGATE_MIN <= ch && ch <= HIGH_SURROGATE_MAX) {
// high surrogate should be followed by low surrogate.
if (i + 1 < len) {
char nextCh = msg.charAt(++i);
if (LOW_SURROGATE_MIN <= nextCh
&& nextCh <= LOW_SURROGATE_MAX) {
continue;
}
}
// The case of no next character or not low surrogate is treated as malformed UTF-16.
return false;
} else if (LOW_SURROGATE_MIN <= ch && ch <= LOW_SURROGATE_MAX) {
// low surrogate without high surrogate is also treated as malformed UTF-16.
return false;
}
// Anything else are allowed.
}
return true;
}
`
The testcode is below. ` public class UtilTest { public static final String EMOJI_SAMPLE = "😀😁AA😉😯BB";
public static final char HIGH = '\uD83D';
public static final char LOW = '\uDE33';
public static final String INTACT_NORMAL_SURROGATE = "A😀"+HIGH+LOW+"😯";
public static final String WRONG_SINGLE_HIGH_SURROGATE = "A😀"+HIGH+"😯";
public static final String WRONG_SINGLE_LOW_SURROGATE = "A😀"+LOW+"😯";
public static final String WRONG_FLIPPED_SURROGATE = "A😀"+LOW+HIGH+"😯";
/**
* Test method for {@link net.samsung.smail.bms.common.util.Util#isValidUnicode(java.lang.String)}.
*/
@Test
public void testIsValidUnicode() {
assertTrue(Util.isValidUnicode(EMOJI_SAMPLE));
assertTrue(Util.isValidUnicode(INTACT_NORMAL_SURROGATE));
assertFalse(Util.isValidUnicode(WRONG_SINGLE_HIGH_SURROGATE));
assertFalse(Util.isValidUnicode(INTACT_NORMAL_SURROGATE+HIGH));
assertFalse(Util.isValidUnicode(WRONG_SINGLE_LOW_SURROGATE));
assertFalse(Util.isValidUnicode(INTACT_NORMAL_SURROGATE+LOW));
assertFalse(Util.isValidUnicode(WRONG_FLIPPED_SURROGATE));
assertFalse(Util.isValidUnicode(INTACT_NORMAL_SURROGATE+LOW+HIGH));
}
} `
The function isValidUnicode() should check 4 bytes unicode. I found this bug when we tried to using emoji :)
The comment in that function actually already included the 4 bytes unicode range, I guess the author just forget implement it.
Here is modified isValidUnicode()