Open wataradio opened 1 year ago
For example, the UTF-8 byte sequences of the following characters end with \xe2, \x80 or \x8b, so the same problem occurs.
Thanks for the extra info, I think I now understand the cause. The intent of the trim() was to remove U+200B, a multibyte character that is three bytes long in UTF-8. However, because trim() is not multibyte aware, it treats those three bytes as three separate characters and strips any of them from the ends of the string.
So should we use str_replace() here? Does that work?
$tags = str_replace("\xe2\x80\x8b", '', $tags); // strip word/wordpad breaklines(U+200b)
Thanks, I think it works well.
I confirmed that the following small test code works as expected.
<?php
$str = "\xE4\xB8\x80"; // "一"
$zero_width_space = "\xe2\x80\x8b"; // U+200b ZERO WIDTH SPACE
$tags = $zero_width_space . $str . $zero_width_space;
$tags = str_replace("\xe2\x80\x8b", '', $tags);
// expected output: \xe4\xb8\x80
$bytes = unpack('C*', $tags);
foreach ($bytes as $byte) {
echo '\\x' . dechex($byte);
}
?>
Thanks for testing. trim() works only on the ends of the string, while str_replace() works everywhere in it. I think that is fine for tags. I will implement it.
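If stripping only at the ends of the string were ever preferred over removing U+200B everywhere, a multibyte-aware variant could look roughly like the sketch below. This is not what the thread settled on and it is not part of the Tag plugin; the helper name strip_zwsp_at_ends is hypothetical. It relies on the PCRE /u modifier so that \x{200B} is matched as one code point instead of three independent bytes.

```php
<?php
// Hypothetical helper, not part of the Tag plugin.
function strip_zwsp_at_ends(string $s): string
{
    // With the /u modifier, \x{200B} matches the whole code point U+200B.
    // \z anchors at the very end of the string.
    $result = preg_replace('/^\x{200B}+|\x{200B}+\z/u', '', $s);
    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // in that case return the input unchanged.
    return $result === null ? $s : $result;
}

$zwsp = "\xe2\x80\x8b"; // U+200B ZERO WIDTH SPACE

// Leading and trailing U+200B are removed; the bytes of "一" are untouched.
echo bin2hex(strip_zwsp_at_ends($zwsp . "\xE4\xB8\x80" . $zwsp)), "\n"; // e4b880

// A U+200B in the middle of the string is kept, unlike with str_replace().
echo bin2hex(strip_zwsp_at_ends("a" . $zwsp . "b")), "\n"; // 61e2808b62
```

For tags, the str_replace() approach is simpler and also removes U+200B from the middle of the string, which is probably what is wanted here.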
The Tag plugin doesn't work when specific Japanese characters, e.g. '一' (U+4E00), exist in a tag, as follows. This is because the UTF-8 byte sequence of '一' (\xE4\xB8\x80) gets corrupted by the following code in syntax_plugin_tag_tag::handle (tag.php). It removes \x80 from \xE4\xB8\x80, and the result becomes the invalid sequence \xE4\xB8.
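The corruption can be reproduced in isolation with a few lines. This is a minimal sketch that assumes the plugin's call had roughly the shape trim($tags, "\xe2\x80\x8b"); the exact original line is not quoted here. The validity check uses preg_match() with the /u modifier, which fails on invalid UTF-8.

```php
<?php
// Assumption: the plugin called trim() with the three bytes of U+200B
// as its character list; the exact original line is not reproduced here.
$tag = "\xE4\xB8\x80"; // "一" (U+4E00)

// trim() treats the list as three independent bytes: \xe2, \x80 and \x8b.
// The trailing \x80 of "一" is in that list, so it is stripped.
$trimmed = trim($tag, "\xe2\x80\x8b");

echo bin2hex($tag), "\n";     // e4b880
echo bin2hex($trimmed), "\n"; // e4b8

// An empty pattern with /u matches any valid UTF-8 string and fails on
// invalid input, so this prints bool(false) for the truncated sequence.
var_dump(preg_match('//u', $trimmed) === 1);
```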