dokufreaks / plugin-tag

Assign category tags to wiki pages
http://dokuwiki.org/plugin:tag
GNU General Public License v2.0

doesn't work when specific Japanese characters exist in a tag #250

Open wataradio opened 1 year ago

wataradio commented 1 year ago

The Tag plugin doesn't work when a tag contains certain Japanese characters, e.g. '一' (U+4E00), as in the following:

{{tag> 一}}

This happens because the UTF-8 byte sequence of '一' (\xE4\xB8\x80) gets corrupted by the following code in syntax_plugin_tag_tag::handle() (tag.php):

$tags = trim($tags, "\xe2\x80\x8b"); // strip word/wordpad breaklines(U+200b)

It removes the trailing \x80 from \xE4\xB8\x80 (the UTF-8 byte sequence of '一'), and the result is the invalid sequence \xE4\xB8.
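
The corruption can be reproduced in isolation. The following is a minimal sketch rather than the plugin's actual code; the helper name `hex_bytes()` is made up for illustration. It shows that trim() treats its second argument as a set of single bytes, not as one three-byte character.

```php
<?php
// Hypothetical helper (not part of the plugin): render a string as \xNN escapes.
function hex_bytes(string $s): string
{
    $out = '';
    for ($i = 0; $i < strlen($s); $i++) {
        $out .= sprintf('\\x%02x', ord($s[$i]));
    }
    return $out;
}

$tag = "\xE4\xB8\x80"; // '一' (U+4E00)

// trim() strips any of the bytes \xe2, \x80 and \x8b from both ends,
// so the trailing \x80 of '一' is removed.
$trimmed = trim($tag, "\xe2\x80\x8b");

echo hex_bytes($tag), "\n";     // \xe4\xb8\x80
echo hex_bytes($trimmed), "\n"; // \xe4\xb8

// preg_match() with the /u modifier returns 1 only for valid UTF-8 subjects.
var_dump(preg_match('//u', $trimmed) === 1); // bool(false)
```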

wataradio commented 1 year ago

For example, the UTF-8 byte sequences of the following characters end with \x80 or \x8b (or begin with \xe2), so the same problem occurs.
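
Whether a given character is affected can be checked with a short script. This is only a sketch: the helper name `is_damaged_by_trim()` is hypothetical, and the sample characters were chosen here for illustration, not taken from the original comment.

```php
<?php
// Hypothetical helper: true if the plugin's original trim() call would alter $char.
function is_damaged_by_trim(string $char): bool
{
    return trim($char, "\xe2\x80\x8b") !== $char;
}

// Illustrative samples only, written as byte escapes to avoid encoding issues.
$samples = [
    'U+4E00' => "\xE4\xB8\x80", // 一, ends with \x80
    'U+4E0B' => "\xE4\xB8\x8B", // 下, ends with \x8b
    'U+3080' => "\xE3\x82\x80", // む, ends with \x80
    'U+20AC' => "\xE2\x82\xAC", // €, begins with \xe2
    'U+3042' => "\xE3\x81\x82", // あ, no affected byte at either end
];

foreach ($samples as $codepoint => $char) {
    echo $codepoint, ': ', is_damaged_by_trim($char) ? 'damaged' : 'intact', "\n";
}
```

Of these samples, only U+3042 survives the trim() call unchanged.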

Klap-in commented 1 year ago

Thanks for the extra info, I think I now understand the cause. The intent of the trim() was to remove U+200B, i.e. a multibyte character made up of three bytes. However, because trim() is not multibyte-aware, it handles them as three separate single-byte characters.

So should we use str_replace() here instead? Does that work?

$tags = str_replace("\xe2\x80\x8b", '', $tags); // strip word/wordpad breaklines(U+200b)

wataradio commented 1 year ago

Thanks, I think it works well.

I confirmed that the following small test code works as expected.

<?php
$str = "\xE4\xB8\x80"; // "一"
$zero_width_space = "\xe2\x80\x8b"; // U+200b ZERO WIDTH SPACE
$tags = $zero_width_space . $str . $zero_width_space;

// remove every occurrence of the three-byte sequence as a whole
$tags = str_replace($zero_width_space, '', $tags);

// expected output: \xe4\xb8\x80
$bytes = unpack('C*', $tags);
foreach ($bytes as $byte) {
    echo '\\x' . dechex($byte);
}
?>

Klap-in commented 1 year ago

Thanks for testing. trim() only strips characters from the two ends of the string, whereas str_replace() removes every occurrence anywhere in it. I think that is fine for tags. I will implement it.
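
If stripping only at the ends were ever needed, a UTF-8 aware regular expression could keep the original trim() semantics. The following is a sketch of that alternative, not what the thread settled on; the function name `trim_zero_width_space()` is hypothetical.

```php
<?php
// Hypothetical helper: strip U+200B only from the start and end of $tags.
function trim_zero_width_space(string $tags): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{200B} matches the whole three-byte character.
    $result = preg_replace('/^\x{200B}+|\x{200B}+\z/u', '', $tags);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the unchanged input in that case.
    return $result ?? $tags;
}

$zwsp = "\xe2\x80\x8b"; // U+200B ZERO WIDTH SPACE
$tags = $zwsp . "\xE4\xB8\x80" . $zwsp . "\xE4\xB8\x8B" . $zwsp;

// The leading and trailing U+200B are removed; the one in the middle is kept.
$clean = trim_zero_width_space($tags);
echo bin2hex($clean), "\n"; // e4b880e2808be4b88b
```

Unlike the str_replace() approach, this leaves a zero-width space between two tags in place, so it is only preferable if interior occurrences should be preserved.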