cinchrb / cinch

The IRC Bot Building Framework
http://www.rubydoc.info/gems/cinch
MIT License
1k stars 180 forks source link

Fix a bug in splitting messages #189

Closed ochaochaocha3 closed 9 years ago

ochaochaocha3 commented 9 years ago

I fixed a bug in splitting messages containing multibyte characters.

Issue

For example, the following Japanese message got broken by split.

# UTF-8 (the bytesize of a Japanese character is 3)
# maxlength_without_end == 425
私はその人を常に先生と呼んでいた。だからここでもただ先生と書くだけで本名は打ち明けない。これは世間を憚かる遠慮というよりも、その方が私にとって自然だからである。私はその人の記憶を呼び起すごとに、すぐ「先生」といいたくなる。筆を執っても心持は同じ事である。よそよそしい頭文字などはとても使う気にならない。
----
# Using ngIRCd
23:28:41 (msg_split) 私はその人を常に先生と呼んでいた。だからここでもただ先生と書くだけで本名は打ち明けない。これは世間を憚かる遠慮というよりも、その方が私にとって自然だからである。私はその人の記憶を呼び起すごとに、すぐ「先生」といいたくなる。筆を執っても心持は同じ事である。よそよそしい頭文字などはとて�[CUT]
23:28:41 (msg_split) ...

Previously Cinch::Target#send split such messages at the wrong position that was calculated from the number of characters instead of the byte size.

dominikh commented 9 years ago

Thank you for your contribution. I took your patch and simplified it slightly, You can check it out in the byte-split branch (https://github.com/cinchrb/cinch/blob/byte-split/lib/cinch/target.rb)

Could you please test that branch and see if it works correctly for you?

ochaochaocha3 commented 9 years ago

Thank you for improving my code. I added some test cases and fixed the position calculation algorithm slightly (please see the commit message of ecb4d1c).

ochaochaocha3 commented 9 years ago

And I changed to use "\z" (substitute character; 0x1A) as the mark of a space. Because "\u00A0" is a 2-byte character, it shorten split messages.

dominikh commented 9 years ago

I'm afraid you messed up when you merged my byte-split branch into your branch. Now there's two implementations of the message splitting, once in the split_message method, and once in send, the latter containing your original code, not my modified one.

ochaochaocha3 commented 9 years ago

I'm sorry not to confirm the merge commit carefully. Could you please choose 8119249 or a new PR #190?

dominikh commented 9 years ago

I've merged your changes as 7b8a0181a1f4445f834244912e07834567d0cb3a. I did rewrite and reorder them some, as well as add an encoding header to the target.rb test file, but all in all it should be pretty close to the state in #190 (just with more of the history preserved.)

Thanks for your contribution.