trouble with utf-8 cyrillic symbols

kir3d commented 8 years ago

It works on the website, but I received output of the form |xxxx when I tried:

from cleverbot import Cleverbot

cb = Cleverbot()
message = cb.ask('привет')

brw commented 8 years ago

Temporary workaround: http://stackoverflow.com/a/11281948/5082094

eval('u"""' + message.replace('|', r'\u').replace('"', r'\"') + '-"""')[:-1]

kir3d commented 8 years ago

Thank you very much!

folz commented 8 years ago

Strongly recommend you avoid "eval" because of security reasons related to eval-ing untrusted input.

That said, do either of you have suggestions for resolving this? Coming from English, I don't know what to look for when working with non-ASCII characters.

kir3d commented 8 years ago

I see the danger in 'eval'.

English works perfectly; Cyrillic answers look like '|041F|0440|0438|0432|0435|0442.'

Python 3 won't let me replace "|" with "\u" (single backslash):

message.replace("|", r"\u") returns '\u041F\u0440\u0438\u0432\u0435\u0442.'
message.replace("|", "\u") raises SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
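The behavior described here can be reproduced with a short sketch (Python 3 assumed; the variable names are illustrative only). It shows that r"\u" is just two ordinary characters, so the replace produces literal escape text rather than Cyrillic letters, and that a non-raw "\u" is rejected when the source is compiled.

```python
# r"\u" is a raw string: a backslash followed by the letter "u".
raw = r"\u"
assert raw == "\\u" and len(raw) == 2

response = "|041F|0440|0438|0432|0435|0442."
replaced = response.replace("|", raw)

# The result holds literal backslash escapes, not decoded characters:
# six 6-character escapes plus the trailing full stop.
assert len(replaced) == 6 * 6 + 1
print(replaced)

# A non-raw "\u" is an incomplete escape, so Python refuses to compile it.
error = None
try:
    compile('"\\u"', "<example>", "eval")  # the compiled source text is "\u"
except SyntaxError as exc:
    error = exc
print(type(error).__name__)
```

Because the replaced text is never parsed as a string literal, something still has to turn each escape into a character; that step is what the eval workaround was doing.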

brw commented 8 years ago

> Strongly recommend you avoid "eval" because of security reasons related to eval-ing untrusted input.

Do you really expect Cleverbot to send back 'untrusted' input? Using eval is perfectly fine in this case.

folz commented 8 years ago

"Untrusted" in this context just means "anything that wasn't generated by your program", so yes.

brw commented 8 years ago

Any way eval could be abused in this case?

kir3d commented 8 years ago

Is Cleverbot self-learning? Can it reuse users' quotes? Yes. That means some user could at some point write a potentially dangerous command, Cleverbot could later use that command as an answer, and eval, given sufficient permissions, could execute it. IMHO.

brw commented 8 years ago

I just tested this:

import os

print(eval('u"""' + "os.system('echo Test')".replace('|', r'\u').replace('"', r'\"') + '-"""')[:-1])
print(eval("os.system('echo Test')"))

Output:

os.system('echo Test')
Test
0

Not sure, but I believe that this means that it is actually safe to use then? Or are there still other ways to abuse eval?

kir3d commented 8 years ago

http://nedbatchelder.com/blog/201206/eval_really_is_dangerous.html
http://stackoverflow.com/questions/13066594/is-there-a-way-to-secure-strings-for-pythons-eval
http://tav.espians.com/a-challenge-to-break-python-security.html

Maybe a sandbox would help: https://github.com/haypo/pysandbox or http://doc.pypy.org/en/latest/sandbox.html ?
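One option that nobody in this thread proposed, added here only as a sketch, is to keep the same wrapper but parse it with the standard library's ast.literal_eval, which accepts Python literals and does not evaluate calls or other expressions. The function name decode_with_literal_eval is hypothetical, and Python 3 is assumed.

```python
import ast


def decode_with_literal_eval(message):
    """Build the same string literal as the eval workaround, but only parse it."""
    source = 'u"""' + message.replace('|', r'\u').replace('"', r'\"') + '-"""'
    # literal_eval only accepts literals; it does not execute calls or names.
    return ast.literal_eval(source)[:-1]


print(decode_with_literal_eval('|041F|0440|0438|0432|0435|0442.'))
print(decode_with_literal_eval("os.system('echo Test')"))

# A pipe that is not followed by four hex digits becomes an invalid \u escape,
# so this approach (like the eval one) rejects such a message.
try:
    decode_with_literal_eval('yes | no')
    stray_pipe_ok = True
except (SyntaxError, ValueError):
    stray_pipe_ok = False
print(stray_pipe_ok)
```

The last case is a practical limitation shared with the eval version: an ordinary message that happens to contain a pipe character cannot be decoded this way.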

NyanKiyoshi commented 8 years ago

Otherwise, you could extract the Unicode data and replace it with the associated Unicode character: use a regex (or some other method) to find each hexadecimal UTF-16 code unit, convert it to an integer, and pass that to chr on Python 3 or unichr on Python 2.

For example:

>>> import re
>>> cleverbot_utf_16_word = re.compile(r'\|([0-9A-F]{4})')
>>> response = '|041F|0440|0438|0432|0435|0442.'
>>> re.sub(cleverbot_utf_16_word, lambda matchobj: chr(int(matchobj.group(1), 16)), response)
'Привет.'

It's pretty dirty, but it could be improved :)
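As one possible tidy-up of the regex approach above, the pattern can be compiled once and wrapped in a helper. This is a minimal sketch for Python 3; the names CLEVERBOT_ESCAPE and decode_cleverbot are hypothetical and not part of the library.

```python
import re

# Pattern from the comment above: a pipe followed by four uppercase hex digits.
CLEVERBOT_ESCAPE = re.compile(r'\|([0-9A-F]{4})')


def decode_cleverbot(text):
    """Replace each |XXXX sequence with the character it encodes."""
    return CLEVERBOT_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_cleverbot('|041F|0440|0438|0432|0435|0442.'))
# Text without a valid |XXXX sequence passes through unchanged.
print(decode_cleverbot('yes | no'))
```

Unlike the eval workaround, nothing here is parsed as Python source, so a stray pipe or quote in the reply is harmless. Characters outside the Basic Multilingual Plane would arrive as two code units and are not recombined by this sketch.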

kir3d commented 8 years ago

NyanKiyoshi, nice! Thank you!

brw commented 8 years ago

Ah that's way better, thanks @NyanKiyoshi!

pawollo commented 8 years ago

I have some trouble with German Unicode characters, too. Unfortunately, the method suggested by @NyanKiyoshi doesn't help. The problem shows up as follows: I can send any Unicode characters without problems, but when the bot answers with something like ä, ö, ü or ß, I get the following output:

Traceback (most recent call last):
  File "cbtest.py", line 21, in <module>
    main()
  File "cbtest.py", line 16, in main
    print('>> Cleverbot: {}'.format(answer))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

This problem is solved quite easily by leaving out the format() call; I can convert those characters later, before output. What gives me more trouble is that, one message later, the urlencode() function produces the following traceback:

You: Tschau!
Cleverbot: {}TschÃ¼ss.
You: Tschau Tschau
Traceback (most recent call last):
  File "cbtest.py", line 21, in <module>
    main()
  File "cbtest.py", line 15, in main
    answer = cleverbot_client.ask(question)
  File "/home/user/dev/python/cleverbot.py", line 96, in ask
    resp = self._send()
  File "/home/user/dev/python/cleverbot.py", line 133, in _send
    enc_data = urlencode(self.data)
  File "/usr/lib/python2.7/urllib.py", line 1349, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)

(I already feed the _parse() function with the regex patch from @NyanKiyoshi, so the input to urlencode should already be cleaned up.) When I try a workaround like http://stackoverflow.com/questions/6480723/, converting everything properly to the crappy iso-8859-1 (just to be sure I use the same encoding as the CB page), I get a DENIED error from the CB page. Probably the session token is calculated wrongly. But at least the urlencode() function works and I get an answer from the server.

A good way to test this is to pester the bot with "Hallo" (hello) until it speaks German to you, and then say "Tschüss!" (goodbye). It will probably answer "Tschüss" in the same way, which gives you test output containing a Unicode character from the bot. Otherwise, feed it single sentences taken from German Wikipedia pages.

--update-- This is the content of self.data before the crash (notice the last answer from the bot in 'vText2': u'F\xc3\xa4hrst du Auto.', which should be interpreted as 'Fährst du Auto.')

OrderedDict([(u'stimulus', 'nein!'), (u'cb_settings_language', u''), (u'cb_settings_scripting', u'no'), (u'islearning', 1), (u'icognoid', u'wsf'), (u'icognocheck', '37b69ae1ab44a5e9fdc1b9ca82f77abf'), (u'start', u'y'), (u'sessionid', u''), (u'vText8', u'.'), (u'vText7', ''), (u'vText6', u''), (u'vText5', 'ich spreche deutsch'), (u'vText4', u'Ich auch.'), (u'vText3', 'toll'), (u'vText2', u'F\xc3\xa4hrst du Auto.'), (u'fno', 0), (u'prevref', u''), (u'emotionaloutput', u''), (u'emotionalhistory', u''), (u'asbotname', u''), (u'ttsvoice', u''), (u'typing', u''), (u'lineref', u''), (u'sub', u'Say'), (u'cleanslate', False)])

Traceback (most recent call last):
  File "cbtest.py", line 21, in <module>
    main()
  File "cbtest.py", line 15, in main
    answer = cleverbot_client.ask(question)
  File "/home/user/dev/python/cleverbot.py", line 96, in ask
    resp = self._send()
  File "/home/user/dev/python/cleverbot.py", line 139, in _send
    enc_data = urlencode(self.data)
  File "/usr/lib/python2.7/urllib.py", line 1349, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

-- update 2-- Got the mojo working. I added the following test conversion before the urlencode() call:

for item in self.data:
    print(repr(item) + ' - ' + repr(self.data[item]))
    try:
        self.data[item] = self.data[item].encode('iso-8859-1')
    except:
        pass

I'm a little confused, because I tried this conversion earlier in the _parse() function and it didn't work for me there. As I'm quite new to Python (bloody new), I would ask you to pull this idea into the project code in a proper way.
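For readers on Python 3, a rough equivalent of encoding each value before urlencode is to pass the encoding to urllib.parse.urlencode directly. This is only a sketch: that the server expects ISO-8859-1 is taken from the comment above rather than verified, and the one-entry data dict is made up for illustration.

```python
from urllib.parse import urlencode

# Illustrative form data; the real self.data has many more fields.
data = {'stimulus': 'Tschüss!'}

# Default behaviour: non-ASCII characters are percent-encoded as UTF-8 bytes.
utf8_query = urlencode(data)

# With encoding='iso-8859-1', "ü" becomes the single byte 0xFC instead.
latin1_query = urlencode(data, encoding='iso-8859-1')

print(utf8_query)
print(latin1_query)
```

The two query strings differ only in how "ü" is percent-encoded, which matters if the server decodes the request with one specific charset.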

NyanKiyoshi commented 8 years ago

Does it work like this? https://github.com/NyanKiyoshi/cleverbot.py/commit/4a25053f0bc295c0b4968f256202b280afe38003

  [02:43:40]  @Chocolat | !q Du bist eine mädchen?
  [02:43:41]  @WuWu     | Chocolat: ja ich bin ein Mädchen und 14 Jahre alt.

Y4kuzi commented 8 years ago

I just tested the example above and got this reply:

>>> cb.ask('Du bist eine mädchen?')
'Was bist du Junge oder MÃ¤dchen?'

Umlaut characters and characters with tildes (for example, ñ shows as Ã±) are not displayed correctly.
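The pattern reported here (ä shown as Ã¤, ñ shown as Ã±) is what UTF-8 bytes look like when they are decoded as ISO-8859-1. The sketch below reverses that for Python 3; the helper name repair_mojibake is hypothetical, and this is not necessarily how the library itself handles the problem.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as ISO-8859-1."""
    # Re-encoding as ISO-8859-1 recovers the original bytes,
    # which can then be decoded with the correct codec.
    return text.encode('iso-8859-1').decode('utf-8')


print(repair_mojibake('Was bist du Junge oder MÃ¤dchen?'))
print(repair_mojibake('Ã±'))
```

The helper should only be applied to text known to be garbled in this way: a string that is already correct, such as 'Mädchen', raises UnicodeDecodeError because its ISO-8859-1 bytes are not valid UTF-8.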

[Edit] Nvm, got it. Sorry, I'm not familiar with Github.

mrgigabyte commented 8 years ago

Y4kuzi, check the pull request; I have fixed that issue :)

folz commented 8 years ago

For those following along at home, please update to v1.1.0 on PyPI and let me know if this fixes your problem. It includes @mrgigabyte's decode-encode fix from #33.

folz / cleverbot.py

trouble with utf-8 cyrillic symbols #21