RedisJSON / redisjson-py

An extension to redis-py for using Redis' ReJSON module
https://redisjson.io
BSD 2-Clause "Simplified" License

Unicode / Encoding issues #19

Open larswise opened 5 years ago

larswise commented 5 years ago

I'm facing some encoding issues with the client. The problem occurs when using non-ASCII characters, more precisely æøåÆØÅ etc.

client.jsonset("test", Path.rootPath(), {'name': 'test111', 'items': []})

client.jsonget('test') --> {'name': 'test111', 'items': []}

client.jsonarrinsert('test', Path('.items'), 0, {'company': 'Åre', 'destination': 'ÅS', 'origin': 'LØR'})

client.jsonget('test')

This does not look correct? {"name":"test111","items":[{"company":"\u00c3\u0085re","destination":"\u00c3\u0085S","origin":"L\u00c3\u0098R"},]}

What I had expected: {"name":"test111","items":[{"company":"\u00c5re","destination":"\u00c5S","origin":"L\u00d8R"},]} or {"name":"test111","items":[{"company":"\xc5re","destination":"\xc5S","origin":"L\xd8R"},]}

If I save the items as strings, they appear to get the correct encoding, but then my array elements are turned into strings instead of objects.

If I'm doing it wrong, I'd be grateful for any tips! :)
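The escaped output reported above is consistent with each UTF-8 byte of the input being escaped as if it were its own code point. The following is a minimal sketch of that suspected mechanism in plain Python, not the module's actual code; the function name is hypothetical and no Redis server is involved.

```python
import json


def escape_utf8_bytes_as_code_points(text):
    """Mimic the suspected bug: read each UTF-8 byte as one Latin-1
    character, then JSON-escape the result as ASCII."""
    mangled = text.encode("utf-8").decode("latin-1")
    return json.dumps(mangled)  # ensure_ascii=True is the default


# What the report shows coming back from the server:
print(escape_utf8_bytes_as_code_points("Åre"))  # "\u00c3\u0085re"
# What a correct ASCII-escaped encoding looks like:
print(json.dumps("Åre"))                        # "\u00c5re"
```

If this reading is right, the original bytes are still recoverable from the escaped output, because every byte was preserved and only widened to a code point.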

bentsku commented 5 years ago

Hello!

I just tried it too and got the same result with rejson-py. I also checked with the ReJSON CLI tool, and the problem is the same there. See the attached screenshot.

[Screenshot: 2019-02-21 at 00:51:17]

I think the problem comes from the ReJSON internal encoding, and not from the Python client. Maybe you could check whether there is an open issue there, or open one, to see if they can help you?

larswise commented 5 years ago

I did manage to get around it:

In Python I am able to restore the string by encoding as follows: somevalue.encode('utf-8').decode('unicode-escape').encode('latin1').decode('utf-8')
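As a sketch, the chain above can be wrapped in a helper. It assumes the input is the raw reply text that still contains literal \uXXXX escapes. If the reply has already been parsed with json.loads, the escapes are gone and only the Latin-1 round trip is needed, so a second helper covers that case. Both function names are hypothetical.

```python
import json


def restore_escaped(raw):
    """Apply the chain above to text containing literal \\uXXXX escapes."""
    return (raw.encode("utf-8")
               .decode("unicode-escape")  # \u00c3 becomes the character U+00C3
               .encode("latin1")          # U+00C3 becomes the byte 0xC3
               .decode("utf-8"))          # bytes 0xC3 0x85 become 'Å'


def restore_parsed(value):
    """Repair strings inside an already-parsed JSON structure."""
    if isinstance(value, str):
        try:
            return value.encode("latin1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # not mangled in this way; leave untouched
    if isinstance(value, list):
        return [restore_parsed(item) for item in value]
    if isinstance(value, dict):
        return {restore_parsed(k): restore_parsed(v) for k, v in value.items()}
    return value


print(restore_escaped(r"\u00c3\u0085re"))
print(restore_parsed(json.loads(r'{"company": "\u00c3\u0085re", "origin": "L\u00c3\u0098R"}')))
```

restore_escaped also unescapes sequences such as \" and \n, so it is safer on individual values than on a whole JSON document. restore_parsed leaves correctly encoded strings alone, because they fail the UTF-8 decode step and are returned unchanged.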

and similarly in .NET after fetching with JSON.MGET

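        // Requires: using System; using System.Text.RegularExpressions;
        public static string GetEncoded(params string[] strings)
        {
            var lat1 = System.Text.Encoding.GetEncoding("iso-8859-1");
            // Match literal \uXXXX escapes in the raw reply.
            Regex rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");
            var combined = string.Join(",", strings);
            // Replace each escape with the character for that code point.
            var result = rx.Replace(combined, match => ((char)Int32.Parse(match.Value.Substring(2), System.Globalization.NumberStyles.HexNumber)).ToString());
            // Turn those characters back into single bytes, then decode the bytes as UTF-8.
            var lat1bytes = lat1.GetBytes(result);
            return System.Text.Encoding.UTF8.GetString(lat1bytes);
        }
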
mschipperheyn commented 5 years ago

Having problems as well.

JSON.SET foo . '"bãr"'
OK
JSON.GET foo .
"\"b\\u00c3\\u00a3r\""

When I remove the duplicate backslashes and decode the result, I get bãr.

bentsku commented 5 years ago

I believe the JSON.GET command now has a no-escape option that returns special characters unescaped, as mentioned in the replies to the issue linked below. Maybe we could add it as an option to the Python command? I can try to add it if wanted.

RedisJSON/RedisJSON#98
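Until the client exposes such an option, the flag could presumably be sent through redis-py's generic execute_command. The sketch below only builds the argument list. It assumes the server's JSON.GET accepts a NOESCAPE keyword placed before the path, and the helper name is hypothetical.

```python
def json_get_noescape_args(key, path="."):
    """Build the arguments for a JSON.GET call with the NOESCAPE flag.

    Assumption: the server accepts NOESCAPE before the path argument.
    """
    return ["JSON.GET", key, "NOESCAPE", path]


print(json_get_noescape_args("test"))

# Usage against a live server (not run here):
#   import json, redis
#   client = redis.Redis(decode_responses=True)
#   raw = client.execute_command(*json_get_noescape_args("test"))
#   data = json.loads(raw)
```

Setting decode_responses=True makes redis-py return str rather than bytes, so the unescaped reply can be passed straight to json.loads.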

gkorland commented 5 years ago

@bentsku if you can submit a PR, that would be great.