chtd / psycopg2cffi

Port to cffi with some speed improvements
Other
177 stars 43 forks source link

UnicodeDecodeError with UTF-8 queries #77

Open karanlyons opened 7 years ago

karanlyons commented 7 years ago
Traceback (most recent call last):
  File "/venv/pypy/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/venv/pypy/site-packages/django/db/models/query.py", line 449, in bulk_create
    self._batched_insert(objs_with_pk, fields, batch_size)
  File "/venv/pypy/site-packages/django/db/models/query.py", line 1062, in _batched_insert
    inserted_id = self._insert(item, fields=fields, using=self.db, return_id=True)
  File "/venv/pypy/site-packages/django/db/models/query.py", line 1045, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/venv/pypy/site-packages/django/db/models/sql/compiler.py", line 1054, in execute_sql
    cursor.execute(sql, params)
  File "/venv/pypy/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/venv/pypy/site-packages/psycopg2cffi/_impl/cursor.py", line 30, in check_closed_
    return func(self, *args, **kwargs)
  File "/venv/pypy/site-packages/psycopg2cffi/_impl/cursor.py", line 244, in execute
    self._query = _combine_cmd_params(query, parameters, conn)
  File "/venv/pypy/site-packages/psycopg2cffi/_impl/cursor.py", line 1002, in _combine_cmd_params
    return b''.join(parts)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)

This might be a personal issue, but it seems like there may still be some lingering encoding issues with this package. I see that there’s some work done in this function from #57, but it’s only under python 3.

I’m getting this error in 2.7 with unicode_literals.

lopuhin commented 7 years ago

Thanks for the report! I don't think it's a personal issue, since it's Django that is creating the query. What Django version are you using?

karanlyons commented 7 years ago

Django 1.10, though I can likely test this in some other versions as well.

I’m fairly confident that just having _combine_cmd_params encode the strings in all versions of Python will do the trick, but I haven’t had a chance to sit down and debug it yet.

karanlyons commented 7 years ago

Okay, the string causing issues is 'Club La Unión'. Since this isn’t in Py3 the strings don’t get encoded. Removing the if six.PY3 checks is enough to fix the problem.

I think what we’re trying to do here is encode the string if it’s unicode, right? If so, just checking against six.text_type is enough for both versions of Python, since in 2.7 that’s unicode, and str in 3.

So if that all sounds right, I think a two line pull request should fix our problems :smile:

lopuhin commented 7 years ago

@karanlyons a PR that fixes an issue for you is more then welcome! Even better if it comes with tests, but if not I'll add them. I'm not sure yet which six.PY3 check you mean, but I haven't debugged the issue yet.

karanlyons commented 7 years ago

Sure, give me a little bit to write the tests and I’ll submit a PR we can look at.

karanlyons commented 7 years ago

Alright, worryingly I’m having a hard time reproducing this in a test. The way this query is generated in my Django codebase is...questionable, but the error also happens with Django’s bulk_create. Still, I can’t get it to throw an error in either CPython or PyPy once I start reducing it down.

I’m not yet convinced that this isn’t actually my fault.

karanlyons commented 7 years ago

Okay, so the issue turned out to be completely different than I thought:

Django has a DateTimeRangeField in django.contrib.postgres.fields but it doesn’t support setting custom bounds. So I had a subclass that allowed me to set bounds for a field and have it instantiate DateTimeTZRange with those bounds. Basically, this:

from __future__ import unicode_literals
from psycopg2.extras import DateTimeTZRange
from django.contrib.postgres.fields import DateTimeRangeField

class DateTimeRangeFieldWithBounds(DateTimeRangeField):
    range_type = DateTimeTZRange

    def __init__(self, *args, **kwargs):
        self.default_bounds = kwargs.pop('default_bounds', '[)')
        super(DateTimeRangeFieldWithBounds, self).__init__(*args, **kwargs)
    ...

unicode_literals means that that the bounds string is actually unicode, which meant (for reasons I haven’t sussed out yet) that the final tstzrange string that made its way into _combine_cmd_params was unicode instead of bytes/str.

Making the bounds kwarg bytes instead of unicode fixes everything in PyPy. In CPython there’s no problem with either.

And...I’m not sure where this leaves us. There definitely is a difference between what happens on CPython with psycopg2 and PyPy with psycopg2cffi, but I’m not totally certain where in that large variable surface the problem actually lies.

lopuhin commented 7 years ago

Thanks for digging into it @karanlyons ! I think I finally understand the issue and what six.PY3 check you were talking about. Looks like in some cases https://github.com/chtd/psycopg2cffi/blob/master/psycopg2cffi/_impl/adapters.py#L288 can return unicode under python 2 as well. I need to check with psycopg2 - if conversion to unicode should always happen in the adapters, I'll fix it there, and if not - then we can just remove six.PY3 check as you suggested.

karanlyons commented 7 years ago

Yeah, I think just removing the six.PY3 checks in _combine_cmd_params should be fine, but we may also want to make sure the return types for adapters have parity between psycopg2 and psycopg2cffi for good measure (especially in case someone’s using these “internal” functions as opposed to just working through the higher level API).

lopuhin commented 7 years ago

@karanlyons hm, I just checked how _combine_cmd_params works with unicode both in CPython 2.7 and in PyPy, and I could not reproduce it yet. This is what I tried:

In [1]: from psycopg2cffi import connect
In [2]: c = connect('host=localhost')
In [3]: from psycopg2cffi._impl.cursor import _combine_cmd_params
In [4]: _combine_cmd_params('select %(foo)s from x', {'foo': u'ё'}, c)
Out[4]: "select '\xd1\x91' from x"

Can you check the arguments (and their types) that cause _combine_cmd_params(query, parameters, conn) to fail?

karanlyons commented 7 years ago
import psycopg2cffi
from psycopg2cffi.extras import DateTimeTZRange
from psycopg2cffi._impl.cursor import _combine_cmd_params
from datetime import datetime

conn = psycopg2cffi.connect("dbname='psycopg2_test' user='postgres' host='localhost'")

ascii_string = 'Test'
unicode_string = u'T\xe9st'
ascii_range = DateTimeTZRange(datetime.now(), datetime.now(), '[]')
unicode_range = DateTimeTZRange(datetime.now(), datetime.now(), u'[]')

# Works
_combine_cmd_params('%s', [ascii_string], conn)
_combine_cmd_params('%s', [unicode_string], conn)
_combine_cmd_params('%s', [ascii_range], conn)
_combine_cmd_params('%s', [unicode_range], conn)

_combine_cmd_params('%s %s', [ascii_string, ascii_range], conn)
_combine_cmd_params('%s %s', [unicode_string, ascii_range], conn)
_combine_cmd_params('%s %s', [ascii_range, unicode_range], conn)

# Fails
_combine_cmd_params('%s %s', [unicode_string, unicode_range], conn)

Please tell me that last one fails for you too. This bug is either more complex than I thought or I’m insane, and I’m really hoping for the former.

lopuhin commented 7 years ago

Yes, that does fail for me, thanks a lot for building a reproducer! Technically this is a bug in how Range objects are adapted, but the actual issue is deeper - b helper defined here https://github.com/chtd/psycopg2cffi/blob/master/psycopg2cffi/extensions.py#L69 is not what you would expect given it's name :) Ideally I should get rid of it - luckily it's used only in _range and in tests, so it should not touch too much code.