divio / aldryn-search

Haystack 2.0 search index for django CMS

UnicodeDecodeError with a CMS page body #74

Closed marksweb closed 7 years ago

marksweb commented 7 years ago

I've got an issue updating the index built from TitleIndex: a page body containing non-ASCII characters raises a UnicodeDecodeError when the documents are sent to Elasticsearch.

Traceback (most recent call last):
  File "/Applications/PyCharm 2.app/Contents/helpers/pydev/pydevd.py", line 1580, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm 2.app/Contents/helpers/pydev/pydevd.py", line 964, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Users/mwalker/Sites/myproj/myproj/manage.py", line 22, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/django/core/management/base.py", line 294, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/django/core/management/base.py", line 345, in execute
    output = self.handle(*args, **options)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/haystack/management/commands/update_index.py", line 214, in handle
    self.update_backend(label, using)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/haystack/management/commands/update_index.py", line 257, in update_backend
    commit=self.commit, max_retries=self.max_retries)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/haystack/management/commands/update_index.py", line 84, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/haystack/backends/elasticsearch_backend.py", line 190, in update
    bulk(self.conn, prepped_docs, index=self.index_name, doc_type='modelresult')
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 190, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/elasticsearch/client/__init__.py", line 785, in bulk
    doc_type, '_bulk'), params=params, body=self._bulk_body(body))
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/elasticsearch/transport.py", line 327, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/elasticsearch/connection/http_requests.py", line 68, in perform_request
    response = self.session.request(method, url, data=body, timeout=timeout or self.timeout)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/requests/adapters.py", line 423, in send
    timeout=timeout
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Users/mwalker/Sites/myproj/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 356, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1053, in request
    self._send_request(method, url, body, headers)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1093, in _send_request
    self.endheaders(body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 891, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1815: ordinal not in range(128)
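
For anyone reading the traceback: the last frame, `msg += message_body`, fails with a decode error, which in Python 2 suggests that `msg` is a unicode string while the body is a byte string, so the body gets decoded implicitly with the ASCII codec. The sketch below reproduces the same failure with an explicit `decode("ascii")`. The payload and the helper name `first_non_ascii` are made up for illustration and are not part of httplib, Haystack or aldryn-search.

```python
# -*- coding: utf-8 -*-
# Hypothetical payload: ASCII text followed by a UTF-8 encoded en dash,
# whose first byte is 0xe2 (the byte named in the traceback).
body = b"page body " + u"\u2013".encode("utf-8")


def first_non_ascii(data):
    """Return the offset of the first byte outside the ASCII range, or None."""
    try:
        # Explicit version of the implicit decode that Python 2 performs
        # when a byte string is appended to a unicode string.
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


offset = first_non_ascii(body)
print(offset, repr(body[offset:offset + 3]))
```

The reported offset plays the same role as "position 1815" in the error above: it is the first byte of the first non-ASCII character in the request body.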

I'm using:

aldryn-search==0.3.0
Django==1.10.7
django-cms==3.4.3
django-haystack==2.6.0
elasticsearch==2.4.1
requests==2.13.0
requests-aws4auth==0.9

My settings:

import elasticsearch
from requests_aws4auth import AWS4Auth

awsauth = AWS4Auth(TBH_AWS_ACCESS_KEY, TBH_AWS_SECRET_KEY, AWS_S3_REGION, 'es')
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch2_backend.Elasticsearch2SearchEngine',
        'URL': 'https://search.eu-west-1.es.amazonaws.com/',
        'INDEX_NAME': 'myproj_dev',
        # The default 10 seconds is typically not enough!
        'TIMEOUT': 30,
        'KWARGS': {
            'port': 443,
            'http_auth': awsauth,
            'use_ssl': True,
            'verify_certs': True,
            'connection_class': elasticsearch.RequestsHttpConnection,
        }
    },
}

When execution reaches the following block in urllib3's connectionpool.py (_make_request):

        # conn.request() calls httplib.*.request, not the method in
        # urllib3.request. It also calls makefile (recv) on the socket.
        if chunked:
            conn.request_chunked(method, url, **httplib_request_kw)
        else:
            conn.request(method, url, **httplib_request_kw)

I can pull the following out of the request body:

>>> httplib_request_kw['body'][1815]
'\xe2'
>>> httplib_request_kw['body'].decode('utf-8')[1815]
u'-'
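
To identify exactly which character sits at that offset, a small helper along these lines could be run against the request body in the debugger. This is only a sketch: `describe_byte_offset` and the sample string are hypothetical, and it assumes the body is valid UTF-8 and that the offset points at the first byte of a character, which holds for the position reported by the ASCII codec.

```python
# -*- coding: utf-8 -*-
import unicodedata


def describe_byte_offset(body, offset, context=20):
    """Decode a UTF-8 byte string and describe the character starting at `offset`."""
    # Decode the prefix on its own so the byte offset maps to a character index.
    char_index = len(body[:offset].decode("utf-8"))
    text = body.decode("utf-8")
    char = text[char_index]
    return {
        "char": char,
        "codepoint": "U+%04X" % ord(char),
        "name": unicodedata.name(char, "UNKNOWN"),
        "context": text[max(0, char_index - context):char_index + context],
    }


# Hypothetical sample: the en dash starts at byte offset 15.
sample = u"Opening hours 9\u20135".encode("utf-8")
info = describe_byte_offset(sample, 15)
print(info["codepoint"], info["name"])
```

With the real request body, the call would be `describe_byte_offset(httplib_request_kw['body'], 1815)`, and the `context` entry shows the surrounding page text so the offending CMS plugin content can be found.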

Is there anything that can be done in the search indexes to avoid this?