Unicode characters on reports throw UnicodeEncodeError

FodT commented 8 years ago

CSVReader seems to choke on non-ASCII characters in report data:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 88: ordinal not in range(128)
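The same kind of error can be reproduced outside the library. This is a minimal sketch, assuming a hypothetical report cell that contains an en dash (U+2013); the `text` value is made up for illustration and is not real report data:

```python
# Hypothetical cell value containing an en dash (U+2013), a non-ASCII character.
text = u'Q1 \u2013 Q2'

error = None
try:
    # Encoding with the ASCII codec fails on any code point above 127.
    # Python 2 performs this kind of implicit encode when unicode is
    # passed somewhere a byte string is expected.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc

# Encoding as UTF-8 succeeds and yields a byte string.
encoded = text.encode('utf-8')
```

Here `error` holds the `UnicodeEncodeError` and `encoded` holds the UTF-8 bytes, which is what the fix proposed in this issue feeds to the csv reader.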

Possible fix:

Add a utf_8 encoding function:

    def utf_8_encoder(self, unicode_csv_data):
        # Re-encode each decoded line as UTF-8 bytes before it reaches csv.
        for line in unicode_csv_data:
            yield line.encode('utf-8')

Then change the get function to encode the line iterator before creating the csv reader object:

    def get(self, as_dict=False):
        """Get report data. Returns tuple (headers, csv.Reader).

        If as_dict == True, return (headers, csv.DictReader).
        """
        if not hasattr(self, 'report'):
            raise ClientError("Can't run get without report!")

        params = {}
        for key, value in six.iteritems(self.parameters):
            if self._fields[key]:
                params[key] = self._fields[key](value)
            else:
                params[key] = value

        iter_ = self._get(self.report,
                          params=params).iter_lines(decode_unicode=True)

        if as_dict:
            reader = csv.DictReader(self.utf_8_encoder(iter_))
            headers = reader.fieldnames
        else:
            # was: reader = csv.reader(iter_)
            reader = csv.reader(self.utf_8_encoder(iter_))
            headers = next(reader)
        return headers, reader

MatiasSMd commented 8 years ago

This issue affects us too. The proposed solution seems to work in our case. This should be marked as urgent/critical since it breaks the functionality.

pswaminathan commented 8 years ago

Can you give me an example of where this is breaking? My memory is slightly hazy, but I recall iter_lines(decode_unicode=True) was there for that reason. Was this just wrong? I'll try and reproduce.

MatiasSMd commented 8 years ago

In my case it breaks when decoding some of the data. Because of the unzipping and chunk handling of the responses, it was difficult for me to catch the exact data that fails to decode. I think the problem arises in models.Response.iter_content.

If you give me a hint about where to dig further, I can help you spot the exact place.

pswaminathan commented 8 years ago

I'm more looking for sample code and Python version. Happy to do the digging :smile:

MatiasSMd commented 8 years ago

In fact, the problem seems to be the actual decoding. If I set:

    iter_ = self._get(self.report,
                      params=params).iter_lines(decode_unicode=False)

it works (and that is why the proposed re-encoding seems to fix the issue).
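For reference, my understanding is that the Python 2 csv module does not support unicode input and expects byte strings, while the Python 3 csv module expects text. A minimal sketch of handling both, assuming the lines arrive already decoded; `read_csv_lines` is a hypothetical helper for illustration, not part of t1-python:

```python
import csv
import sys


def read_csv_lines(lines):
    """Return a csv.reader over an iterable of decoded (text) lines.

    Hypothetical helper, not part of the library.
    """
    if sys.version_info[0] == 2:
        # Assumption: Python 2's csv module wants byte strings, so encode
        # each line as UTF-8 first. Cells then come back as UTF-8 bytes.
        lines = (line.encode('utf-8') for line in lines)
    # On Python 3, csv works on text directly, so no re-encoding is needed.
    return csv.reader(lines)


# Made-up sample lines; the second row contains an en dash (U+2013).
sample = [u'name,value', u'Q1 \u2013 Q2,1']
rows = list(read_csv_lines(sample))
```

On Python 3 the cells in `rows` are ordinary text; on Python 2 they would be UTF-8 encoded byte strings that the caller has to decode.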

pswaminathan commented 8 years ago

Yeah I figured as much. What version of Python are you using? Let me dig into why I added the decode_unicode arg in the first place. I have a feeling it's a 2/3 difference.

MatiasSMd commented 8 years ago

The problem is that the code is not what is causing the issue, but the data received. And I can't share the data (but I guess any unicode char would do the trick).

I'm using Python 2.7, BTW.

MatiasSMd commented 8 years ago

LOL I guessed the same, it looks like a Python 3 adaptation issue :p (or a Python < 3 adaptation issue, depending on which was used first)

pswaminathan commented 8 years ago

@MatiasSMd this should work for you now. Let us know if you run into any more issues—thanks!

MediaMath / t1-python

Unicode characters on reports throw UnicodeEncodeError #63