DocNow / diffengine

track changes to the news, where news is anything with an RSS feed
MIT License

UnicodeEncodeError being raised by calls to logging.info #24

Closed (ryanfb closed this issue 7 years ago)

ryanfb commented 7 years ago
UnicodeEncodeError: 'ascii' codec can't encode character '\u279c' in position 280: ordinal not in range(128)
Call stack:
  File "/usr/local/bin/diffengine", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 464, in main
    tweet_diff(version.diff, f['twitter'])
  File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 420, in tweet_diff
    logging.info("tweeted %s", status)
Message: 'tweeted %s'
Arguments: ('Trump wants good relationship with Russia, May says sanctions should stay | Reuters https://wayback.archive.org/web/20170127111722/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews \u279c https://wayback.archive.org/web/20170127193013/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews',)

Should we explicitly call status.encode('utf-8') before logging? See http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 for background. I've also set LC_ALL='en_US.utf8' in my crontab, as suggested by another answer there, to see whether that fixes it.
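To make the status.encode('utf-8') question concrete, here is a minimal sketch of what the logging module would render in each case. The helper name rendered_log_message is made up for illustration and is not part of diffengine. Under Python 3, passing bytes as a %s argument produces the bytes repr, so the record becomes pure ASCII (which would sidestep the error) but the log then shows escaped bytes rather than the arrow.

```python
import logging


def rendered_log_message(template, arg):
    """Return the text logging would produce for logging.info(template, arg).

    Hypothetical helper for illustration; it builds a LogRecord directly
    instead of going through a handler, so no encoding step is involved.
    """
    record = logging.LogRecord(
        name="demo", level=logging.INFO, pathname=__file__, lineno=0,
        msg=template, args=(arg,), exc_info=None)
    return record.getMessage()


if __name__ == "__main__":
    status = "old \u279c new"
    # str argument: the arrow survives and must be encoded by the handler.
    print(ascii(rendered_log_message("tweeted %s", status)))
    # bytes argument: the message contains the repr of the bytes object.
    print(rendered_log_message("tweeted %s", status.encode("utf-8")))
```

So encoding before logging changes what gets logged rather than how the handler encodes it; the encoding used by the handler's output stream is the part that actually decides whether the arrow can be written.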

ryanfb commented 7 years ago

For what it's worth, I tried explicitly encoding to UTF-8 (or to ASCII with the 'ignore' error handler) in several places in the code and couldn't resolve this that way, though I'm not all that familiar with Python string encoding issues. Setting LC_ALL='en_US.UTF-8' (not LC_ALL='en_US.utf8') in my crontab seems to have resolved this for me.
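A likely explanation for why the locale matters: when a log handler's stream is opened without an explicit encoding, Python falls back to the locale's preferred encoding, which is ASCII under a bare C/POSIX locale such as cron often provides. The sketch below is only an illustration under that assumption. It does not use diffengine's actual logging setup, and log_with_encoding is a made-up helper. It writes the same record through a FileHandler opened as ASCII and as UTF-8 and reports whether the write succeeded.

```python
import logging
import os
import tempfile

ARROW = "\u279c"  # the character named in the traceback


def log_with_encoding(path, encoding, message):
    """Log one INFO record to `path` using a FileHandler with `encoding`.

    Returns True if the record was written, False if the handler hit an
    error (for example UnicodeEncodeError). Hypothetical helper.
    """
    handler = logging.FileHandler(path, encoding=encoding)
    failures = []
    # logging normally reports handler errors on stderr ("Call stack: ...")
    # instead of raising; record them here so the caller can see the outcome.
    handler.handleError = lambda record: failures.append(record)

    logger = logging.getLogger("encoding_demo_" + encoding)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        logger.info("tweeted %s", message)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return not failures


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    text = "before " + ARROW + " after"
    for enc in ("ascii", "utf-8"):
        ok = log_with_encoding(os.path.join(tmp, enc + ".log"), enc, text)
        print(enc, "written" if ok else "failed to encode")
```

Passing encoding='utf-8' to the handler explicitly removes the dependency on LC_ALL, which is an alternative to fixing the environment in the crontab.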

jeremylow commented 7 years ago

I couldn't reproduce this on my system, though there may be a setting lurking somewhere that masks the issue for me in particular.

Here are the settings that I use for my loggers, which seem to work fine across my various servers:

# Somewhere in utilities.py
import logging
import logging.handlers


def set_up_logging(log_file=None, level=logging.INFO, source=__name__):
    """Convenience function to set up a reasonable logger"""
    logger = logging.getLogger(source)
    logger.setLevel(level)

    # Rotate at 1 MiB and keep 5 backups; always write UTF-8, whatever the locale.
    # Note: log_file must be a real path; RotatingFileHandler rejects None.
    handler = logging.handlers.RotatingFileHandler(
        filename=log_file,
        maxBytes=1048576,
        backupCount=5,
        encoding='utf8',)
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(funcName)s() - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)

    return logger

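and then I'll call logger = utilities.set_up_logging(log_file='devel.log', level=logging.DEBUG, source=__name__) at the top of whatever file I'm working with. (The level has to be a logging constant or an upper-case name such as 'DEBUG'; setLevel rejects lower-case 'debug'.)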

It may be worthwhile to explicitly set the shebang (since python3 is required) and encoding of the __init__.py file with:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

as well, since there are Unicode characters in the source file itself.

edsu commented 7 years ago

I have never seen this. If it were a general problem, you would see it every time a tweet is sent, since \u279c is the arrow between the before and after Internet Archive URLs that are part of every tweet.