dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

'pip show khmer' fails due to author unicode #1565

Open ctb opened 7 years ago

ctb commented 7 years ago
% pip show khmer
Name: khmer
Version: 2.0+715.gd841a57
Summary: khmer k-mer counting library
Home-page: https://khmer.readthedocs.io/ 
--- Logging error ---
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/logging/__init__.py", line 982, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 277: ordinal not in range(128)
Call stack:
  File "/Users/t/dev/jup/bin/pip", line 11, in <module>
    sys.exit(main())
  File "/Users/t/dev/jup/lib/python3.5/site-packages/pip/__init__.py", line 233, in main
    return command.main(cmd_args)
  File "/Users/t/dev/jup/lib/python3.5/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Users/t/dev/jup/lib/python3.5/site-packages/pip/commands/show.py", line 42, in run
    results, list_files=options.files, verbose=options.verbose):
  File "/Users/t/dev/jup/lib/python3.5/site-packages/pip/commands/show.py", line 133, in print_results
    logger.info("Author: %s", dist.get('author', ''))
Message: 'Author: %s'
Arguments: ("Michael R. Crusoe, Hussien F. Alameldin, Sherine Awad, Elmar Bucher, Adam Caldwell, Reed Cartwright, Amanda Charbonneau, Bede Constantinides, Greg Edvenson, Scott Fay, Jacob Fenton, Thomas Fenzl, Jordan Fish, Leonor Garcia-Gutierrez, Phillip Garland, Jonathan Gluck, Iv\xe1n Gonz\xe1lez, Sarah Guermond, Jiarong Guo, Aditi Gupta, Joshua R. Herr, Adina Howe, Alex Hyer, Andreas H\xe4rpfer, Luiz Irber, Rhys Kidd, David Lin, Justin Lippi, Tamer Mansour, Pamela McA'Nulty, Eric McDonald, Jessica Mizzi, Kevin D. Murray, Joshua R. Nahum, Kaben Nanlohy, Alexander Johan Nederbragt, Humberto Ortiz-Zuazaga, Jeramia Ory, Jason Pell, Charles Pepe-Ranney, Zachary N Russ, Erich Schwarz, Camille Scott, Josiah Seaman, Scott Sievert, Jared Simpson, Connor T. Skennerton, James Spencer, Ramakrishnan Srinivasan, Daniel Standage, James A. Stapleton, Joe Stein, Susan R Steinman, Benjamin Taylor, Will Trimble, Heather L. Wiencko, Michael Wright, Brian Wyss, Qingpeng Zhang, en zyme, C. Titus Brown",)
Author-email: khmer-project@idyll.org
License: UNKNOWN
Location: /Users/t/dev/khmer
Requires: screed, bz2file
betatim commented 7 years ago

What does echo $LANG or locale say for you? On a machine with LANG=en_US.utf-8 it works for me.

betatim commented 7 years ago

Can't reproduce this locally.

betatim commented 7 years ago

bump @ctb can you tell us your $LANG?

ctb commented 7 years ago
% pip install https://github.com/dib-lab/khmer/archive/master.zip
...
% pip show khmer
...
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/logging/__init__.py", line 982, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 277: ordinal not in range(128)
...
% echo $LANG
en_US.UTF-8
betatim commented 7 years ago

I'm a bit stumped by this :-/ We have non-ASCII characters in the list of contributors, and encoding them as ASCII (unsurprisingly) doesn't work. I tried to find out how Python determines the default encoding to use, because I thought it looked at $LANG and friends (but evidently not).

What do these two say:

$ python -c 'import sys; print(sys.getdefaultencoding())'
$ python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
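
The two one-liners above can be folded into one small script that also reports the locale's preferred encoding. This is only a sketch for gathering the same diagnostics in one place; the helper name `describe_encodings` is made up for illustration.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for str, the locale, and the standard streams."""
    return {
        # Encoding used for implicit str/bytes conversion; always utf-8 on Python 3.
        "default": sys.getdefaultencoding(),
        # Encoding derived from the locale settings (LC_ALL, LC_CTYPE, LANG).
        "locale": locale.getpreferredencoding(False),
        # Encodings of the standard streams; these decide what print() can emit.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print("{}: {}".format(name, value))
```

A "default" of utf-8 next to an ASCII "stdout" would point at the stream encoding, not at Python's internal string handling.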

Some guesses:

ctb commented 7 years ago
% python -c 'import sys; print(sys.getdefaultencoding())'
utf-8
% python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
US-ASCII US-ASCII
betatim commented 7 years ago

The value of sys.stdout.encoding isn't set by LANG but by LC_CTYPE, or at least it is also influenced by LC_CTYPE. So you can predict my question: what is it set to for you? :)

More generally, I'm not sure how we/khmer can fix this. If a user legitimately has their terminal's encoding set to ASCII, then pip cannot print the authors of khmer. Is this a bug in pip or a case of a human asking a computer to do the impossible? I don't have a better idea other than the suboptimal "mutilating people's names so they don't contain non-ASCII characters", or we leave it as won't fix.
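
The failure can be reproduced without touching the terminal settings. The sketch below writes one of the author names from the traceback to an in-memory stream that stands in for a terminal with a given encoding; `write_to_stream` is a hypothetical helper, not something pip provides.

```python
import io

AUTHOR = "Iv\xe1n Gonz\xe1lez"  # one of the names from the traceback above


def write_to_stream(text, encoding, errors="strict"):
    """Write text to an in-memory stream that encodes like a terminal would."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


if __name__ == "__main__":
    # A UTF-8 stream accepts the name as-is.
    print(write_to_stream(AUTHOR, "utf-8"))

    # A strict ASCII stream raises the same kind of error pip's logger hit.
    try:
        write_to_stream(AUTHOR, "ascii")
    except UnicodeEncodeError as exc:
        print("strict ASCII failed:", exc)

    # A lenient error handler trades the crash for escaped characters.
    print(write_to_stream(AUTHOR, "ascii", errors="backslashreplace"))
```

The last call shows a middle ground between crashing and rewriting the names: the output is ugly, but nothing is lost and nothing raises.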

ctb commented 7 years ago

On Tue, Apr 18, 2017 at 01:20:11AM -0700, Tim Head wrote:

> The value of sys.stdout.encoding isn't set by LANG but by LC_CTYPE, or at least it is also influenced by LC_CTYPE. So you can predict my question: what is it set to for you? :)

% echo $LC_CTYPE
C

so that's it -- and indeed when I fix that, it all works.

> More generally, I'm not sure how we/khmer can fix this. If a user legitimately has their terminal's encoding set to ASCII, then pip cannot print the authors of khmer. Is this a bug in pip or a case of a human asking a computer to do the impossible? I don't have a better idea other than the suboptimal "mutilating people's names so they don't contain non-ASCII characters", or we leave it as won't fix.

I think #wontfix is fine and will set it accordingly. But it's nice to have this in the issue tracker ;).

betatim commented 7 years ago

👍

ctb commented 7 years ago

This is now cropping up in all 'setup.py' executions -- I've had to run

export LC_CTYPE=utf-8

to build khmer v2.1.
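
The export above changes the locale for the whole shell session. A narrower knob, sketched here as an assumption and not something tried in this thread, is Python's `PYTHONIOENCODING` environment variable, which overrides the encoding of stdin/stdout/stderr for a single process. It does not change the default encoding that `open()` uses, so it may not cover every `setup.py` failure. The helper name `child_stdout_encoding` is made up for illustration.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(extra_env):
    """Start a child Python with extra environment variables and report its stdout encoding."""
    env = dict(os.environ, **extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # Normalize spellings such as "UTF-8" and "utf8" to the canonical codec name.
    return codecs.lookup(result.stdout.strip()).name


if __name__ == "__main__":
    print(child_stdout_encoding({"PYTHONIOENCODING": "ascii"}))
    print(child_stdout_encoding({"PYTHONIOENCODING": "utf-8"}))
```

Because the override applies per process, it can be set for a single command without altering the rest of the environment.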

wltrimbl commented 7 years ago

I think this is a problem with pip. authors has the str type, which is the right type in Python 3, but pip can't handle it.

standage commented 6 years ago

I think @wltrimbl is right here. I'm starting to connect some dots here.

I spent an inordinate amount of time last night troubleshooting a Docker build issue for kevlar. I could get khmer[1] and kevlar to install just fine using the Python 2.7 toolscape, but when it came to actually running kevlar it would fail, since it dropped 2.7 support at the same time khmer did. Once I changed the Docker build config to the Python 3.x toolscape (python3-dev, pip3, etc.), it would fail at the khmer installation step, citing the same ASCII encoding error. Searching the interwebs for solutions brought up many people experiencing similar problems. Some problems are claimed to be fixed in the still-to-be-released pip v10, but even installing that did not work for khmer.

The (currently disabled) Docker build that runs with our CI also uses the Python 2.7 toolscape. This fails with the latest master, as would be expected since we dropped 2.x support several months ago. Updating the Docker config to the Python 3.5 toolscape raises the same ASCII encoding error as before.

So:

Sad to say, the easiest solution will probably be to do a (hopefully temporary) projection of the author names onto ASCII space until the pip issues are ironed out.
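
One way such a projection could look, as a rough sketch and not the change that was actually made to khmer: decompose each name with Unicode normalization and drop whatever does not fit in ASCII. The function name `to_ascii` is hypothetical.

```python
import unicodedata


def to_ascii(name):
    """Return an ASCII approximation of name by stripping accents."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding with "ignore" silently drops the combining marks, along with
    # any other character that has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    for author in ("Iv\xe1n Gonz\xe1lez", "Andreas H\xe4rpfer", "C. Titus Brown"):
        print(to_ascii(author))
```

The projection is lossy: letters with no decomposition are removed outright instead of being transliterated, so the result should be checked by eye before it goes into package metadata.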


[1] It turns out I was using an older version of khmer that still supported Python 2.7. The latest master will not install successfully using pip2.