ejwa / gitinspector

The statistical analysis tool for git repositories
GNU General Public License v3.0

UTF-8 characters in authors leads to a crash #161

Open CFoltin opened 6 years ago

CFoltin commented 6 years ago

Hi,

We had author names containing German letters such as 'ö'. They crashed the application on text output.

To correct the bug, I've changed line 151 in changesoutput.py to the following line:

            print(str(i.encode(sys.stdout.encoding, errors='replace')).ljust(20), end=" ")

and imported sys.

HTH, Chris

adam-waldenberg commented 6 years ago

Hi.

What encoding is the terminal using? What version of Python are you using? What exception are you getting?

You are doing errors="replace" here which works around the problem and doesn't really "solve" it. I suspect your terminal is not really set to UTF-8 and this is the actual reason for your issues.
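To illustrate the difference, here is a minimal Python 3 sketch (not code from gitinspector; the author name is made up) of what errors="replace" does when the target encoding lacks a character. It assumes an ISO-8859-1 target, which can represent 'ö' but not 'ğ':

```python
# Hypothetical author name "Jörg Dağ", written with escapes so the
# snippet does not depend on the source file encoding.
name = "J\u00f6rg Da\u011f"

# Strict encoding (the default) raises on the unsupported character.
try:
    name.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="replace" substitutes '?' instead of raising, so the
# original character is lost rather than displayed.
replaced = name.encode("iso-8859-1", errors="replace")

print(strict_ok)
print(replaced)
```

The crash goes away, but the printed name is no longer the name stored in the repository, which is why this counts as a workaround rather than a fix.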

CFoltin commented 6 years ago

Hi, sorry for late reply.

Encoding:

$ python -c "import sys; print(sys.stdout.encoding)"
ISO-8859-1

Python version:

$ python --version
Python 2.7.6

Error message:

Traceback (most recent call last):
  File "/home/user/Downloads/gitinspector-master/gitinspector.py", line 24, in <module>
    gitinspector.main()
  File "/home/user/Downloads/gitinspector-master/gitinspector/gitinspector.py", line 206, in main
    run.process(repos)
  File "/home/user/Downloads/gitinspector-master/gitinspector/gitinspector.py", line 83, in process
    outputable.output(ChangesOutput(summed_changes))
  File "/home/user/Downloads/gitinspector-master/gitinspector/output/outputable.py", line 43, in output
    outputable.output_text()
  File "/home/user/Downloads/gitinspector-master/gitinspector/output/changesoutput.py", line 151, in output_text
    print(terminal.ljust(i, 20)[0:20 - terminal.get_excess_column_count(i)], end=" ")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\ufffd' in position 2: ordinal not in range(256)

With the following settings the wrong character is displayed, but it no longer crashes:

$ export LC_ALL=de_DE.utf8
$ export LANG="$LC_ALL"

Thanks, Chris

amorphius commented 6 years ago

I have the same issue, even though I have a UTF-8 locale:

$ gitinspector --version
...
...
    raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8

M3kH commented 6 years ago

I would say this could solve it.

Add to ~/.bash_profile

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

or whatever locale you prefer. This seems to be a generic issue with Python on Mac.

adam-waldenberg commented 6 years ago

@CFoltin Hi.

Sorry. I forgot about this issue. In any case - it's completely normal. If your terminal can't support the character, python has no way of outputting it.

The export should do the trick though. But maybe something is still not set to UTF-8. You can try setting PYTHONIOENCODING to utf8 or redirecting to a file - in which case these problems should never occur.
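As a rough way to check the effect of that environment variable, the following Python 3 sketch (an illustration only, not part of gitinspector) starts a child interpreter with PYTHONIOENCODING set and asks it which encoding it uses for stdout:

```python
import codecs
import os
import subprocess
import sys

# Copy the current environment and force the I/O encoding for the child.
env = dict(os.environ, PYTHONIOENCODING="utf8")

# Ask a child interpreter which encoding it picked for stdout.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
reported = result.stdout.strip()

# Normalise the codec name so "utf8" and "utf-8" compare equal.
child_encoding = codecs.lookup(reported).name
print(child_encoding)
```

If the normalised name is not utf-8, something else in the environment is overriding the setting.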

devcurmudgeon commented 6 years ago

Ironically I just hit this same issue today. I don't think it's "invalid", tbh.

imo gitinspector shouldn't crash just because it hits an odd character in git metadata...

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/blame.py", line 113, in run
    self.handle_blamechunk_content(row)
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/blame.py", line 81, in handle_blamechunk_content
    author = self.changes.get_latest_author_by_email(self.blamechunk_email)
  File "/usr/local/lib/python2.7/dist-packages/gitinspector/changes.py", line 186, in get_latest_author_by_email
    name = name.decode("unicode_escape", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0153' in position 15: ordinal not in range(128)

adam-waldenberg commented 6 years ago

@devcurmudgeon

There are only a few options here.

  1. Ignore it and replace all characters that can not be outputted - this is something I'd rather not do, as the output will be invalid.
  2. Catch it and print out exactly the same error message ;)... Or a similar one. No point in that.
  3. Catch it, inform the user and output it with invalid characters replaced. Problem is that this will garble the output in the terminal (even if you use stderr for the warning).

This has been discussed so many times before (and not only in this project, mind you). I think it's better to just leave it, as these exceptions are very informative in Python - it's also a common issue. If you plan on outputting Unicode characters, you had best have a terminal set up to handle them. In your case, it's configured for ASCII output and you are trying to output an œ character.

We actually have the following function,

https://github.com/ejwa/gitinspector/blob/6d77989e341e043c9a7f09757000d75701b32d84/gitinspector/terminal.py#L128

This warns on mis-configured terminals that return "None" as encoding. However, it does not warn on ascii. Ascii may actually be OK if you happen to have a repo that only outputs standard ascii characters when you run gitinspector.
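A minimal Python 3 sketch of that distinction follows. The helper name can_encode is hypothetical and is not the function linked above; it only shows why an ASCII terminal is fine for some repositories and not for others:

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text fits in encoding.

    Hypothetical helper for illustration; not part of gitinspector.
    """
    if encoding is None:
        # Mirrors the mis-configured terminal case described above.
        return False
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # Unsupported character, or an encoding name Python does not know.
        return False
    return True


# Plain ASCII names are fine on an ASCII terminal...
print(can_encode("Chris", "ascii"))
# ...but a name containing U+0153 (the oe ligature) is not.
print(can_encode("\u0153", "ascii"))
# Result for the current terminal depends on the environment.
print(can_encode("\u0153", sys.stdout.encoding))
```

A check like this can only be made per string, which is why a warning based on the terminal encoding alone would misfire on repositories that contain nothing but ASCII.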

CFoltin commented 6 years ago

Hi, this problem cost me about 3 hours to fix. So, I would propose improving the user experience. Catching the error and printing a remedy would be far better than leaving the user with a stack trace, which in my case appeared only after the ~20 minutes the tool needed to analyze the repo.

Just, my 2 cents.

BTW: In any case, this error message seems to be found....

devcurmudgeon commented 6 years ago

@CFoltin i'm with you. I was thinking of using gitinspector as part of a CI pipeline that builds a custom Linux distro ... approximately 700 repos. it worked well on some samples... but then crashed our pipeline, on a linux-api-headers repo, a couple of hours into the run.

Upstream folks obviously can make their own choices about what they want to fix, but as a user I'm not interested in informative Python stack traces, I just want working software :-)

Note - i've taken plenty of heat from users moaning about stack traces in my own projects :-)

devcurmudgeon commented 6 years ago

@adam-waldenberg thanks for your reply. I may be wrong but I think you missed at least a couple of options:

Crashing a whole run because of an unexpected (but valid) character in a git repo's metadata doesn't seem like correct behaviour to me

adam-waldenberg commented 6 years ago

@devcurmudgeon UTF-8 is forced on redirection. I won't be forcing UTF-8 on terminal output, because it's not always needed. Output also needs to work on other environments with extended UTF-8, UTF-16 etc. Strictly speaking, it's only author names (and sometimes filenames) that can be an issue. Again, skipping data would mean you get an invalid output, which is not an option either.

@CFoltin You can only catch it once you encounter it, so even if you catch it, it would still take time before you know about it. Also, depending on what is wrong with the environment, there are a number of fixes that may or may not work.

In the end, it comes down to the fact that you can't know for sure what character set you may encounter in the repository. It can even be several ones.

One option I could live with is to catch it and print the output with replaced characters... We could then add a disclaimer at the end of the output stating that the output is not 100% correct and that it had to be modified in order to accommodate the terminal charset. However, I'm afraid it would raise even more questions, as you no longer have the Python exception to search on. Alternatively, the first/last exception encountered could also be included in the disclaimer.
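A rough Python 3 sketch of that idea, assuming a hypothetical ReplacingPrinter wrapper (none of these names exist in gitinspector): output falls back to replaced characters, the first exception is remembered, and a disclaimer is written at the end.

```python
import io


class ReplacingPrinter:
    """Hypothetical wrapper: writes text, degrading to '?' when needed."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding
        self.first_error = None  # first UnicodeEncodeError seen, if any

    def write_line(self, text):
        try:
            # Probe whether the target charset can represent the text.
            text.encode(self.encoding)
        except UnicodeEncodeError as error:
            if self.first_error is None:
                self.first_error = error
            # Round-trip with errors="replace" to swap in '?' characters.
            data = text.encode(self.encoding, errors="replace")
            text = data.decode(self.encoding)
        self.stream.write(text + "\n")

    def write_disclaimer(self):
        # Only shown when at least one line had to be modified.
        if self.first_error is not None:
            self.stream.write(
                "Note: some characters were replaced to fit the %s charset "
                "(first error: %s)\n" % (self.encoding, self.first_error)
            )


# Usage with an in-memory stream standing in for an ASCII terminal.
out = io.StringIO()
printer = ReplacingPrinter(out, "ascii")
printer.write_line("Chris")
printer.write_line("J\u00f6rg")
printer.write_disclaimer()
print(out.getvalue(), end="")
```

Including the first exception text in the disclaimer keeps something searchable for the user, which addresses the concern about losing the Python exception.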

adam-waldenberg commented 3 years ago

I have decided to catch this exception and have the error message point to some of the issues here on the project page. This should help people who run into this problem understand and remedy it more effectively.

banbar commented 3 years ago

I have received a similar error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u011f' in position 380: character maps to <undefined>

Python version: 3.7

Found the solution: https://stackoverflow.com/a/57134096/1959766

adam-waldenberg commented 3 years ago

@banbar Thank you. I don't think that Windows-specific solution has been covered anywhere on the issue tracker so far. I know it's more a Python and terminal thing than it is a gitinspector thing, but I'm considering doing a F.A.Q/Wiki with common environment-related issues that can be encountered. Maybe link into the issue tracker etc.

Now and again this issue (or a related one) keeps coming up.