frej / fast-export

A mercurial to git converter using git-fast-import
http://repo.or.cz/w/fast-export.git

Commit messages are decoded as cp1252 with parameter -e utf8 #286

Closed: janssjo closed this issue 2 years ago

janssjo commented 2 years ago

Environment: Windows 10 WSL (Ubuntu 20.04).

Commit messages are decoded as cp1252 when calling with the following parameters:

<PATH>/hg-fast-export.sh -r . -m 123 --hg-hash -A authors.map -B branches.map -T tags.map -n -e utf8 -fe cp1252

For most messages, it silently garbles the commit message, e.g. hg commit ä => git commit Ã¤. For some characters in the commit message (/), however, the process crashes as follows:

Traceback (most recent call last):
  File "/<REDACTED>/hg-fast-export.py", line 737, in <module>
  File "/<REDACTED>/hg-fast-export.py", line 583, in hg2git
    plugins)
  File "/<REDACTED>/hg-fast-export.py", line 297, in export_commit
    (revnode,_,user,(time,timezone),files,desc,branch,extra)=get_changeset(ui,repo,revision,authors,encoding)
  File "/<REDACTED>/hg2git.py", line 97, in get_changeset
    desc=desc.decode(encoding).encode('utf8')
  File "/usr/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 65: character maps to <undefined>
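
Both symptoms can be reproduced outside the converter. The following is a minimal Python 3 sketch, not code from hg-fast-export (which runs under Python 2.7 in the traceback above). The helper name as_cp1252 is hypothetical, and U+201D is only an assumed example of a character whose UTF-8 encoding contains the byte 0x9d; the character in the actual commit message is not known.

```python
def as_cp1252(text):
    """Encode text as UTF-8, then (wrongly) decode those bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


# "ä" is the two bytes C3 A4 in UTF-8; cp1252 maps each byte to its own
# character, so one character silently becomes two.
print(as_cp1252("ä"))

# Assumed example: U+201D is E2 80 9D in UTF-8, and cp1252 has no
# character assigned to byte 0x9d, so decoding raises instead of garbling.
try:
    as_cp1252("\u201d")
except UnicodeDecodeError as exc:
    print(exc)
```

The silent case and the crash have the same cause: UTF-8 bytes are being read with the wrong codec. Whether the result is garbled text or an exception depends only on whether every byte happens to be assigned in cp1252.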

To inspect the encoding, I added the debug message shown in the diff below. It prints Encoding: cp1252 for -e utf8, where I expected Encoding: utf8. When called without the -e parameter, it again prints Encoding: cp1252; I expected an empty value (Encoding: ) in that case.

diff --git a/hg-fast-export.py b/hg-fast-export.py
index 93f35bf..4324c6c 100755
--- a/hg-fast-export.py
+++ b/hg-fast-export.py
@@ -695,6 +695,7 @@ if __name__=='__main__':
   encoding=''
   if options.encoding!=None:
     encoding=options.encoding
+  stderr_buffer.write(b"Encoding: %s\n" % encoding)

   fn_encoding=encoding
   if options.fn_encoding!=None:

janssjo commented 2 years ago

My mistake. The culprit was the parameter -fe cp1252 (single dash should be a double dash).

It works as expected with the parameter --fe cp1252.
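
A likely explanation for why the single dash changed the commit-message encoding is short-option clustering. This is a minimal sketch that assumes the script parses its arguments with Python's optparse and defines a boolean -f flag next to -e and --fe. The three options below are a hypothetical subset written for illustration, not copied from hg-fast-export.py.

```python
from optparse import OptionParser


def build_parser():
    # Hypothetical subset of options, assumed for illustration only.
    parser = OptionParser()
    parser.add_option("-f", "--force", action="store_true",
                      dest="force", default=False)
    parser.add_option("-e", "--encoding", dest="encoding")
    parser.add_option("--fe", dest="fn_encoding")
    return parser


def parse(argv):
    """Return (force, encoding, fn_encoding) for the given argument list."""
    options, _ = build_parser().parse_args(argv)
    return options.force, options.encoding, options.fn_encoding


# Single dash: "-fe" is read as the cluster "-f -e", so "cp1252" becomes
# the value of -e and replaces the earlier "utf8".
print(parse(["-e", "utf8", "-fe", "cp1252"]))

# Double dash: "--fe" is a separate long option, so -e keeps "utf8".
print(parse(["-e", "utf8", "--fe", "cp1252"]))
```

Under these assumptions the first call yields an encoding of cp1252 with the force flag set, and the second yields utf8 with fn_encoding set to cp1252. This matches the Encoding: cp1252 debug output seen both with and without an explicit -e.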