MestreLion / git-tools

Assorted git tools, including git-restore-mtime
GNU General Public License v3.0

UTF-8 Error #64

Closed cocox closed 1 year ago

cocox commented 1 year ago

If I try to execute the command 'git restore-mtime --test', I get this error:

Traceback (most recent call last):
  File "/usr/lib/git-core/git-restore-mtime", line 594, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/git-core/git-restore-mtime", line 530, in main
    parse_log(filelist, dirlist, stats, git, args.merge, args.pathspec)
  File "/usr/lib/git-core/git-restore-mtime", line 410, in parse_log
    file = normalize(file)
           ^^^^^^^^^^^^^^^
  File "/usr/lib/git-core/git-restore-mtime", line 254, in normalize
    .decode('utf8'))           # Decode from UTF-8
     ^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 38: invalid continuation byte

cocox commented 1 year ago

The failing files are:

"lp/2017/lp-marca/latam/brand/img/Sin-t\303\255tulo-2.jpg"
"lp/2017/lp-marca/latam/brand/img/Sin-t\355tulo-2.jpg"

MestreLion commented 1 year ago

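Just so I can reproduce your environment, please tell me:

  • What version are you using, and where did you get it from? (is git restore-mtime --version available?)
  • What is the underlying filesystem? NTFS, EXT4, something else?
  • What platform/OS and version?
  • Can you paste the non-escaped, UTF-8 filenames here? I'm having some trouble re-creating them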

MestreLion commented 1 year ago

Just created a brand new repo with just these 2 files:

"lp/2017/lp-marca/latam/brand/img/Sin-t\303\255tulo-2.jpg"

This one seems to work just fine

"lp/2017/lp-marca/latam/brand/img/Sin-t\355tulo-2.jpg"

This does not look like a valid UTF-8 filename... and it triggers the exact error you posted.

Not sure how I could handle such "invalid" filenames, or even if I should handle them...
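
To make the difference between the two names concrete, here is a minimal Python sketch (an illustration only, not code taken from git-restore-mtime) that decodes both byte sequences. In UTF-8, the byte 0xED announces a three-byte sequence, so the plain ASCII "t" that follows it is not a valid continuation byte. The helper name `try_decode` is made up for this example.

```python
# Both names as raw bytes, written with the same octal escapes git prints.
good = b"Sin-t\303\255tulo-2.jpg"   # \303\255 is the UTF-8 encoding of "í"
bad = b"Sin-t\355tulo-2.jpg"        # lone 0xED followed by ASCII "t"


def try_decode(raw: bytes) -> str:
    """Return the decoded name, or a short description of the failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"byte {raw[exc.start]:#04x} at position {exc.start}: {exc.reason}"


print(try_decode(good))
print(try_decode(bad))
```

The second call reports the same byte, position and reason as the traceback from the two-file test repository, where the name has no directory prefix.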

MestreLion commented 1 year ago

rodrigo@desktop ~/teste $ git init
Reinitialized existing Git repository in /home/rodrigo/teste/.git/
rodrigo@desktop ~/teste $ git config hooks.allownonascii true
rodrigo@desktop ~/teste $ touch "$(printf "Sin-t\303\255tulo-2.jpg")"
rodrigo@desktop ~/teste $ ls -l
total 0
-rw-rw-r--  1 rodrigo rodrigo      0 Jul 13 04:55 Sin-título-2.jpg
rodrigo@desktop ~/teste $ git add .
rodrigo@desktop ~/teste $ git commit -m 'initial'
[main (root-commit) 6f203a3] initial
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 "Sin-t\303\255tulo-2.jpg"
rodrigo@desktop ~/teste $ git-restore-mtime --verbose
1 files to be processed in work dir
Line #  Log #   F.Left  Modification Time   File Name
3   1   0   2023-07-13 04:56:14 Sin-título-2.jpg
3   1   -   2023-07-13 04:56:14 ./
Statistics:
         0.01 seconds
            3 log lines processed
            1 commits evaluated
            1 directories updated
            1 files updated
rodrigo@desktop ~/teste $ ls -l
total 0
-rw-rw-r-- 1 rodrigo rodrigo 0 Jul 13 04:56 Sin-título-2.jpg
rodrigo@desktop ~/teste $ touch "$(printf "Sin-t\355tulo-2.jpg")"
rodrigo@desktop ~/teste $ ls -l
total 0
-rw-rw-r-- 1 rodrigo rodrigo 0 Jul 13 04:56  Sin-título-2.jpg
-rw-rw-r-- 1 rodrigo rodrigo 0 Jul 13 04:57 'Sin-t'$'\355''tulo-2.jpg'
rodrigo@desktop ~/teste $ git add .
rodrigo@desktop ~/teste $ git commit -m 'bad filename'
[main 8d716db] bad filename
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 "Sin-t\355tulo-2.jpg"
rodrigo@desktop ~/teste $ git-restore-mtime --verbose
Traceback (most recent call last):
  File "/home/rodrigo/.local/bin/git-restore-mtime", line 594, in <module>
    sys.exit(main())
  File "/home/rodrigo/.local/bin/git-restore-mtime", line 486, in main
    filelist = set(git.ls_files(args.pathspec))
  File "/home/rodrigo/.local/bin/git-restore-mtime", line 311, in <genexpr>
    return (normalize(_) for _ in self._run('ls-files --full-name', paths))
  File "/home/rodrigo/.local/bin/git-restore-mtime", line 254, in normalize
    .decode('utf8'))           # Decode from UTF-8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 5: invalid continuation byte

cocox commented 1 year ago

Just so I can reproduce your environment, please tell me:

  • What version are you using, and where did you get it from? (is git restore-mtime --version available?)
    git-restore-mtime version 2022.12, through APT
  • What is the underlying filesystem? NTFS, EXT4, something else?
    ext4
  • What platform/OS and version?
    Debian 12
  • Can you paste the non-escaped, UTF-8 filenames here? I'm having some trouble re-creating them
    I think you could reproduce, correct?

MestreLion commented 1 year ago

  • Can you paste the non-escaped, UTF-8 filenames here? I'm having some trouble re-creating them
    I think you could reproduce, correct?

I guess I did, just not sure if the filenames I created are exactly the same as yours.

The first one, with proper UTF-8, git restore-mtime seems to handle just fine. Can you confirm that by creating a brand new repository with just that file?

The second one, Sin-t\355tulo-2.jpg, is the problematic one. But I'm not sure if I'm re-creating it accurately. Is it really a filename with invalid UTF-8 encoding? This looks like the old Windows-1252 encoding: \355 is the byte 0xED, which is í in that encoding, the same character that UTF-8 encodes as \303\255.
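
The Windows-1252 guess can be checked with a few lines of Python. This is only a consistency check on the bytes, sketched for illustration: it cannot prove which encoding the file was originally created with, since ISO-8859-1 maps 0xED to í as well.

```python
legacy = b"Sin-t\355tulo-2.jpg"      # the problematic name, as raw bytes

# Interpreting the bytes as Windows-1252 yields a readable name...
as_text = legacy.decode("cp1252")

# ...which, re-encoded as UTF-8, gives the bytes of the well-formed name.
as_utf8 = as_text.encode("utf-8")

print(as_utf8 == b"Sin-t\303\255tulo-2.jpg")
```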

Mixing different encodings in the same filesystem is problematic enough, let alone committing such files to a git repository. I might be able to handle such cases, but I'm not sure whether git restore-mtime should deal with invalid (or mixed) encodings.
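
One possible way to tolerate such names, sketched here purely as an assumption and not as what git-restore-mtime does or will do, is to decode with Python's `surrogateescape` error handler. Undecodable bytes are then carried through as lone surrogates instead of raising, and can be turned back into the original bytes later. `unquote_git_path` is a hypothetical helper that undoes only the C-style quoting git applies to such paths in its output.

```python
import re

# Escapes git may emit inside a double-quoted path: \NNN (octal) or a C escape.
_ESCAPE = re.compile(rb'\\([0-3][0-7]{2}|[abfnrtv"\\])')
_SIMPLE = {
    b"a": b"\a", b"b": b"\b", b"f": b"\f", b"n": b"\n", b"r": b"\r",
    b"t": b"\t", b"v": b"\v", b'"': b'"', b"\\": b"\\",
}


def _unescape(match):
    token = match.group(1)
    if token in _SIMPLE:
        return _SIMPLE[token]
    return bytes([int(token, 8)])      # three octal digits -> one raw byte


def unquote_git_path(path: str) -> str:
    """Undo git's path quoting without failing on non-UTF-8 bytes."""
    if len(path) < 2 or not (path.startswith('"') and path.endswith('"')):
        return path                    # unquoted paths are returned as-is
    inner = path[1:-1].encode("utf-8", errors="surrogateescape")
    raw = _ESCAPE.sub(_unescape, inner)
    # Invalid bytes become lone surrogates (0xED -> U+DCED) instead of raising.
    return raw.decode("utf-8", errors="surrogateescape")


name = unquote_git_path(r'"Sin-t\355tulo-2.jpg"')
print(ascii(name))
print(name.encode("utf-8", errors="surrogateescape"))
```

The trade-off is that the resulting string is not printable as strict UTF-8, so any code that displays file names would need to escape them first, for example with `ascii()`.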

MestreLion commented 1 year ago

@cocox: another test, please. Could you post the output of the following command, run in a directory containing just those 2 files?

python3 -c 'import os; d = os.listdir(); print(d); [print(_) for _ in d]'
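
For context on what this one-liner may reveal: on POSIX systems `os.listdir()` returns `str` names, and bytes that cannot be decoded with the filesystem encoding are carried as lone surrogates. The list `repr` shows them escaped, while printing such a name on its own may raise `UnicodeEncodeError` when stdout is strict UTF-8. The sketch below assumes a POSIX platform such as the Debian 12 system mentioned above. `inspect_names` is a hypothetical helper that recovers the on-disk bytes with `os.fsencode()` and reports whether they are valid UTF-8, printing only ASCII-safe output.

```python
import os


def inspect_names(names):
    """For each str name, return (raw bytes, whether they are valid UTF-8)."""
    report = []
    for name in names:
        raw = os.fsencode(name)        # recover the bytes stored on disk
        try:
            raw.decode("utf-8")
            valid = True
        except UnicodeDecodeError:
            valid = False
        report.append((raw, valid))
    return report


if __name__ == "__main__":
    # Run inside the directory holding the two files.
    for raw, valid in inspect_names(os.listdir()):
        print(raw, "valid UTF-8" if valid else "NOT valid UTF-8")
```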