bcpierce00 / unison

Unison file synchronizer
GNU General Public License v3.0
3.86k stars 224 forks source link

Use UTF-8 encoding-safe string truncation in UI #962

Closed tleedjarv closed 9 months ago

tleedjarv commented 10 months ago

The names of roots can be very long and are truncated in UI in the list of updates. Truncating by number of bytes does not play well with UTF-8 input. This patch makes the truncation safe on UTF-8 encoded strings while keeping the length of truncated strings roughly as before (and the result should also look correct for Western languages, maybe also for other scripts).

There is one tiny caveat. Since this patch applies to both TUI and GUI (also macOS native GUI), this may have a side-effect of scrambling any non-ISO-8859-1 (or Windows-1252 in GUI) and non-UTF-8 encodings, if such would ever be used. There is no change for the GTK GUI, as it has always expected either ISO-8859-1/Windows-1252 or UTF-8. I don't know if this will actually be a problem because I don't know if Unison has ever worked properly with these encodings. This caveat only applies to the root names, not to file/dir names. This is a display-only issue, it will not break any syncs!

This is a partial fix for #959.

gdt commented 9 months ago

I have called for testing on unison-users@ and will merge in 7 days if no adverse comments.

igrep commented 9 months ago

I've tested unison-7f147080.ocaml-4.14.0+mingw64c.windows-2022\unison.exe on my Windows 11 22H2 (both when chcp 932 and when chcp 65001). It looks like working without any problem after creating/deleting a あああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああ.txt.

tleedjarv commented 9 months ago

Thank you for testing. Syncing functionality is not affected by this patch. Only the truncated display of root names as "left" and "right" headings is affected. Realistically, the potential encoding issue could only manifest in the text UI because the same UTF-8 conversions have always been done in the GUIs; this patch shouldn't change anything there.

igrep commented 9 months ago

Ah, sorry. I tried again, then found it looked good:

> .\unison C:\t\ああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああ C:\t\いいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいいい
Unison 2.53.3 (ocaml 4.14.0): Contacting server...
Looking for changes
Reconciling changes
The root of one of the replicas has been completely emptied.
Unison may delete everything in the other replica.  (Set the
'confirmbigdel' preference to false to disable this check.)

Do you really want to proceed? []
Unrecognized command ' ': try again  [type '?' or F1 for help]
Do you really want to proceed? [] y
あああああああああ...   いいいいいいいいい...
         <---- deleted    新規 テキスト ドキュメント.txt  [f] ,
tleedjarv commented 9 months ago

Thank you for testing! Is this with UTF-8 or some other encoding?

igrep commented 9 months ago

Sorry again! I tried when chcp 932, and confirmed the result is same with my last comment (UTF-8: chcp 65001).

tleedjarv commented 9 months ago

Sorry again! I tried when chcp 932, and confirmed the result is same with my last comment (UTF-8: chcp 65001).

I suspect that this will always work properly in Windows, with NTFS at least. It may break (meaning display will be garbled) with FAT when the dirnames are knowingly created in codepage other than UTF-8. But I don't know how that works in Windows.

I think the only real risk case is some Unix/Linux when dirnames on disk are encoding-less blobs (as seen by the OS and the filesystem) and those blobs are neither valid UTF-8 nor Latin 1 (ASCII is of course a subset of both). (As a side note: When it comes to multi-byte encodings, the display would have always been garbled, no regression there.)

Thank you for testing! At least we know nothing's immediately broken, and even then, any eventual breakage is going to be limited and cosmetic only.