Tarsnap / tarsnap

Command-line client code for Tarsnap.
https://tarsnap.com

Fix "leading surrogate" in UTF-8 (actually CESU-8) #607

Closed gperciva closed 7 months ago

gperciva commented 8 months ago

Amusingly, the code which is described as:

This is a leading surrogate; some idiot has...

has a typo: 0xDC00 should be 0xD800.

The comment mentions a "leading surrogate", which is a synonym for a high-surrogate code unit:

A 16-bit code unit in the range D800_16 to DBFF_16, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. https://unicode.org/glossary/#high_surrogate_code_unit
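As a rough illustration of why the lower bound matters, the sketch below tests whether a 16-bit code unit is a leading surrogate. The function name and the `low` parameter are hypothetical, and treating the original check as a plain range test ending at 0xDBFF is an assumption made here for illustration; it is not copied from the libarchive source.

```python
# Range of high (leading) surrogate code units, per the Unicode glossary.
HIGH_SURROGATE_FIRST = 0xD800
HIGH_SURROGATE_LAST = 0xDBFF


def is_leading_surrogate(unit, low=HIGH_SURROGATE_FIRST):
    """Return True if `unit` lies in [low, 0xDBFF].

    `low` is exposed only so the effect of a wrong lower bound can be shown;
    assumption: the check is a simple inclusive range test.
    """
    return low <= unit <= HIGH_SURROGATE_LAST


# U+1F600 in UTF-16 is the pair D83D DE00; D83D is the leading surrogate.
print(is_leading_surrogate(0xD83D))              # correct lower bound
print(is_leading_surrogate(0xD83D, low=0xDC00))  # lower bound with the typo
```

Under that assumption, a lower bound of 0xDC00 sits above the upper bound of 0xDBFF, so the test could never match any code unit.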

What libarchive is doing here is adjusting for an invalid conversion of UTF-16 to UTF-8; the resulting encoding is now known as the Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8), published as Unicode Technical Report #26 [1].

[1] https://www.unicode.org/reports/tr26/

Essentially, if libarchive detects a surrogate pair (not allowed in UTF-8 [2]), it tries to construct the intended Unicode code point (as per CESU-8).

[2] That code point should be encoded with 4 octets, whereas the surrogate pair requires 6 octets.
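The arithmetic behind "construct the desired value" can be sketched as follows. This uses the standard UTF-16 surrogate-pair formula; the function name `combine_surrogates` is hypothetical and this is not libarchive's code. The example values are the surrogate pair for U+1F600, the emoji used in the experiments later in this thread.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a leading (high) surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a trailing (low) surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE00)
print(hex(code_point))  # 0x1f600

# Proper UTF-8 needs 4 octets for this code point ...
utf8_length = len(chr(code_point).encode("utf-8"))

# ... while encoding each surrogate separately (CESU-8 style) needs 3 + 3.
# "surrogatepass" lets Python emit the otherwise forbidden byte sequences.
cesu8_length = sum(
    len(chr(unit).encode("utf-8", "surrogatepass")) for unit in (0xD83D, 0xDE00)
)
print(utf8_length, cesu8_length)  # 4 6
```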

Caveat:

gperciva commented 8 months ago

tl;dr

There's a typo in libarchive-2.7 which stopped the code from doing what it was supposed to.

Modern libarchive uses a different method of doing the same thing (so the typo wasn't explicitly "fixed", but it's not broken any more).

Experiments

To investigate this, I created utf8-good.tar and utf8-bad.tar (in the attached zip file, because GitHub doesn't allow us to upload tar files).

tar-utf8-experiments.zip

Both tar files contain a single 0-byte file called filename-😀 (that's U+1F600, a smile emoji).

In the "good" version, the tar file begins:

00000000: 5061 7848 6561 6465 722f 6669 6c65 6e61  PaxHeader/filena
00000010: 6d65 2df0 9f98 8000 0000 0000 0000 0000  me-.............

whereas the bad one encodes the smile emoji as a surrogate pair, and begins:

00000000: 5061 7848 6561 6465 722f 6669 6c65 6e61  PaxHeader/filena
00000010: 6d65 2ded a0bd edb8 8000 0000 0000 0000  me-.............

The diff on those two lines is:

-00000010: 6d65 2df0 9f98 8000 0000 0000 0000 0000  me-.............
+00000010: 6d65 2ded a0bd edb8 8000 0000 0000 0000  me-.............

(The full diff has more: the filename is repeated 3 times, and the checksum and path length change. But those aren't important.)
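The two byte sequences in the hex dumps can be checked with a few lines of Python. This is only a sanity check of the bytes shown above, not part of the original experiment; the variable names are made up for illustration.

```python
# Bytes following "filename-" in each hex dump.
good = bytes.fromhex("f09f9880")      # utf8-good.tar
bad = bytes.fromhex("eda0bdedb880")   # utf8-bad.tar

# The good bytes are the ordinary 4-octet UTF-8 encoding of U+1F600.
print(good.decode("utf-8") == "\U0001F600")

# A strict UTF-8 decoder rejects the bad bytes, since surrogates are not
# allowed in UTF-8.
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)

# With "surrogatepass", the bad bytes decode to two lone surrogates.
units = bad.decode("utf-8", "surrogatepass")
print([hex(ord(c)) for c in units])  # ['0xd83d', '0xde00']

# Round-tripping through UTF-16 pairs the surrogates up again.
recombined = units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(recombined == "\U0001F600")
```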

tar programs

I tested libarchive-2.7, libarchive 3.6.0 (the default in FreeBSD 12.4), and GNU tar 1.35, with tar -tf utf8-good.tar and tar -tf utf8-bad.tar.

$ ~/src/libarchive-2.7/b/bsdtar -tf utf8-bad.tar 
bsdtar: Pathname in pax header can't be converted to current locale.
filename-\355\240\275\355\270\200
bsdtar: Error exit delayed from previous errors.
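The octal escapes in that output are the same six bytes seen in the hex dump of utf8-bad.tar. A small sketch to confirm this (the helper `unescape_octal` is hypothetical and only handles three-digit `\ooo` escapes):

```python
import re

# The pathname exactly as printed above by the unmodified libarchive-2.7.
escaped = r"filename-\355\240\275\355\270\200"


def unescape_octal(text):
    """Turn three-digit octal escapes (\\ooo) back into raw bytes."""
    return re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),
        text.encode("ascii"),
    )


raw = unescape_octal(escaped)
print(raw[len(b"filename-"):].hex())  # eda0bdedb880
```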

When I tried applying this fix to libarchive-2.7, it worked fine:

$ ~/src/libarchive-2.7/b/bsdtar -tf utf8-bad.tar 
filename-😀

(that's the modified libarchive-2.7)