Require DFXML be encoded as UTF-8

ajnelson-nist commented 6 years ago

DFXML can be generated for file systems that do not use UTF-8 encoding, or that even allow arbitrary bytes. (For instance, HFS (not HFS+) allows any byte in a file name except the ASCII colon character.) DFXML has three objectives when such bytes are encountered:

The original bytes should be preserved.
The original bytes should be decoded into human-readable strings.
The decoding process should present UTF-8 character strings (not byte strings) to a DFXML consumer without requiring additional scripting work to recognize transcodings (e.g. no mode should be required when opening a file that presents accented Latin characters originally encoded in macos-roman). When transcodings are done, though, they should be encoded in the DFXML and thus accessible by the DFXML API being used.

Within the DFXML schema, these constructs will be added to any encodable/transcodable string:

The string element will have an optional attribute "original_encoding" to indicate a transcoding occurred.
The original bytes encountered in the parse will be recorded in the optional attribute "original_bytes_base64".

Some new semantics will result from this, because any such string can now be in one of 8 states (represented as sets of 3 conditions), based on the presence or absence of (A) the original bytes in base64, (B) the original encoding, and (C) the element's child text. Let the absence of a condition be represented as an underscore below as we walk through the state space:

[___] It could be that a fileobject was not meant to be named, such as with DFXML being used to distribute file hashes (e.g. with some modes of hashdeep).
[__C] Absence of the original bytes and original encoding attributes implies the string was recorded in DFXML exactly as it was encountered.
[_B_] This state is a bit uninformative, and should be avoided.
[_BC] This implies original bytes were not recorded. This state should be avoided in case of an error with the script that performed the transcoding. If such an error occurred, the original bytes may not be derivable from the DFXML record.
[A__] The original bytes could not be rendered as a UTF-8 string. This can occur in an example where control characters are embedded as a file name. (H/t to @dd388 for finding a case where an HFS file system recorded "^C^B^AMove&Rename", special control characters for a Mac OS somewhere around version 7.)
[A_C] If original bytes are present and UTF-8 text is recorded, this shall imply the original encoding was UTF-8. This may be desired in cases where a unicode character could be encoded in multiple ways, such as with unicode combining characters.
[AB_] This state implies the original bytes are decodable, but do not have a corresponding point in unicode space.
[ABC] All t's crossed, all i's dotted.

In short, the preferred states are to include original bytes (conditions A**), and include UTF-8 encodings when they are reachable (conditions **C). If there is nothing more complex than ASCII, __C (omitting original bytes and original encoding) would be fine. If the character data are more complex than ASCII, and there is no chance of ambiguity, the original encoding can be omitted (condition A_C). If the file names are all ASCII, or unicode where all characters only have one representation, condition A_C would suffice; however, this may be unnecessarily difficult to determine on the fly, so unicode filenames may be best represented verbosely (condition ABC).
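The state space above can be sketched in a few lines of Python. This is only an illustration, not part of any DFXML library: the function names `transcoding_state` and `should_avoid` are hypothetical, and treating empty child text the same as absent child text is an assumption.

```python
from typing import Optional

# States the description above says should be avoided.
AVOIDED_STATES = {"_B_", "_BC"}


def transcoding_state(
    original_bytes_base64: Optional[str],
    original_encoding: Optional[str],
    text: Optional[str],
) -> str:
    """Return a three-character label such as 'A_C' or '_BC'.

    A: the original_bytes_base64 attribute is present.
    B: the original_encoding attribute is present.
    C: the element has child text (empty text counts as absent; an assumption).
    """
    a = "A" if original_bytes_base64 is not None else "_"
    b = "B" if original_encoding is not None else "_"
    c = "C" if text else "_"
    return a + b + c


def should_avoid(state: str) -> bool:
    """True for the states flagged as 'should be avoided' in the list above."""
    return state in AVOIDED_STATES


# A plain ASCII name recorded exactly as encountered.
print(transcoding_state(None, None, "README.txt"))
# A transcoded name with everything recorded.
print(transcoding_state("RmFuY3kgUHJvZHVjdCCuLmV4ZQ==", "iso-8859-1", "Fancy Product \u00ae.exe"))
```

A consumer could use such a label to decide whether the original bytes are recoverable (any `A` state) before trusting the child text.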

Thanks to @dd388 for assistance drafting this description, to @tw4l for raising the matter, and @simsong for discussion and an article on Programming in Unicode. The original structure proposed in this Issue is close to what came from discussion in the DFXML library Issue.

simsong commented 6 years ago

Good work. Thanks for the kind mention of my Usenix Unicode article; it’s one of my favorites.


ajnelson-nist commented 3 years ago

I've pushed this back a release to 1.4.0, because this needs a prototype code implementation, and a graceful-feeling solution has not yet come to mind.

ajnelson-nist commented 1 year ago

To give an illustrative example of this issue: One project I've encountered processed software files that included the "Registered" symbol in their file names. However, the method of producing those files ended up encoding that symbol as the single byte value 174 (b"\xae"), which is not valid UTF-8 on its own (the code point U+00AE is valid, but UTF-8 encodes it as the two bytes b"\xc2\xae"). This Python session transcript shows how that data should transcode to UTF-8, along with demonstrations of decoding stumbles:

>>> import base64
>>> x = b"Fancy Product \xae.exe"
>>> x
b'Fancy Product \xae.exe'
>>> base64.standard_b64encode(x)
b'RmFuY3kgUHJvZHVjdCCuLmV4ZQ=='
>>> x.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 14: ordinal not in range(128)
>>> x.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 14: invalid start byte
>>> x.decode("iso-8859-1")
'Fancy Product ®.exe'
>>> y = x.decode("iso-8859-1").encode("utf-8")
>>> y
b'Fancy Product \xc2\xae.exe'
>>> base64.standard_b64encode(y)
b'RmFuY3kgUHJvZHVjdCDCri5leGU='

As DFXML, this fileobject should present like so after implementation of this Issue in the schema:

<fileobject>
  <filename
    original_bytes_base64="RmFuY3kgUHJvZHVjdCCuLmV4ZQ=="
    original_encoding="iso-8859-1">Fancy Product ®.exe</filename>
</fileobject>
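A minimal sketch of how such an element might be written and read back with Python's standard library follows. It assumes the attribute name `original_bytes_base64` from the proposal at the top of this Issue; the helper names `make_filename_element` and `recover_original_bytes` are hypothetical and are not part of the DFXML Python bindings.

```python
import base64
import xml.etree.ElementTree as ET


def make_filename_element(raw: bytes, encoding: str) -> ET.Element:
    """Build a <filename> element in state ABC from raw file-name bytes."""
    elem = ET.Element("filename")
    elem.set("original_bytes_base64", base64.standard_b64encode(raw).decode("ascii"))
    elem.set("original_encoding", encoding)
    elem.text = raw.decode(encoding)
    return elem


def recover_original_bytes(elem: ET.Element) -> bytes:
    """Return the bytes originally encountered, preferring the base64 attribute."""
    encoded = elem.get("original_bytes_base64")
    if encoded is not None:
        return base64.standard_b64decode(encoded)
    # State __C: the text is taken to have been recorded as encountered, i.e. UTF-8.
    return (elem.text or "").encode("utf-8")


raw_name = b"Fancy Product \xae.exe"
elem = make_filename_element(raw_name, "iso-8859-1")

# Serialize as UTF-8 and parse again, as a DFXML consumer would.
serialized = ET.tostring(elem, encoding="utf-8")
parsed = ET.fromstring(serialized)

print(parsed.text)
print(recover_original_bytes(parsed))
```

The consumer never has to pass a decoding mode: the child text arrives as a character string, and the original bytes remain available on request.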

joachimmetz commented 1 year ago

@ajnelson-nist one clarifying question: by "UTF-8 character strings", do you mean the 4-byte variant of RFC 3629 or the 6-byte variant of RFC 2279? (I assume the former, but prefer to be specific about it in this context.)

ajnelson-nist commented 1 year ago

@joachimmetz : I had intended, without digging into citations, to use the 4-byte variant of RFC 3629. But you raise a fair question.

The main influences in my understanding are XML and Python, which are directly in the dependencies of most DFXML applications I'm aware of; and RDF, which is not necessarily pertinent to DFXML but I am aware does have a definition somewhere in its standards stack that its strings are UTF-8. I'm unaware of whether C++ has any inherent dependencies on Unicode; my understanding is there is no such dependency due to C++ predating unicode and just generally operating on more elementary data types, but @simsong could probably say better if we need something better said. I currently suspect we won't need better said.

I believe RFC 2279 is moot for consideration, because RFC 3629 obsoletes 2279; RFC 3629 "implements" (loose terminology) ISO 10646; ISO 10646's Annex D (albeit the 2003 version) provides a "technically equivalent" definition in the Unicode Standard per unicode.org's glossary definition of UTF-8; and unicode.org is cited (by bibliography entry, not URL) as a normative reference of XML 1.0 Fifth Edition.

Python 3's documentation cites unicode.org in this highlighted section of the documentation page "Unicode HOWTO". So I'd follow the same reference chain, ending at RFC 3629, answering your question again with "I meant 4 bytes."

I haven't done the same dive recently through RDF, but my recollection is the citation chain goes through RDF Schema following XML Schema Datatypes.

If you're aware of an application that should make DFXML consider 6-byte UTF-8, I'd be curious to hear about it, but it would be a pretty significant conflict with DFXML's foundation on XML to try to support 6-byte UTF-8.

ajnelson-nist commented 1 year ago

I should note: It appears DFXML has always invisibly required its string-y content be UTF-8 by accident, because of some technological dependencies, especially between XML and Python. This Issue was filed possibly without realizing that, but there is still a real challenge being addressed in this Issue: how to represent transcoding of non-UTF-8 source data.

joachimmetz commented 1 year ago

Sticking with the 4-byte variant makes sense: it is the current version of UTF-8, it is compatible with UTF-16, and it is the one supported by current Python 3 implementations (e.g. the surrogate restriction). I'm not sure about XML; I would assume it supports the 4-byte one.

However the 6-byte variant is still used by certain formats and doesn't have the surrogate pair restriction as far as I can tell. So might be good to account for it for compatibility reasons at minimum.
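The difference between the two variants can be observed directly in Python 3, whose "utf-8" codec follows RFC 3629. The sketch below is illustrative only; the helper name `is_rfc3629_utf8` is hypothetical, and the 5-byte sequence is a hand-built RFC 2279 encoding of the value 0x200000.

```python
def is_rfc3629_utf8(raw: bytes) -> bool:
    """True if Python's strict 'utf-8' codec accepts the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The surrogate U+D800 written with the generic 3-byte pattern.
surrogate_bytes = b"\xed\xa0\x80"
# A 5-byte sequence, legal under RFC 2279 but not under RFC 3629.
five_byte = b"\xf8\x88\x80\x80\x80"

print(is_rfc3629_utf8("\u00ae".encode("utf-8")))  # a normal 2-byte sequence
print(is_rfc3629_utf8(surrogate_bytes))
print(is_rfc3629_utf8(five_byte))

# Encoding a lone surrogate fails unless the 'surrogatepass' handler is requested.
try:
    "\ud800".encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
relaxed = "\ud800".encode("utf-8", "surrogatepass")
print(strict_encode_ok, relaxed)
```

Python has no built-in codec name for the RFC 2279 forms, so any source data that uses them would need custom decoding before it could be recorded.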

ajnelson-nist commented 1 year ago

Are you able to link to the formats that are using the 6-byte variant?

Do you know how "UTF-8 as defined by RFC 2279" should be spelled as an encoding string, e.g. like the strings in the "Standard Encodings" table in Python's codecs module?

joachimmetz commented 1 year ago

Do you know how "UTF-8 as defined by RFC 2279" should be spelled as an encoding string, e.g. like the strings in the "Standard Encodings" table in Python's codecs module?

I don't think it supports it, but I have not looked.

Are you able to link to the formats that are using the 6-byte variant?

Not off the top of my head, but I assume anything claiming to be UTF-8 before RFC 3629 became a thing.

Looks like some Microsoft formats might be using it https://learn.microsoft.com/en-us/search/?scope=OpenSpecs&terms=RFC2279

joachimmetz commented 1 year ago

The original bytes could not be rendered as a UTF-8 string. This can occur in an example where control characters are embedded as a file name

Any examples of these? AFAIK C0 and C1 control characters can be represented in 4-byte UTF-8 (RFC 3629). Also see: https://en.wikipedia.org/wiki/C0_and_C1_control_codes

AFAIK, (1) surrogates such as U+D800, (2) values (currently) not mapped to characters, and (3) values beyond U+10FFFF are the ones that will need special treatment.

For (2) 4-byte UTF-8 should be able to encode these, but might not meet the "human-readable strings" criteria mentioned above

One option could be to use "\U########" and "\u####" string notation for such characters.
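A rough sketch of that notation follows, assuming only control characters, surrogates, and unassigned code points should be escaped. The function name `escape_unprintable` and the choice of Unicode general categories are illustrative assumptions, not anything DFXML specifies.

```python
import unicodedata

# General categories to escape: control, unassigned, surrogate (an assumption).
ESCAPED_CATEGORIES = {"Cc", "Cn", "Cs"}


def escape_unprintable(text: str) -> str:
    """Replace selected code points with \\uXXXX or \\UXXXXXXXX notation.

    Literal backslashes are left alone, so the result is for display only
    and cannot be reversed unambiguously.
    """
    pieces = []
    for ch in text:
        if unicodedata.category(ch) in ESCAPED_CATEGORIES:
            code_point = ord(ch)
            if code_point > 0xFFFF:
                pieces.append(f"\\U{code_point:08x}")
            else:
                pieces.append(f"\\u{code_point:04x}")
        else:
            pieces.append(ch)
    return "".join(pieces)


# The HFS example from the top of this Issue: three control characters, then text.
print(escape_unprintable("\x03\x02\x01Move&Rename"))
# A lone surrogate and an ordinary non-ASCII character.
print(escape_unprintable("a\ud800\u00ae"))
```

Because the escaping is lossy with respect to literal backslashes, the original bytes would still need to be recorded alongside such a display string.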

Based on https://www.w3.org/TR/xml/#dt-charref and https://www.w3.org/TR/xml/#wf-Legalchar I'm not 100% sure whether an XML character reference allows "&#xD800;".

If the character reference begins with " &#x ", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with " &# ", the digits up to the terminating ; provide a decimal representation of the character's code point.

Unfortunately ISO/IEC 10646 has evolved/changed over the years [1].

This write up provides some historical context https://www.cl.cam.ac.uk/~mgk25/unicode.html

joachimmetz commented 1 year ago

What I could find is that both XML 1.0 and 1.1 are strict about not allowing such characters https://www.w3.org/TR/2006/REC-xml-20060816/Overview.html#charsets and https://www.w3.org/TR/xml11/#charsets

And if I read the following [1] correctly:

Well-formedness constraint: Legal Character

Characters referred to using character references must match the production for [Char](https://www.w3.org/TR/2006/REC-xml-20060816/Overview.html#NT-Char).

So it could be that "&#xD800;" is not allowed per the standard.
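One way to probe this is to ask an XML 1.0 parser. The sketch below uses the expat-based parser behind Python's `xml.etree.ElementTree`; it shows how that one parser behaves and is not a substitute for reading the specification. The helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True if ElementTree's parser accepts the document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A reference to an ordinary character (the Registered sign).
print(is_well_formed("<filename>&#xAE;</filename>"))
# A reference to a surrogate code point.
print(is_well_formed("<filename>&#xD800;</filename>"))
# A reference to the C0 control character U+0003.
print(is_well_formed("<filename>&#x3;</filename>"))
```

The third case matters for the HFS example earlier in the thread: the control characters there are representable in UTF-8, but most C0 controls fall outside the XML 1.0 Char production, so they cannot appear in element text even as character references.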

joachimmetz commented 1 year ago

Some Python related references:

joachimmetz commented 1 year ago

@ajnelson-nist a couple more scenarios to consider:

The original path uses a specific codepage (encoding) and is converted to Unicode, but the resulting string can be encoded back into more than one byte sequence in the original encoding, e.g. encoding U+2252 to cp932; also see https://metacpan.org/dist/ShiftJIS-CP932-MapUTF/view/MapUTF.pod#Transcoding-from-Unicode-to-CP-932. What if there are 2 (or more) paths that decode to the same string? How should the original path be best preserved?
The filename contains a path segment separator (e.g. \ or /). If it is not escaped, this leads to ambiguity, e.g. if / is a path segment separator, is 'test/1234' a single file name or a path?
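The first scenario can be checked mechanically: decode the bytes, re-encode them, and compare. The sketch below is a generic round-trip test with hypothetical helper names. The two cp932 byte sequences are ones commonly cited as alternative forms of U+2252; whether a particular codec build accepts both is left for the reader to observe rather than asserted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def roundtrip_changes_bytes(raw: bytes, encoding: str) -> Optional[bool]:
    """True if decode-then-encode yields different bytes; None if raw does not decode."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return None
    try:
        return text.encode(encoding) != raw
    except UnicodeEncodeError:
        return True


def group_by_decoded_text(candidates: Iterable[bytes], encoding: str) -> Dict[str, List[bytes]]:
    """Group byte strings by the text they decode to, skipping undecodable ones."""
    groups: Dict[str, List[bytes]] = defaultdict(list)
    for raw in candidates:
        try:
            groups[raw.decode(encoding)].append(raw)
        except UnicodeDecodeError:
            continue
    return dict(groups)


# Candidate cp932 forms of U+2252 (an assumption taken from common references).
candidates = [b"\x81\xe0", b"\x87\x90"]
for text, raws in group_by_decoded_text(candidates, "cp932").items():
    print(repr(text), [(r.hex(), roundtrip_changes_bytes(r, "cp932")) for r in raws])
```

Any group holding more than one byte string is a case where the child text alone cannot identify the original path, which is an argument for always recording the original bytes for transcoded names.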

joachimmetz commented 1 year ago

Looks like there is WTF-8 https://en.wikipedia.org/wiki/UTF-8#WTF-8

simsong commented 1 year ago

Nice.


joachimmetz commented 1 year ago

One of the software engineers raised a good point: Python has pathlib for this as well, which might also help cover the cp932 edge cases I mentioned.
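It is not stated which pathlib behaviour is meant. One guess is the "surrogateescape" error handler (PEP 383), which Python applies to file names on POSIX systems so that undecodable bytes survive a round trip through `str`. The sketch below applies that handler explicitly so it behaves the same on any platform; it is an illustration of the handler, not of pathlib itself.

```python
raw_name = b"Fancy Product \xae.exe"

# Undecodable bytes become lone low surrogates (0xAE -> U+DCAE).
as_text = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(as_text))

# The same handler restores the original bytes exactly.
restored = as_text.encode("utf-8", errors="surrogateescape")
print(restored == raw_name)

# The escaped string is not valid for strict UTF-8 output,
# so it could not be written into a UTF-8 XML document as-is.
try:
    as_text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

So this mechanism preserves bytes inside a Python process, but a DFXML writer would still need something like the proposed base64 attribute to carry them into the XML.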

dfxml-working-group / dfxml_schema

Require DFXML be encoded as UTF-8 #34