PixarAnimationStudios / OpenUSD

Universal Scene Description
http://www.openusd.org

Bi-Directional Transcoding of Invalid Identifiers #3086

Open miguelh-nvidia opened 2 weeks ago

miguelh-nvidia commented 2 weeks ago

Description of Change(s)

This is a direct implementation of the proposal for Transcoding Invalid Identifiers.

Fixes Issue(s)

jesschimein commented 2 weeks ago

Filed as internal issue #USD-9674

jesschimein commented 2 weeks ago

/AzurePipelines run

azure-pipelines[bot] commented 2 weeks ago

Azure Pipelines successfully started running 1 pipeline(s).

jesschimein commented 2 weeks ago

/AzurePipelines run

azure-pipelines[bot] commented 2 weeks ago

Azure Pipelines successfully started running 1 pipeline(s).

spitzak commented 2 weeks ago

Seems excessively complicated. I would prefer something that translates each "disallowed" byte into a small pattern, such as "_XX" where XX is the hex code, and turns '_' into '_5F' when it is followed by a digit or A-F. Since almost all non-ASCII Unicode characters are allowed in identifiers, there is no need to translate large numbers of them, which is the problem Punycode is designed to solve.
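A minimal sketch of the byte-wise scheme described here (hypothetical helper names, not code from this PR; it also ignores USD's rule that an identifier cannot start with a digit):

```python
# Sketch of the "_XX" escaping idea above. ASCII letters and digits pass
# through; every other byte becomes "_XX" in uppercase hex; "_" itself
# becomes "_5F" only when the next character could make it look like an
# escape sequence.

HEX_DIGITS = set("0123456789ABCDEF")

def escape_identifier(raw: bytes) -> str:
    out = []
    for i, b in enumerate(raw):
        c = chr(b)
        if b < 128 and c.isalnum():
            out.append(c)
        elif c == "_":
            nxt = chr(raw[i + 1]) if i + 1 < len(raw) else ""
            out.append("_5F" if nxt in HEX_DIGITS else "_")
        else:
            out.append(f"_{b:02X}")
    return "".join(out)

def unescape_identifier(s: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(s):
        if (s[i] == "_" and i + 2 < len(s)
                and s[i + 1] in HEX_DIGITS and s[i + 2] in HEX_DIGITS):
            out.append(int(s[i + 1 : i + 3], 16))  # "_XX" -> original byte
            i += 3
        else:
            out.append(ord(s[i]))                  # passthrough character
            i += 1
    return bytes(out)
```

Round-tripping is unambiguous because an unescaped "_" is never followed by two uppercase hex digits.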

miguelh-nvidia commented 2 weeks ago

Transcoding was initially considered in relation to UTF-8. In a Japanese scene, where characters are represented with multiple bytes, per-byte escaping means significant overhead: for カーテンウォール, the representation mentioned above would require 128 ASCII characters, base64 would use 32 (for reference), and the implementation here uses 18. While UTF-8 identifiers have reduced the set of characters that OpenUSD considers invalid, there are still reasons users would want ASCII-only identifiers, as is still the case for legacy pipelines and interaction with other interfaces. Take for instance this thread: https://forum.aousd.org/t/support-unicode-primvar-name-in-usdview/1438. GLSL still supports only ASCII and limits identifiers to 1024 bytes. This implementation also doesn't preclude other transcoding schemes or non-bidirectional approaches to making identifiers valid; it is just one approach that we think solves the problems mentioned above.

spitzak commented 2 weeks ago

The transcoding should only make changes necessary to make a valid USD identifier. It should not be making unnecessary changes such as changing valid Unicode characters that USD allows.

spitzak commented 2 weeks ago

I also think any transcoding should be based on the UTF-8 bytes, not on decoding the UTF-8 into Unicode code points. This is so it can deal with invalid UTF-8 which has to be supported just as much as any other arrangement of bytes that a program tries to use as an identifier.

nvmkuruc commented 2 weeks ago

The transcoding should only make changes necessary to make a valid USD identifier. It should not be making unnecessary changes such as changing valid Unicode characters that USD allows.

There are a few motivations for giving users the option to transcode to ASCII. We expect some users may want to generate content that works in older versions of OpenUSD that do not yet support UTF-8 identifiers. There's also the issue Miguel mentioned of UTF-8 identifiers not being supported in GLSL.

Ultimately, we think choosing to use this encoding is up to the user. For example, we suspect there are users with identifiers that don't require bidirectionality who would be happy to replace symbols and whitespace with _.

I also think any transcoding should be based on the UTF-8 bytes, not on decoding the UTF-8 into Unicode code points. This is so it can deal with invalid UTF-8 which has to be supported just as much as any other arrangement of bytes that a program tries to use as an identifier.

This is interesting. We hadn't considered encoding invalid UTF-8 to be a feature. We'll think a little bit about this.

nvmkuruc commented 2 weeks ago

I also think any transcoding should be based on the UTF-8 bytes, not on decoding the UTF-8 into Unicode code points. This is so it can deal with invalid UTF-8 which has to be supported just as much as any other arrangement of bytes that a program tries to use as an identifier.

This is interesting. We hadn't considered encoding invalid UTF-8 to be a feature. We'll think a little bit about this.

We talked this over and determined that for this particular transcoding scheme, we would like to keep the approach of only allowing valid UTF-8. One consideration is that changing the algorithm to support invalid code units (instead of valid code points) would impact the compactness of the encoding. We also think there's some conceptual simplicity in aligning the inputs and outputs to a particular encoding. OpenUSD already automatically applies a UTF-8 encode/decode when crossing the C++/Python boundary. There would be some complexity in supporting invalid UTF-8 that would eventually fail to decode.
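For illustration, the failure mode mentioned above looks like this at the Python boundary, where strict UTF-8 decoding is the default:

```python
data = b"ok\xfe\xff"  # 0xFE and 0xFF can never appear in well-formed UTF-8

try:
    data.decode("utf-8")  # errors="strict" is the default
    decoded = True
except UnicodeDecodeError as exc:
    decoded = False
    print(f"decode failed at byte {exc.start}: {exc.reason}")

assert not decoded
```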

That said, this proposal doesn't preclude the development of other transcoding schemes if there turns out to be enough of a use case for encoding non-UTF-8 strings as identifiers. We encourage further discussion of cases this approach may not cover.

spitzak commented 1 week ago

One possibility is to "decode" invalid bytes in the UTF-8 to U+DC80...U+DCFF, which is what Python does. You can then apply Punycode to the result. For this to be truly lossless, the UTF-8 encodings of these code points must themselves be considered invalid. I recommend the UTF-8 encodings of other UTF-16 surrogates be considered valid, however, to shorten the result, though you could define validity in other ways.

This would allow an arbitrary byte array to be encoded safely, rather than only the loosely defined set of "valid UTF-8".
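For what it's worth, Python ships this exact mapping as the "surrogateescape" error handler, which is what makes arbitrary byte arrays round-trip through str:

```python
raw = b"valid\xff\x80tail"  # 0xFF and a stray continuation byte: invalid UTF-8

# On decode, invalid bytes 0x80-0xFF map to the lone surrogates U+DC80-U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
assert text[5] == "\udcff" and text[6] == "\udc80"

# On encode, those surrogates map back to the original bytes, so the
# round trip is lossless for any byte array.
assert text.encode("utf-8", errors="surrogateescape") == raw
```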

miguelh-nvidia commented 5 days ago

I will check those changes and discuss them with Matt.