Support for Unicode UTF-8 Identifiers

erslavin commented 1 year ago

Description of Change(s)

Initial draft for review session with Pixar based on submitted white paper with additional changes to support separating out original ASCII parser and implementation of new Unicode UTF-8 parser to fully retain original logic.

Added additional lexer / parser for UTF-8 Identifiers
Added TF setting to scope changes behind TF_UTF8_IDENTIFIERS
Made first pass at Tf string methods to switch on setting and use ASCII vs. Unicode validation rules
Enabled different rules for identifiers vs. prim names (validation for prim names falls back to identifiers when under ASCII rules)
Added UCA sorting for UTF-8 strings and tests against Unicode conformance tests
Pulled common parse layer methods out of textFileFormat parser to enable use by both ASCII and Unicode flavors
Added support for Unicode UTF-8 identifiers / prim names in usdGenSchema

Caveats:

There are 4 tests that do not pass when the new TF_UTF8_IDENTIFIERS setting is on; this is due to the TF environment banner being written out, which makes the output differ from the files checked in to compare against
Original logic should be retained in all cases. There will be a small performance hit to check against the runtime switch TF_UTF8_IDENTIFIERS to determine which code paths to take (ASCII vs. Unicode). This is something that needs further investigation.
Needs testing against very large stages to understand impact of both performance hit to original logic (because of the runtime switch) as well as impact of UTF-8 character processing. In particular, the UCA (Unicode Collation Algorithm) for sorting is likely much slower than the dictionary ordering algorithm currently in place. Since this affects ordering of child prims and property names, it's something that needs to be measured on large stages.
There are a number of files that are added for unicode character class caching (to try and speed up some of the character class checks) that are generated by scripts in the pxr/base/tf/unicode directory. There is currently no support in the build process to build and run these scripts to generate these files. This should be added for completeness.
The new lexer / parser generated for Unicode path and textFileFormat processing need to be added to the portion of the build script that allows users to generate the lexers / parsers and verified that the complete generate --> build --> run sequence works for all of the new changes.
Flex 2.6.4 and bison 2.7.12 were used to generate the new lexer / parser and the existing textFileFormat ASCII parser was hand modified. Need to generate these files completely using these known versions. These versions were chosen to also support win_flex and win_bison in the future for those generating on windows outside of WSL.
The tests now have to be run with TF_UTF8_IDENTIFIERS off (the default case) and on (the new case) for complete verification of additional changes.

Fixes Issue(s)

-

[x] I have verified that all unit tests pass with the proposed changes
[X] I have submitted a signed Contributor License Agreement

spitzak commented 1 year ago

I would stay away from using any usd or locale specific collating. You do not want the sort to change depending on the locale. Also (at least on Linux) the libraries have serious errors or bugs, such as ignoring punctuation, so that this is considered the correct sorted order:

    fooa.jpg
    foo.jpg
    fook.jpg

(in case you can't figure out what happened, 'a'<'j'<'k')

It's ok if you use information about USD code points to choose collating order, but it has to be exactly the same order for all users, and with no rude surprises like the above. Also it is far more important to fix numerical ordering (so foo9 < foo10) and that usually means you can't use any stock collator other than gnu versionsort.

erslavin commented 1 year ago

@spitzak The implementation here is a (unoptimized) hand-rolled implementation of the Unicode Collation Algorithm, which you can find here: https://unicode.org/reports/tr10/. This does not depend on any locale specific functions (and in fact, ignores the very few use cases where locale context is required to ascertain correct ordering), nor does it depend on additional libraries to not introduce additional dependencies into USD.

The resulting ordering will be exactly the same for all users - this is the beauty of UCA. However, and specifically with regard to numerical sorting as you might be used to in ASCII dictionary orderings, the default collation tables (which are quite extensive) do not order latin numerics the way you might think (foo1, foo2, and foo17 should sort as foo1, foo17, and foo2 according to UCA). It is possible (and this is one case where it is explicitly stated in TR10 that you might want to have) to hand tweak the provided collation tables to get the specialized behavior you want. So while it is possible to implement a UCA variant that provides the exact same sorting order as USD does today, it is not trivial, the collation tables for this are not provided by the Unicode standard, and it would take extra effort to do so (in addition with coming up with a new set of conformance tests, which are quite extensive and provided as part of this PR in TR10).

I expect there to be some hesitance in regard to the different sorting, and that's ok (however do note that the sort order is still exactly the same for every user while still taking into account the complexities of characters outside the non-extended Latin range). The tradeoffs to having support for Unicode vs. not until there is a complete agreement in sort order to me weigh in favor of having, especially since support is behind a run-time switch that defaults to existing behavior. Where the sort order is different in the existing tests should be clear from the changes in the PR. Furthermore, USD provides functionality to customize property ordering, so there is a mitigation that can be used to preserve today's functionality, if that's specifically what is needed in your workflows.

spitzak commented 1 year ago

That is good if the sort order is the same for everybody in all cases. This includes when LC_COLLATE is set to "" or "C"?

The others are just annoyances. PLEASE fix the ASCII punctuation so it always sorts before any letters. I have to run with LC_COLLATE set to C to fix this as it makes the names average users write on files not sort in any logical way. Make absolutely sure that "foo.jpg" sorts less than "fooa.jpg". This breaks the sorting of variable-sized numbers far more than the need for strverscmp.

As for the numbers, well failure to support that is really really common. But it is hugely annoying to users who do not understand why they have to zero-pad all their numbers.

My recommendation is that the sort algorithm purposely split the strings at ASCII punctuation and numbers and sort only the character substrings with the unicode collating algorithm. The other substrings should be considered less than the characters and compared with each other using strverscmp.

spitzak commented 1 year ago

Note that LC_COLLATE="C" is not a good solution, it does not sort characters in the way users expect. But the problem with ASCII puncutation (especially '.' and '-' and '/') is so overwhelmingly bad that I have to run it this way. A hybrid solution that does this correctly but also uses Unicode for the hard part of sorting letters would be ideal.

sunyab commented 1 year ago

Filed as internal issue #USD-7818

nvmkuruc commented 1 month ago

@erslavin I think we can close this PR now that it's been broken up and integrated in 24.03.

jesschimein commented 1 month ago

/AzurePipelines run

azure-pipelines[bot] commented 1 month ago

Pull request contains merge conflicts.

erslavin commented 1 month ago

Old PR

PixarAnimationStudios / OpenUSD

Support for Unicode UTF-8 Identifiers #2120

Description of Change(s)

Fixes Issue(s)