PixarAnimationStudios / OpenUSD

Universal Scene Description
http://www.openusd.org
Other
5.5k stars 1.14k forks source link

Support for Unicode UTF-8 Identifiers #2120

Closed erslavin closed 1 month ago

erslavin commented 1 year ago

Description of Change(s)

Initial draft for review session with Pixar based on submitted white paper with additional changes to support separating out original ASCII parser and implementation of new Unicode UTF-8 parser to fully retain original logic.

Caveats:

Fixes Issue(s)

-

spitzak commented 1 year ago

I would stay away from using any usd or locale specific collating. You do not want the sort to change depending on the locale. Also (at least on Linux) the libraries have serious errors or bugs, such as ignoring punctuation, so that this is considered the correct sorted order:

    fooa.jpg
    foo.jpg
    fook.jpg

(in case you can't figure out what happened, 'a'<'j'<'k')

It's ok if you use information about USD code points to choose collating order, but it has to be exactly the same order for all users, and with no rude surprises like the above. Also it is far more important to fix numerical ordering (so foo9 < foo10) and that usually means you can't use any stock collator other than gnu versionsort.

erslavin commented 1 year ago

@spitzak The implementation here is a (unoptimized) hand-rolled implementation of the Unicode Collation Algorithm, which you can find here: https://unicode.org/reports/tr10/. This does not depend on any locale specific functions (and in fact, ignores the very few use cases where locale context is required to ascertain correct ordering), nor does it depend on additional libraries to not introduce additional dependencies into USD.

The resulting ordering will be exactly the same for all users - this is the beauty of UCA. However, and specifically with regard to numerical sorting as you might be used to in ASCII dictionary orderings, the default collation tables (which are quite extensive) do not order latin numerics the way you might think (foo1, foo2, and foo17 should sort as foo1, foo17, and foo2 according to UCA). It is possible (and this is one case where it is explicitly stated in TR10 that you might want to have) to hand tweak the provided collation tables to get the specialized behavior you want. So while it is possible to implement a UCA variant that provides the exact same sorting order as USD does today, it is not trivial, the collation tables for this are not provided by the Unicode standard, and it would take extra effort to do so (in addition with coming up with a new set of conformance tests, which are quite extensive and provided as part of this PR in TR10).

I expect there to be some hesitance in regard to the different sorting, and that's ok (however do note that the sort order is still exactly the same for every user while still taking into account the complexities of characters outside the non-extended Latin range). The tradeoffs to having support for Unicode vs. not until there is a complete agreement in sort order to me weigh in favor of having, especially since support is behind a run-time switch that defaults to existing behavior. Where the sort order is different in the existing tests should be clear from the changes in the PR. Furthermore, USD provides functionality to customize property ordering, so there is a mitigation that can be used to preserve today's functionality, if that's specifically what is needed in your workflows.

spitzak commented 1 year ago

That is good if the sort order is the same for everybody in all cases. This includes when LC_COLLATE is set to "" or "C"?

The others are just annoyances. PLEASE fix the ASCII punctuation so it always sorts before any letters. I have to run with LC_COLLATE set to C to fix this as it makes the names average users write on files not sort in any logical way. Make absolutely sure that "foo.jpg" sorts less than "fooa.jpg". This breaks the sorting of variable-sized numbers far more than the need for strverscmp.

As for the numbers, well failure to support that is really really common. But it is hugely annoying to users who do not understand why they have to zero-pad all their numbers.

My recommendation is that the sort algorithm purposely split the strings at ASCII punctuation and numbers and sort only the character substrings with the unicode collating algorithm. The other substrings should be considered less than the characters and compared with each other using strverscmp.

spitzak commented 1 year ago

Note that LC_COLLATE="C" is not a good solution, it does not sort characters in the way users expect. But the problem with ASCII puncutation (especially '.' and '-' and '/') is so overwhelmingly bad that I have to run it this way. A hybrid solution that does this correctly but also uses Unicode for the hard part of sorting letters would be ideal.

sunyab commented 1 year ago

Filed as internal issue #USD-7818

nvmkuruc commented 1 month ago

@erslavin I think we can close this PR now that it's been broken up and integrated in 24.03.

jesschimein commented 1 month ago

/AzurePipelines run

azure-pipelines[bot] commented 1 month ago
Pull request contains merge conflicts.
erslavin commented 1 month ago

Old PR