P2729 Unicode in the Library, Part 2: Normalization

wg21bot commented 1 year ago

P2729R0 Unicode in the Library, Part 2: Normalization (Zach Laine)

tahonermann commented 1 year ago

This needs SG16 review.

brycelelbach commented 1 year ago

2023-02-07 19:30 to 22:00 Issaquah Library Evolution Meeting

P2728R0: Unicode in the Library, Part 1: UTF Transcoding

P2729R0: Unicode in the Library, Part 2: Normalization

2023-02-07 19:30 to 22:00 UTC-8 Issaquah Library Evolution Minutes

Champion: Zach Laine (IP)

Chair: Bryce Adelstein Lelbach (IP) & Ben Craig (IP)

Minute Taker: Robert Leahy (IP)

Start: 2023-02-07 19:41 UTC-8

Does this paper have:

Examples?
- Yes.
Field experience?
- Based on Boost Text. There is no clean room implementation from specification.
Performance considerations?
- Yes.
Discussion of prior art?
- Yes.
Changes Library Evolution previously requested?
- N/A - new paper.
Wording?
- No.
Breaking changes?
- No.
Feature test macro?
- Yes.
Freestanding considered?
- Yes.

Open Questions:

Should text facilities support null-terminated strings as input?
What should happen when ill-formed Unicode is encountered? Return the replacement character, throw an exception, or terminate?
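
To make the second question concrete, here is a minimal sketch of a UTF-8 to UTF-32 decoder with a selectable error policy. Every name in it (`on_error`, `decode_utf8`, `replacement_character`) is hypothetical and not taken from P2728 or P2729. It emits one U+FFFD per rejected lead byte, which is simpler than the "maximal subpart" practice the Unicode Standard recommends, and it leaves out the "terminate" option.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical names throughout; this is not the API proposed in P2728/P2729.
enum class on_error { replace, throw_exception };

inline constexpr char32_t replacement_character = U'\uFFFD';

std::u32string decode_utf8(std::string_view in,
                           on_error policy = on_error::replace) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // expected sequence length; 0 = invalid lead byte
        char32_t cp = 0;
        char32_t min_cp = 0;  // smallest code point allowed for this length
        if (b0 < 0x80)                     { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }

        // Reject bad lead bytes and truncated sequences, then check that
        // every trailing byte has the 10xxxxxx continuation pattern.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t j = 1; ok && j < len; ++j) {
            const auto b = static_cast<unsigned char>(in[i + j]);
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, values above U+10FFFF, and surrogates.
        ok = ok && cp >= min_cp && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            out.push_back(cp);
            i += len;
            continue;
        }
        if (policy == on_error::throw_exception)
            throw std::invalid_argument("ill-formed UTF-8");
        out.push_back(replacement_character);  // substitute and resynchronize
        ++i;
    }
    return out;
}
```

With the default policy, `decode_utf8("a\xFFz")` yields `a`, U+FFFD, `z`. With `on_error::throw_exception` the same input throws `std::invalid_argument`.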

Typo in P2728 section 2: "3 UTF-8 code units in sequence may encode a particular code unit" -> the second "code unit" should be "code point".

Typo in P2729 section 4.2: is_normalized calls in the examples should take the format.

Typo in P2729 section 5.2: Unicode versions should have types.

Why utf_8_to_16_iterator instead of utf8_to_16_iterator? Why not use a template parameter for the sizes?

Should formats be enumerators, or should each be its own trivial type?
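
The two options in this question can be sketched side by side. The names below (`nf`, `nfc_t`, `is_compat_enum`, `is_compat_tag`) are illustrative assumptions and are not quoted from the paper.

```cpp
#include <type_traits>

// Option A (assumed spelling): forms are enumerators, passed as a
// non-type template parameter.
enum class nf { c, d, kc, kd };

template <nf Form>
constexpr bool is_compat_enum() {
    return Form == nf::kc || Form == nf::kd;
}

// Option B (assumed spelling): each form is its own empty tag type,
// so behavior is selected by ordinary overload resolution.
struct nfc_t {};
struct nfd_t {};
struct nfkc_t {};
struct nfkd_t {};

inline constexpr nfc_t nfc{};
inline constexpr nfd_t nfd{};
inline constexpr nfkc_t nfkc{};
inline constexpr nfkd_t nfkd{};

constexpr bool is_compat_tag(nfc_t) { return false; }
constexpr bool is_compat_tag(nfd_t) { return false; }
constexpr bool is_compat_tag(nfkc_t) { return true; }
constexpr bool is_compat_tag(nfkd_t) { return true; }

static_assert(std::is_empty_v<nfkc_t>, "tag types carry no state");
```

Option A keeps the set of forms closed and in one place. Option B lets a form be passed as a plain function argument, for example `is_compat_tag(nfkc)`, at the cost of one overload per form.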

Maybe the fast but verbose code example shouldn't be the first one in the paper.

Transcoding iterators should model the iterator category of the underlying iterator.

Unicode version should be queried with runtime functions, not constexpr variables.

Why use template parameters for normalization forms but not UTFs? I'd prefer consistency.

End: 21:56

Summary

We took an early look at P2728 and P2729, which propose Unicode facilities for the C++ Standard Library. The proposals include both low-level facilities, which should have speed-of-light performance, and higher-level facilities that are composable and easy to use (such as views and ranges).

Next Steps

Proceed with review and incubation in the Text and Unicode study group.

brycelelbach commented 1 year ago

@tahonermann please send this to Library Evolution when it's ready.

tahonermann commented 1 year ago

SG16 review of this paper remains pending while SG16 iterates on P2728 (Unicode in the Library, Part 1: UTF Transcoding).
