buganini / bsdconv

A simple but powerful DSL for charset/encoding conversion and transformation, pure C implementation with no extra dependencies
https://bsdconv.io/bsdconv/
BSD 2-Clause "Simplified" License
53 stars, 6 forks

CESU-8 (UTF-16 surrogates in UTF-8) and strict UTF-8 (no overlong, no surrogates). #15

Closed: Artoria2e5 closed this issue 7 years ago

buganini commented 7 years ago

For UTF-8, does "overlong" mean unnecessarily long (like using \xC1\xA1 for 'a'), or a code point above U+10FFFF?
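
To illustrate the first reading, the \xC1\xA1 example can be checked by hand. The sketch below is in Python rather than bsdconv's C, and the helper names (`decode_two_bytes`, `shortest_length`, `is_overlong`) are hypothetical, not part of bsdconv. It decodes a 2-byte sequence without any strictness checks and then compares the length used against the shortest form the code point needs.

```python
def decode_two_bytes(seq: bytes) -> int:
    """Loosely decode a 2-byte sequence (110xxxxx 10xxxxxx) to a code point."""
    lead, cont = seq  # raises ValueError unless seq is exactly two bytes
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a 2-byte UTF-8 pattern")
    # 5 payload bits from the lead byte, 6 from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def shortest_length(cp: int) -> int:
    """Number of bytes the shortest UTF-8 form of cp needs."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def is_overlong(seq: bytes) -> bool:
    """True if the 2-byte sequence encodes a code point that fits in fewer bytes."""
    return len(seq) > shortest_length(decode_two_bytes(seq))
```

Here `decode_two_bytes(b"\xC1\xA1")` gives 0x61 ('a'), which fits in one byte, so `is_overlong` returns True. For `b"\xC3\xA9"` (U+00E9) it returns False, because two bytes is already the shortest form. Python's own decoder is strict in this sense: `b"\xC1\xA1".decode("utf-8")` raises `UnicodeDecodeError`.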

Artoria2e5 commented 7 years ago

"Overlong" means unnecessarily long.

buganini commented 7 years ago

With or without CESU, and strict or loose UTF-8: which combination should be the default behavior? All four combinations can live in the same decoder, https://github.com/buganini/bsdconv/blob/master/modules/from/_UTF-8.c, selected with parameters like _UTF-8#strict, _UTF-8#loose, _UTF-8#cesu, and _UTF-8#nocesu.
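
A rough sketch of how one decoder could cover all four combinations, written in Python for brevity. It does not mirror the code in _UTF-8.c; the names `decode`, `_read`, `strict`, and `cesu` are hypothetical stand-ins for the proposed parameters. The assumptions are: strict mode rejects overlong forms and unpaired surrogates, CESU mode joins a 3-byte high surrogate followed by a 3-byte low surrogate into one code point, and values above U+10FFFF are rejected in every mode.

```python
# Smallest code point that legitimately needs a sequence of each length.
_MIN = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}


def _read(data: bytes, i: int):
    """Loosely decode one sequence starting at data[i]; return (code_point, length)."""
    b0 = data[i]
    if b0 < 0x80:
        return b0, 1
    if b0 & 0xE0 == 0xC0:
        n, cp = 2, b0 & 0x1F
    elif b0 & 0xF0 == 0xE0:
        n, cp = 3, b0 & 0x0F
    elif b0 & 0xF8 == 0xF0:
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError(f"invalid lead byte at offset {i}")
    tail = data[i + 1:i + n]
    if len(tail) != n - 1 or any(b & 0xC0 != 0x80 for b in tail):
        raise ValueError(f"truncated or malformed sequence at offset {i}")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    return cp, n


def decode(data: bytes, strict: bool = True, cesu: bool = False) -> list:
    """Decode bytes to a list of code points under the chosen strict/cesu flags."""
    out, i = [], 0
    while i < len(data):
        cp, n = _read(data, i)
        if cp > 0x10FFFF:
            raise ValueError(f"code point above U+10FFFF at offset {i}")
        if strict and cp < _MIN[n]:
            raise ValueError(f"overlong sequence at offset {i}")
        # CESU: a 3-byte high surrogate followed by a 3-byte low surrogate.
        if cesu and n == 3 and 0xD800 <= cp <= 0xDBFF and i + n < len(data):
            lo, m = _read(data, i + n)
            if m == 3 and 0xDC00 <= lo <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00)
                n += m
        # Anything still in the surrogate range here is unpaired (or cesu is off).
        if strict and 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate at offset {i}")
        out.append(cp)
        i += n
    return out
```

With these assumptions, `decode(b"\xED\xA0\xBD\xED\xB8\x80", cesu=True)` yields `[0x1F600]`, the same input with the defaults raises `ValueError`, and with `strict=False` it passes the two surrogates through as `[0xD83D, 0xDE00]`. Keeping both flags on one function makes the choice of default a one-line change.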

Artoria2e5 commented 7 years ago

"Strict" UTF-8 without CESU should be the default.