Open natebosch opened 4 years ago
There is no great need for a converter, since `String.codeUnits` and `String.fromCharCodes` will do the job, but that also means that it should be trivial to implement. It might make sense in some situations, e.g., to use with `Stream.transform`.
The plain UTF-16 converter should not be an `Encoding`, since it doesn't output bytes.
It might actually make sense to have specialized UTF-16 little-endian and UTF-16 big-endian converters, which might even be `Encoding`s (but it's probably safest to keep them as plain `Converter`s).
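A minimal sketch of what such a plain code-unit converter pair could look like. The class names `Utf16CodeUnitDecoder` and `Utf16CodeUnitEncoder` are hypothetical, not an existing `dart:convert` API; they only wrap `String.fromCharCodes` and `String.codeUnits`, and they extend `Converter` rather than `Encoding` because they deal in code units, not bytes.

```dart
import 'dart:convert';

/// Hypothetical converter from UTF-16 code units to a String.
/// A plain Converter, not an Encoding: its input is 16-bit code units.
class Utf16CodeUnitDecoder extends Converter<List<int>, String> {
  const Utf16CodeUnitDecoder();

  @override
  String convert(List<int> input) => String.fromCharCodes(input);
}

/// Hypothetical converter from a String to its UTF-16 code units.
class Utf16CodeUnitEncoder extends Converter<String, List<int>> {
  const Utf16CodeUnitEncoder();

  @override
  List<int> convert(String input) => input.codeUnits;
}
```

For example, `const Utf16CodeUnitDecoder().convert([0x6C34])` returns the one-character string 水. To use these with `Stream.transform`, the classes would presumably also need to override `startChunkedConversion`, since the default `Converter` implementation does not support chunked conversion.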
There is a difference between the `package:utf` implementation of `decodeUtf16` and using `String.fromCharCodes`.
The former could decode the bytes `[0xFE, 0xFF, 0x6C, 0x34]` into 水. To get the same character using `String.fromCharCodes`, you need to convert the bytes to char codes first; it wants `[0xFEFF, 0x6C34]` as input.
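The byte-to-code-unit step can be sketched as follows. `decodeUtf16Bytes` is a hypothetical helper, not the `package:utf` implementation; it assumes big-endian input unless a byte order mark says otherwise, and it drops the BOM itself, because `String.fromCharCodes` keeps `0xFEFF` as an ordinary character.

```dart
/// Hypothetical helper: decodes UTF-16 bytes into a String.
/// Assumes big-endian byte order unless a little-endian BOM is present.
String decodeUtf16Bytes(List<int> bytes) {
  if (bytes.length.isOdd) {
    throw FormatException('UTF-16 input must have an even number of bytes');
  }
  var littleEndian = false;
  var start = 0;
  if (bytes.length >= 2) {
    if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
      start = 2; // Big-endian BOM: skip it.
    } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
      littleEndian = true; // Little-endian BOM: skip it and swap byte order.
      start = 2;
    }
  }
  // Combine each pair of bytes into one 16-bit code unit.
  final codeUnits = <int>[];
  for (var i = start; i < bytes.length; i += 2) {
    final first = bytes[i] & 0xFF;
    final second = bytes[i + 1] & 0xFF;
    codeUnits.add(littleEndian ? (second << 8) | first : (first << 8) | second);
  }
  return String.fromCharCodes(codeUnits);
}
```

With this sketch, `decodeUtf16Bytes([0xFE, 0xFF, 0x6C, 0x34])` yields 水, whereas `String.fromCharCodes([0xFEFF, 0x6C34])` yields a two-code-unit string that still starts with U+FEFF.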
Exactly, that's what I was alluding to with a UTF-16 little/big-endian converter, which is a byte-to-string converter, not a code-unit-to-string converter. Your example appears to be big-endian (aka network order).
I would expect a plain `Utf16Converter` to convert from UTF-16 to `String`, and UTF-16 is code units, not bytes representing code units.
We can do all of these, but the endian-based converters are likely more useful.
It seems that we do have a few uses of UTF-16 decoding internally. It's definitely not a high priority (there are literally about 4 uses of this), but we'll probably want to have something in `dart:convert` to support this use case, as it will block the null safety migration eventually.
This is very useful for emoji support. Even a simple task, such as highlighting a searched emoji in text, takes too much time and too many lines of code without UTF-16 support.
I extracted these lines of code from the `utf` library (https://pub.dev/packages/utf). Not sure about legal requirements. The code is licensed under BSD-3.
```dart
/// Invalid codepoints or encodings may be substituted with the value U+fffd.
const int _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT = 0xfffd;
const int _UNICODE_BYTE_ZERO_MASK = 0xff;
const int _UNICODE_BYTE_ONE_MASK = 0xff00;
const int _UNICODE_VALID_RANGE_MAX = 0x10ffff;
const int _UNICODE_PLANE_ONE_MAX = 0xffff;
const int _UNICODE_UTF16_RESERVED_LO = 0xd800;
const int _UNICODE_UTF16_RESERVED_HI = 0xdfff;
const int _UNICODE_UTF16_OFFSET = 0x10000;
const int _UNICODE_UTF16_SURROGATE_UNIT_0_BASE = 0xd800;
const int _UNICODE_UTF16_SURROGATE_UNIT_1_BASE = 0xdc00;
const int _UNICODE_UTF16_HI_MASK = 0xffc00;
const int _UNICODE_UTF16_LO_MASK = 0x3ff;

/// Produce a list of UTF-16LE encoded bytes. This method produces UTF-16LE
/// bytes with no BOM.
List<int> encodeUtf16le(String str) {
  final utf16CodeUnits = _stringToUtf16CodeUnits(str);
  final encoding = List<int>.filled(2 * utf16CodeUnits.length, -1);
  var i = 0;
  for (final unit in utf16CodeUnits) {
    // Low byte first, then high byte (little-endian).
    encoding[i++] = unit & _UNICODE_BYTE_ZERO_MASK;
    encoding[i++] = (unit & _UNICODE_BYTE_ONE_MASK) >> 8;
  }
  return encoding;
}

List<int> _stringToUtf16CodeUnits(String str) {
  // Pass code points (runes), not code units: with str.codeUnits every
  // surrogate half falls in the reserved range below and would be replaced
  // with U+fffd, which breaks emoji and other supplementary characters.
  return codepointsToUtf16CodeUnits(str.runes.toList());
}

/// Encode code points as UTF16 code units.
List<int> codepointsToUtf16CodeUnits(List<int> codepoints,
    {int offset = 0,
    int? length,
    int replacementCodepoint = _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT}) {
  // Honor offset and length, which were previously accepted but ignored.
  final listRange = length == null
      ? codepoints.sublist(offset)
      : codepoints.sublist(offset, offset + length);

  // First pass: count how many code units the output needs.
  var encodedLength = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI &&
            value <= _UNICODE_PLANE_ONE_MAX)) {
      encodedLength++;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      encodedLength += 2;
    } else {
      encodedLength++;
    }
  }

  // Second pass: write the code units.
  final codeUnitsBuffer = List<int>.filled(encodedLength, -1);
  var j = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI &&
            value <= _UNICODE_PLANE_ONE_MAX)) {
      // Basic Multilingual Plane: one code unit, unchanged.
      codeUnitsBuffer[j++] = value;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      // Supplementary plane: split into a surrogate pair.
      final base = value - _UNICODE_UTF16_OFFSET;
      codeUnitsBuffer[j++] = _UNICODE_UTF16_SURROGATE_UNIT_0_BASE +
          ((base & _UNICODE_UTF16_HI_MASK) >> 10);
      codeUnitsBuffer[j++] = _UNICODE_UTF16_SURROGATE_UNIT_1_BASE +
          (base & _UNICODE_UTF16_LO_MASK);
    } else {
      // Lone surrogate or out-of-range value: substitute the replacement.
      codeUnitsBuffer[j++] = replacementCodepoint;
    }
  }
  return codeUnitsBuffer;
}
```
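As a worked example of the surrogate-pair arithmetic used above, here is a small standalone sketch. The helper names `toSurrogatePair` and `codeUnitsToBytesLe` are hypothetical and exist only for illustration; the constants are the same values as in the extracted code.

```dart
/// Hypothetical helper: splits a supplementary code point
/// (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
List<int> toSurrogatePair(int codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw ArgumentError.value(
        codePoint, 'codePoint', 'not a supplementary code point');
  }
  final base = codePoint - 0x10000; // 20-bit value
  final high = 0xD800 + (base >> 10); // top 10 bits
  final low = 0xDC00 + (base & 0x3FF); // bottom 10 bits
  return [high, low];
}

/// Hypothetical helper: serializes code units as little-endian bytes,
/// low byte first.
List<int> codeUnitsToBytesLe(List<int> codeUnits) => [
      for (final unit in codeUnits) ...[unit & 0xFF, (unit >> 8) & 0xFF],
    ];
```

For U+1F600 (😀), `toSurrogatePair(0x1F600)` gives `[0xD83D, 0xDE00]`, which matches the string's own `codeUnits`, and `codeUnitsToBytesLe` turns that pair into the bytes `[0x3D, 0xD8, 0x00, 0xDE]`.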
The only way we had to decode UTF-16 previously was `package:utf`, which has been discontinued. We should add a `utf16` encoder and decoder here.