dart-lang / core

This repository is home to core Dart packages.
https://pub.dev/publishers/dart.dev
BSD 3-Clause "New" or "Revised" License

Add support for UTF-16 #266

Open natebosch opened 4 years ago

natebosch commented 4 years ago

The only way we previously had to decode UTF-16 was package:utf, which has been discontinued. We should add a UTF-16 encoder and decoder here.

lrhn commented 4 years ago

There is no great need for a converter, since String.codeUnits and String.fromCharCodes will do the job, but that also means that it should be trivial to implement. It might make sense in some situations, e.g., to use with Stream.transform.
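As a minimal sketch of the round trip described above, the two wrappers below only forward to String.codeUnits and String.fromCharCodes. The function names are hypothetical and chosen for illustration; they are not part of any existing package.

```dart
/// Hypothetical helper: returns the UTF-16 code units of [text].
List<int> toUtf16CodeUnits(String text) => text.codeUnits;

/// Hypothetical helper: rebuilds a string from UTF-16 code units.
/// Surrogate pairs in [units] are kept as-is, so non-BMP characters survive.
String fromUtf16CodeUnits(List<int> units) => String.fromCharCodes(units);
```

For example, toUtf16CodeUnits('a\u6C34\u{1F600}') yields four code units, 0x61, 0x6C34, 0xD83D and 0xDE00, because U+1F600 needs a surrogate pair.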

The plain UTF-16 converter should not be an Encoding since it doesn't output bytes. It might actually make sense to have specialized UTF-16 little-endian and UTF-16 big-endian converters, which might even be Encodings (but it's probably safest to keep them as plain Converters).

natebosch commented 4 years ago

There is a difference between the package:utf implementation of decodeUtf16 and using String.fromCharCodes.

The former could decode the bytes [0xFE, 0xFF, 0x6C, 0x34] into 水. To get the same character using String.fromCharCodes, you first need to convert the bytes to char codes: it expects [0xFEFF, 0x6C34] as input.
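The byte-to-code-unit step could be sketched as follows. This is an illustration only, assuming big-endian input and that a leading byte order mark should be dropped; decodeUtf16BeBytes is a hypothetical name, not the package:utf API.

```dart
/// Hypothetical helper: decodes big-endian UTF-16 bytes into a String.
String decodeUtf16BeBytes(List<int> bytes) {
  if (bytes.length.isOdd) {
    throw FormatException('UTF-16 input must have an even number of bytes');
  }
  // Combine each pair of bytes into one 16-bit code unit, high byte first.
  final units = <int>[
    for (var i = 0; i < bytes.length; i += 2) (bytes[i] << 8) | bytes[i + 1],
  ];
  // Skip a leading byte order mark (U+FEFF) so it is not part of the result.
  final start = units.isNotEmpty && units.first == 0xFEFF ? 1 : 0;
  return String.fromCharCodes(units, start);
}
```

With this helper, decodeUtf16BeBytes([0xFE, 0xFF, 0x6C, 0x34]) returns the single character U+6C34.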

lrhn commented 4 years ago

Exactly, that's what I was alluding to with a UTF-16 little/big-endian converter, which is a byte-to-string converter, not a code-unit-to-string converter. Your example appears to be big-endian (aka network order).

I would expect a plain Utf16Converter to convert from UTF-16 to String, and UTF-16 is code units, not bytes representing code units. We can do all of these, but the endian-based converters are likely more useful.
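One possible shape for an endian-based converter is sketched below. The class name Utf16BytesDecoder is hypothetical and does not exist in dart:convert. The sketch only implements whole-input conversion: it does not override startChunkedConversion, so it would need more work before it could be used with Stream.transform, and it does not strip a byte order mark.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical byte-to-string converter for UTF-16 with a fixed endianness.
class Utf16BytesDecoder extends Converter<List<int>, String> {
  final Endian endian;

  const Utf16BytesDecoder(this.endian);

  @override
  String convert(List<int> input) {
    if (input.length.isOdd) {
      throw FormatException('UTF-16 input must have an even number of bytes');
    }
    // Copy into a typed buffer so ByteData can read 16-bit units directly.
    final data = ByteData.sublistView(Uint8List.fromList(input));
    final units = <int>[
      for (var i = 0; i < input.length; i += 2) data.getUint16(i, endian),
    ];
    return String.fromCharCodes(units);
  }
}
```

Under these assumptions, const Utf16BytesDecoder(Endian.big).convert([0x6C, 0x34]) and const Utf16BytesDecoder(Endian.little).convert([0x34, 0x6C]) both produce the character U+6C34.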

michalt commented 3 years ago

It seems that we do have a few uses of UTF-16 decoding internally. It's definitely not a high priority (there are literally about 4 uses of this), but we'll probably want to have something in dart:convert to support this use case, as it will block the null safety migration eventually.

Dersh commented 2 years ago

This is very useful for emoji support. A simple task such as highlighting a searched emoji in text takes too much time and too many lines of code without UTF-16 support.

timobaehr commented 1 year ago

I extracted these lines of code from the utf library (https://pub.dev/packages/utf). I'm not sure about the legal requirements. The code is licensed under BSD-3.

/// Invalid codepoints or encodings may be substituted with the value U+fffd.
const int _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT = 0xfffd;

const int _UNICODE_BYTE_ZERO_MASK = 0xff;
const int _UNICODE_BYTE_ONE_MASK = 0xff00;

const int _UNICODE_VALID_RANGE_MAX = 0x10ffff;
const int _UNICODE_PLANE_ONE_MAX = 0xffff;

const int _UNICODE_UTF16_RESERVED_LO = 0xd800;
const int _UNICODE_UTF16_RESERVED_HI = 0xdfff;
const int _UNICODE_UTF16_OFFSET = 0x10000;
const int _UNICODE_UTF16_SURROGATE_UNIT_0_BASE = 0xd800;
const int _UNICODE_UTF16_SURROGATE_UNIT_1_BASE = 0xdc00;
const int _UNICODE_UTF16_HI_MASK = 0xffc00;
const int _UNICODE_UTF16_LO_MASK = 0x3ff;

/// Produce a list of UTF-16LE encoded bytes. This method produces UTF-16LE
/// bytes with no BOM.
List<int> encodeUtf16le(String str) {
  final utf16CodeUnits = _stringToUtf16CodeUnits(str);
  final encoding = List<int>.filled(2 * utf16CodeUnits.length, -1);
  var i = 0;
  for (final unit in utf16CodeUnits) {
    encoding[i++] = unit & _UNICODE_BYTE_ZERO_MASK;
    encoding[i++] = (unit & _UNICODE_BYTE_ONE_MASK) >> 8;
  }
  return encoding;
}

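List<int> _stringToUtf16CodeUnits(String str) {
  // Iterate code points (runes), not code units: str.codeUnits would feed
  // surrogate halves to the encoder, which replaces them with U+FFFD.
  return codepointsToUtf16CodeUnits(str.runes.toList());
}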

/// Encode code points as UTF16 code units.
List<int> codepointsToUtf16CodeUnits(List<int> codepoints,
    {int offset = 0,
      int? length,
      int replacementCodepoint = _UNICODE_REPLACEMENT_CHARACTER_CODEPOINT}) {
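  // Restrict encoding to the requested [offset, offset + length) range.
  final listRange =
      codepoints.sublist(offset, length == null ? null : offset + length);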
  var encodedLength = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI && value <= _UNICODE_PLANE_ONE_MAX)) {
      encodedLength++;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      encodedLength += 2;
    } else {
      encodedLength++;
    }
  }

  final codeUnitsBuffer = List<int>.filled(encodedLength, -1);
  var j = 0;
  for (final value in listRange) {
    if ((value >= 0 && value < _UNICODE_UTF16_RESERVED_LO) ||
        (value > _UNICODE_UTF16_RESERVED_HI && value <= _UNICODE_PLANE_ONE_MAX)) {
      codeUnitsBuffer[j++] = value;
    } else if (value > _UNICODE_PLANE_ONE_MAX &&
        value <= _UNICODE_VALID_RANGE_MAX) {
      var base = value - _UNICODE_UTF16_OFFSET;
      codeUnitsBuffer[j++] = _UNICODE_UTF16_SURROGATE_UNIT_0_BASE +
          ((base & _UNICODE_UTF16_HI_MASK) >> 10);
      codeUnitsBuffer[j++] =
          _UNICODE_UTF16_SURROGATE_UNIT_1_BASE + (base & _UNICODE_UTF16_LO_MASK);
    } else {
      codeUnitsBuffer[j++] = replacementCodepoint;
    }
  }
  return codeUnitsBuffer;
}
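
The surrogate-pair arithmetic in the second loop above can be checked in isolation. The helper below is a hypothetical extraction for illustration, with the constants inlined; it applies the same subtraction, mask and shift to a single supplementary code point.

```dart
/// Hypothetical helper: splits one code point above U+FFFF into its
/// [high, low] UTF-16 surrogate pair.
List<int> toSurrogatePair(int codePoint) {
  if (codePoint <= 0xFFFF || codePoint > 0x10FFFF) {
    throw ArgumentError.value(codePoint, 'codePoint', 'not a supplementary code point');
  }
  final base = codePoint - 0x10000; // 20-bit value
  final high = 0xD800 + ((base & 0xFFC00) >> 10); // top 10 bits
  final low = 0xDC00 + (base & 0x3FF); // bottom 10 bits
  return [high, low];
}
```

For U+1F600, base is 0xF600, so the pair is 0xD83D followed by 0xDE00. Written out as UTF-16LE bytes, that is 0x3D, 0xD8, 0x00, 0xDE.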