cometkim / unicode-segmenter

A lightweight and fast, pure JavaScript library for Unicode segmentation
MIT License
37 stars 0 forks source link
emoji grapheme grapheme-cluster unicode utf-16 utf-8

unicode-segmenter

Package Version Integration codecov LICENSE - MIT

A lightweight and fast, pure JavaScript library for Unicode segmentation.

Features

unicode-segmenter includes utilities to deal with:

With no dependencies, so you can use it even in places where built-in Unicode libraries aren't available, such as old browsers, edge runtimes, and embedded environments.

Unicode® version

Unicode® 15.1.0 Standard Annex #29 Revision 43 (2023-08-16)

Runtime compatibility

unicode-segmenter uses most basic ES6+ features like generators, modules and String.prototype.codePointAt().

Those are available in lightweight JS runtimes like QuickJS as well as (not very) modern browsers. You can still use the library even in IE11 by transpiling/polyfilling them using Babel, regenerator, etc.

Usage

Using TypeScript

No worry. The project is fully type-checked, and provides *.d.ts for you 😉

Export unicode-segmenter/emoji

Utilities for matching emoji-like characters

Example: Use Unicode emoji property matches

import {
  isEmojiPresentation,    // match \p{Emoji_Presentation}
  isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';

isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false

isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true

Export unicode-segmenter/general

Utilities for matching alphanumeric characters

Example: Use Unicode general property matchers

import {
  isLetter,       // match \p{L}
  isNumeric,      // match \p{N}
  isAlphabetic,   // match \p{Alphabetic}
  isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';

Export unicode-segmenter/grapheme

Utilities for text segmentation by extended grapheme cluster rules

Example: Count graphemes

import { countGrapheme } from 'unicode-segmenter/grapheme';

'👋 안녕!'.length;
// => 6
countGrapheme('👋 안녕!');
// => 5

'a̐éö̲'.length;
// => 7
countGrapheme('a̐éö̲');
// => 3

Example: Get grapheme segments

import { graphemeSegments } from 'unicode-segmenter/grapheme';

[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }

Example: Build an advanced grapheme matcher

graphemeSegments() exposes some knowledge identified in the middle of the process to support some useful cases.

For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer the applied boundary rule.

import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

function* matchEmoji(str) {
  for (const { segment, _catBegin } of graphemeSegments(input)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}

[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍

Export unicode-segmenter/intl-adapter

Intl.Segmenter API adapter (only granularity: "grapheme" available yet)

import { Segmenter } from 'unicode-segmenter/intl-adapter';

// Same API with the `Intl.Segmenter`
const segmenter = new Segmenter();

Export unicode-segmenter/intl-polyfill

Intl.Segmenter API polyfill (only granularity: "grapheme" available yet)

// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';

const segmenter = new Intl.Segmenter();

Export unicode-segmenter/utils

You can access some internal utilities to deal with UTF-8 in the JavaScript

Example: Handle UTF-16 surrogate pairs

import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';

const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);

if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo);
  // => equivalent to u32.codePointAt(0)
}

Example: Determine the length of a character

import { isBMP } from 'unicode-segmenter/utils';

const char = '😍'; // .length = 2
const cp = char.codePointAt(0);

char.length === isBMP(cp) ? 1 : 2;
// => true

Benchmarks

unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while fully spec compliant. So the benchmark is tracking the performance, bundle size, and Unicode version compliance of several libraries.

Look benchmark to see how it works.

unicode-segmenter/emoji vs

Package stats

Name Unicode® ESM? Size Size (min) Size (min+gzip) Size (min+br)
unicode-segmenter/emoji 15.1.0 ✔️ 3,058 2,611 1,041 751
emoji-regex 15.1.0 (vary)* ✔️ 12,946 12,859 2,180 1,746
emojibase-regex* 15.1.0 ✖️ 17,711 16,595 2,870 2,317
emojibase-regex/emoji* 15.1.0 ✖️ 13,550 12,458 2,835 2,210
RegExp w/ u* - - 0 0 0 0

Runtime performance

The runtime performance of unicode-segmenter/emoji is enough to test the presence of emoji in a text.

It's \~3x worse than RegExp w/ u for match-all performance, but that's not a good example because that doesn't care about grapheme clusters.

You can handle emojis in between grapheme processing by unicode-segmenter/grapheme. It's a bit less performant than the dedicated emoji matchers, but it's not that worse, and actually reasonable in the real world.

Details ``` cpu: Apple M1 Pro runtime: node v20.13.1 (arm64-darwin) benchmark time (avg) (min … max) p75 p99 p999 ------------------------------------------------------------------ ----------------------------- • checking if any emoji (Extended_Pictographic) ------------------------------------------------------------------ ----------------------------- unicode-segmenter/emoji 14.2 ns/iter (13.41 ns … 213 ns) 14.26 ns 17.09 ns 36.7 ns unicode-segmenter/grapheme 88.82 ns/iter (76.01 ns … 479 ns) 94.75 ns 117 ns 282 ns RegExp w/ unicode 15.19 ns/iter (14.87 ns … 78.88 ns) 15.04 ns 18.62 ns 30.84 ns emoji-regex 40.85 ns/iter (40.53 ns … 73.79 ns) 40.67 ns 45.49 ns 53.63 ns emojibase-regex 110 ns/iter (109 ns … 163 ns) 110 ns 123 ns 142 ns summary for checking if any emoji (Extended_Pictographic) unicode-segmenter/emoji 1.07x faster than RegExp w/ unicode 2.88x faster than emoji-regex 6.25x faster than unicode-segmenter/grapheme 7.78x faster than emojibase-regex • match all emoji (Extended_Pictographic) ------------------------------------------------------------------ ----------------------------- unicode-segmenter/emoji 2'754 ns/iter (2'583 ns … 495 µs) 2'709 ns 3'000 ns 10'500 ns unicode-segmenter/grapheme 7'718 ns/iter (7'557 ns … 9'363 ns) 7'740 ns 8'729 ns 9'363 ns RegExp w/ unicode 959 ns/iter (932 ns … 1'196 ns) 967 ns 1'074 ns 1'196 ns emoji-regex 11'171 ns/iter (10'875 ns … 292 µs) 11'167 ns 12'209 ns 27'333 ns emojibase-regex 16'427 ns/iter (16'125 ns … 289 µs) 16'334 ns 17'750 ns 32'000 ns summary for match all emoji (Extended_Pictographic) unicode-segmenter/emoji 2.87x slower than RegExp w/ unicode 2.8x faster than unicode-segmenter/grapheme 4.06x faster than emoji-regex 5.96x faster than emojibase-regex ```

unicode-segmenter/general vs

Package stats

Name Unicode® ESM? Size Size (min) Size (min+gzip) Size (min+br)
unicode-segmenter/general 15.1.0 ✔️ 21,505 20,972 5,792 3,564
XRegExp 14.0.0 ✖️ ️ 383,156 194,202 62,986 39,871
RegExp w/ u* - - 0 0 0 0

Runtime performance

Depending on your usage, unicode-segmenter/general may be slightly faster than RegExp w/ u and suitable for more advanced use cases.

Details ``` cpu: Apple M1 Pro runtime: node v20.13.1 (arm64-darwin) benchmark time (avg) (min … max) p75 p99 p999 ----------------------------------------------------------------- ----------------------------- • checking any alphanumeric ----------------------------------------------------------------- ----------------------------- unicode-segmenter/general 212 ns/iter (206 ns … 413 ns) 215 ns 231 ns 272 ns XRegExp 243 ns/iter (237 ns … 394 ns) 245 ns 293 ns 320 ns RegExp w/ unicode 235 ns/iter (232 ns … 398 ns) 236 ns 259 ns 335 ns summary for checking any alphanumeric unicode-segmenter/general 1.11x faster than RegExp w/ unicode 1.15x faster than XRegExp • match all alphanumeric ----------------------------------------------------------------- ----------------------------- unicode-segmenter/general 340 ns/iter (330 ns … 992 ns) 351 ns 389 ns 992 ns XRegExp 1'928 ns/iter (1'900 ns … 2'014 ns) 1'938 ns 2'007 ns 2'014 ns RegExp w/ unicode 431 ns/iter (420 ns … 513 ns) 441 ns 488 ns 513 ns summary for match all alphanumeric unicode-segmenter/general 1.27x faster than RegExp w/ unicode 5.67x faster than XRegExp ```

unicode-segmenter/grapheme vs

Package stats

Name Unicode® ESM? Size Size (min) Size (min+gzip) Size (min+br)
unicode-segmenter/grapheme 15.1.0 ✔️ 33,307 29,712 9,364 5,675
graphemer 15.0.0 ✖️ ️ 410,424 95,104 15,752 10,660
grapheme-splitter 10.0.0 ✖️ 122,241 23,680 7,852 4,841
unicode-segmentation* 15.0.0 ✔️ 51,251 51,251 22,545 16,614
@formatjs/intl-segmenter* 15.0.0 ✖️ 492,803 319,109 54,346 34,365
Intl.Segmenter* - - 0 0 0 0

Runtime performance

unicode-segmenter/grapheme is 7\~18x faster than other JS alternatives, 3\~8x faster than native Intl.Segmenter), and 1.5\~3x faster than WASM build of the Rust unicode-segmentation library.

The gap may increase depending on the environment. Bindings for browsers generally appear to perform worse. In most environments, unicode-segmenter/grapheme is over 6x faster than graphemer.

Details ``` cpu: Apple M1 Pro runtime: node v20.13.1 (arm64-darwin) benchmark time (avg) (min … max) p75 p99 p999 ----------------------------------------------------------------------------------- ----------------------------- • Lorem ipsum (ascii) ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 5'668 ns/iter (5'332 ns … 6'582 ns) 5'778 ns 6'326 ns 6'582 ns Intl.Segmenter 51'811 ns/iter (47'208 ns … 524 µs) 51'917 ns 61'708 ns 436 µs graphemer 49'103 ns/iter (46'583 ns … 280 µs) 48'625 ns 101 µs 182 µs grapheme-splitter 123 µs/iter (117 µs … 1'066 µs) 122 µs 171 µs 816 µs unicode-rs/unicode-segmentation (wasm-pack) 16'935 ns/iter (15'542 ns … 274 µs) 16'542 ns 30'084 ns 130 µs @formatjs/intl-segmenter 42'689 ns/iter (38'792 ns … 941 µs) 41'875 ns 106 µs 216 µs summary for Lorem ipsum (ascii) unicode-segmenter/grapheme 2.99x faster than unicode-rs/unicode-segmentation (wasm-pack) 7.53x faster than @formatjs/intl-segmenter 8.66x faster than graphemer 9.14x faster than Intl.Segmenter 21.63x faster than grapheme-splitter • Emojis ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 1'717 ns/iter (1'656 ns … 1'941 ns) 1'727 ns 1'939 ns 1'941 ns Intl.Segmenter 14'715 ns/iter (12'334 ns … 1'301 µs) 13'792 ns 20'000 ns 820 µs graphemer 13'752 ns/iter (12'625 ns … 1'385 µs) 13'583 ns 22'875 ns 136 µs grapheme-splitter 27'406 ns/iter (26'625 ns … 427 µs) 26'958 ns 32'333 ns 69'042 ns unicode-rs/unicode-segmentation (wasm-pack) 5'728 ns/iter (5'497 ns … 12'383 ns) 5'711 ns 6'953 ns 12'383 ns @formatjs/intl-segmenter 14'579 ns/iter (13'541 ns … 377 µs) 14'541 ns 19'583 ns 166 µs summary for Emojis unicode-segmenter/grapheme 3.34x faster than unicode-rs/unicode-segmentation (wasm-pack) 8.01x faster than graphemer 8.49x faster than @formatjs/intl-segmenter 8.57x faster than Intl.Segmenter 15.96x faster than grapheme-splitter • Demonic characters ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 1'699 ns/iter (1'636 ns … 1'986 ns) 1'719 ns 1'891 ns 1'986 ns Intl.Segmenter 5'088 ns/iter (3'501 ns … 9'109 ns) 7'867 ns 9'083 ns 9'109 ns graphemer 27'386 ns/iter (26'333 ns … 332 µs) 26'958 ns 30'333 ns 161 µs grapheme-splitter 19'959 ns/iter (18'958 ns … 380 µs) 19'500 ns 24'333 ns 247 µs unicode-rs/unicode-segmentation (wasm-pack) 2'518 ns/iter (2'444 ns … 4'894 ns) 2'534 ns 2'839 ns 4'894 ns @formatjs/intl-segmenter 17'272 ns/iter (16'708 ns … 231 µs) 17'375 ns 18'541 ns 39'000 ns summary for Demonic characters unicode-segmenter/grapheme 1.48x faster than unicode-rs/unicode-segmentation (wasm-pack) 2.99x faster than Intl.Segmenter 10.16x faster than @formatjs/intl-segmenter 11.74x faster than grapheme-splitter 16.11x faster than graphemer • Tweet text (combined) ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 8'025 ns/iter (7'867 ns … 8'619 ns) 8'168 ns 8'614 ns 8'619 ns Intl.Segmenter 70'021 ns/iter (63'667 ns … 562 µs) 69'875 ns 79'458 ns 519 µs graphemer 69'922 ns/iter (66'583 ns … 320 µs) 69'708 ns 92'875 ns 271 µs grapheme-splitter 152 µs/iter (147 µs … 467 µs) 153 µs 165 µs 429 µs unicode-rs/unicode-segmentation (wasm-pack) 24'428 ns/iter (23'583 ns … 302 µs) 24'084 ns 27'334 ns 157 µs @formatjs/intl-segmenter 64'112 ns/iter (61'333 ns … 338 µs) 63'083 ns 88'625 ns 272 µs summary for Tweet text (combined) unicode-segmenter/grapheme 3.04x faster than unicode-rs/unicode-segmentation (wasm-pack) 7.99x faster than @formatjs/intl-segmenter 8.71x faster than graphemer 8.72x faster than Intl.Segmenter 18.91x faster than grapheme-splitter • Code snippet (combined) ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 19'661 ns/iter (18'125 ns … 350 µs) 19'458 ns 24'708 ns 185 µs Intl.Segmenter 158 µs/iter (148 µs … 443 µs) 158 µs 323 µs 428 µs graphemer 163 µs/iter (159 µs … 401 µs) 161 µs 284 µs 390 µs grapheme-splitter 350 µs/iter (343 µs … 712 µs) 348 µs 424 µs 705 µs unicode-rs/unicode-segmentation (wasm-pack) 57'376 ns/iter (55'917 ns … 300 µs) 56'667 ns 67'959 ns 209 µs @formatjs/intl-segmenter 150 µs/iter (142 µs … 579 µs) 150 µs 310 µs 475 µs summary for Code snippet (combined) unicode-segmenter/grapheme 2.92x faster than unicode-rs/unicode-segmentation (wasm-pack) 7.65x faster than @formatjs/intl-segmenter 8.03x faster than Intl.Segmenter 8.3x faster than graphemer 17.79x faster than grapheme-splitter ```

LICENSE

MIT

[!NOTE] The initial implementation was ported manually from Rust's unicode-segmentation library, which is licenced under the MIT license.