unicode-segmenter

A lightweight and fast, pure JavaScript library for Unicode segmentation.

Features

unicode-segmenter includes utilities to deal with:

Emojis and pictographic ⤵
Extended grapheme clusters ⤵
Non-Latin alphabets and numbers ⤵
UTF-8 characters and UTF-16 surrogates ⤵
Intl.Segmenter Polyfill ⤵

With no dependencies, so you can use it even in places where built-in Unicode libraries aren't available, such as old browsers, edge runtimes, and embedded environments.

Unicode® version

Unicode® 15.1.0 Standard Annex #29 Revision 43 (2023-08-16)

Runtime compatibility

unicode-segmenter uses most basic ES6+ features like generators, modules and String.prototype.codePointAt().

Those are available in lightweight JS runtimes like QuickJS as well as (not very) modern browsers. You can still use the library even in IE11 by transpiling/polyfilling them using Babel, regenerator, etc.

Usage

Using TypeScript

No worry. The project is fully type-checked, and provides *.d.ts for you 😉

Export `unicode-segmenter/emoji`

Utilities for matching emoji-like characters

Example: Use Unicode emoji property matches

import {
  isEmojiPresentation,    // match \p{Emoji_Presentation}
  isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';

isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false

isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true

Export `unicode-segmenter/general`

Utilities for matching alphanumeric characters

Example: Use Unicode general property matchers

import {
  isLetter,       // match \p{L}
  isNumeric,      // match \p{N}
  isAlphabetic,   // match \p{Alphabetic}
  isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';

Export `unicode-segmenter/grapheme`

Utilities for text segmentation by extended grapheme cluster rules

Example: Count graphemes

import { countGrapheme } from 'unicode-segmenter/grapheme';

'👋 안녕!'.length;
// => 6
countGrapheme('👋 안녕!');
// => 5

'a̐éö̲'.length;
// => 7
countGrapheme('a̐éö̲');
// => 3

Example: Get grapheme segments

import { graphemeSegments } from 'unicode-segmenter/grapheme';

[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }

Example: Build an advanced grapheme matcher

graphemeSegments() exposes some knowledge identified in the middle of the process to support some useful cases.

For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer the applied boundary rule.

import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

function* matchEmoji(str) {
  for (const { segment, _catBegin } of graphemeSegments(input)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}

[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍

Export `unicode-segmenter/intl-adapter`

Intl.Segmenter API adapter (only granularity: "grapheme" available yet)

import { Segmenter } from 'unicode-segmenter/intl-adapter';

// Same API with the `Intl.Segmenter`
const segmenter = new Segmenter();

Export `unicode-segmenter/intl-polyfill`

Intl.Segmenter API polyfill (only granularity: "grapheme" available yet)

// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';

const segmenter = new Intl.Segmenter();

Export `unicode-segmenter/utils`

You can access some internal utilities to deal with UTF-8 in the JavaScript

Example: Handle UTF-16 surrogate pairs

import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';

const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);

if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo);
  // => equivalent to u32.codePointAt(0)
}

Example: Determine the length of a character

import { isBMP } from 'unicode-segmenter/utils';

const char = '😍'; // .length = 2
const cp = char.codePointAt(0);

char.length === isBMP(cp) ? 1 : 2;
// => true

Benchmarks

unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while fully spec compliant. So the benchmark is tracking the performance, bundle size, and Unicode version compliance of several libraries.

Look benchmark to see how it works.

`unicode-segmenter/emoji` vs

built-in Unicode RegExp
emoji-regex@10.3.0 (101M+ weekly downloads on NPM)
emojibase-regex@15.3.2 (192K+ weekly downloads on NPM)

Package stats

Name	Unicode®	ESM?	Size	Size (min)	Size (min+gzip)	Size (min+br)
`unicode-segmenter/emoji`	15.1.0	✔️	3,058	2,611	1,041	751
`emoji-regex`	15.1.0 (vary)*	✔️	12,946	12,859	2,180	1,746
`emojibase-regex`*	15.1.0	✖️	17,711	16,595	2,870	2,317
`emojibase-regex/emoji`*	15.1.0	✖️	13,550	12,458	2,835	2,210
`RegExp` w/ `u`*	-	-	0	0	0	0

You can build your own emoji-regex using emoji-test-regex-pattern.
emojibase-regex matches Extended_Pictographic property.
emojibase-regex/emoji matches only Emoji_Presentation property.
RegExp Unicode data is always kept up to date as the runtime support.
RegExp Unicode may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

The runtime performance of unicode-segmenter/emoji is enough to test the presence of emoji in a text.

It's \~3x worse than RegExp w/ u for match-all performance, but that's not a good example because that doesn't care about grapheme clusters.

You can handle emojis in between grapheme processing by unicode-segmenter/grapheme. It's a bit less performant than the dedicated emoji matchers, but it's not that worse, and actually reasonable in the real world.

Details

`unicode-segmenter/general` vs

built-in unicode RegExp
XRegExp@5.1.1 (2.8M+ weekly downloads on NPM)

Package stats

Name	Unicode®	ESM?	Size	Size (min)	Size (min+gzip)	Size (min+br)
`unicode-segmenter/general`	15.1.0	✔️	21,505	20,972	5,792	3,564
`XRegExp`	14.0.0	✖️ ️	383,156	194,202	62,986	39,871
`RegExp` w/ `u`*	-	-	0	0	0	0

RegExp Unicode data is always kept up to date as the runtime support.
RegExp Unicode may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

Depending on your usage, unicode-segmenter/general may be slightly faster than RegExp w/ u and suitable for more advanced use cases.

Details

`unicode-segmenter/grapheme` vs

Node.js' Intl.Segmenter (browser's version may vary)
graphemer@1.4.0 (16.6M+ weekly downloads on NPM)
grapheme-splitter@1.0.4 (5.7M+ weekly downloads on NPM)
@formatjs/intl-segmenter@11.5.7 (5.4K+ weekly downloads on NPM)
WebAssembly build of the Rust unicode-segmentation library

Package stats

Name	Unicode®	ESM?	Size	Size (min)	Size (min+gzip)	Size (min+br)
`unicode-segmenter/grapheme`	15.1.0	✔️	33,307	29,712	9,364	5,675
`graphemer`	15.0.0	✖️ ️	410,424	95,104	15,752	10,660
`grapheme-splitter`	10.0.0	✖️	122,241	23,680	7,852	4,841
`unicode-segmentation`*	15.0.0	✔️	51,251	51,251	22,545	16,614
`@formatjs/intl-segmenter`*	15.0.0	✖️	492,803	319,109	54,346	34,365
`Intl.Segmenter`*	-	-	0	0	0	0

unicode-segmentation size contains only the minimum WASM binary. It will be larger by adding more bindings.
@formatjs/intl-segmenter handles grapheme, word, sentence, but it's not tree-shakable.
Intl.Segmenter's Unicode data is always kept up to date as the runtime support.
Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

Runtime performance

unicode-segmenter/grapheme is 7\~18x faster than other JS alternatives, 3\~8x faster than native Intl.Segmenter), and 1.5\~3x faster than WASM build of the Rust unicode-segmentation library.

The gap may increase depending on the environment. Bindings for browsers generally appear to perform worse. In most environments, unicode-segmenter/grapheme is over 6x faster than graphemer.

Details

``` cpu: Apple M1 Pro runtime: node v20.13.1 (arm64-darwin) benchmark time (avg) (min … max) p75 p99 p999 ----------------------------------------------------------------------------------- ----------------------------- • Lorem ipsum (ascii) ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 5'668 ns/iter (5'332 ns … 6'582 ns) 5'778 ns 6'326 ns 6'582 ns Intl.Segmenter 51'811 ns/iter (47'208 ns … 524 µs) 51'917 ns 61'708 ns 436 µs graphemer 49'103 ns/iter (46'583 ns … 280 µs) 48'625 ns 101 µs 182 µs grapheme-splitter 123 µs/iter (117 µs … 1'066 µs) 122 µs 171 µs 816 µs unicode-rs/unicode-segmentation (wasm-pack) 16'935 ns/iter (15'542 ns … 274 µs) 16'542 ns 30'084 ns 130 µs @formatjs/intl-segmenter 42'689 ns/iter (38'792 ns … 941 µs) 41'875 ns 106 µs 216 µs summary for Lorem ipsum (ascii) unicode-segmenter/grapheme 2.99x faster than unicode-rs/unicode-segmentation (wasm-pack) 7.53x faster than @formatjs/intl-segmenter 8.66x faster than graphemer 9.14x faster than Intl.Segmenter 21.63x faster than grapheme-splitter • Emojis ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 1'717 ns/iter (1'656 ns … 1'941 ns) 1'727 ns 1'939 ns 1'941 ns Intl.Segmenter 14'715 ns/iter (12'334 ns … 1'301 µs) 13'792 ns 20'000 ns 820 µs graphemer 13'752 ns/iter (12'625 ns … 1'385 µs) 13'583 ns 22'875 ns 136 µs grapheme-splitter 27'406 ns/iter (26'625 ns … 427 µs) 26'958 ns 32'333 ns 69'042 ns unicode-rs/unicode-segmentation (wasm-pack) 5'728 ns/iter (5'497 ns … 12'383 ns) 5'711 ns 6'953 ns 12'383 ns @formatjs/intl-segmenter 14'579 ns/iter (13'541 ns … 377 µs) 14'541 ns 19'583 ns 166 µs summary for Emojis unicode-segmenter/grapheme 3.34x faster than unicode-rs/unicode-segmentation (wasm-pack) 8.01x faster than graphemer 8.49x faster than @formatjs/intl-segmenter 8.57x faster than Intl.Segmenter 15.96x faster than grapheme-splitter • Demonic characters ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 1'699 ns/iter (1'636 ns … 1'986 ns) 1'719 ns 1'891 ns 1'986 ns Intl.Segmenter 5'088 ns/iter (3'501 ns … 9'109 ns) 7'867 ns 9'083 ns 9'109 ns graphemer 27'386 ns/iter (26'333 ns … 332 µs) 26'958 ns 30'333 ns 161 µs grapheme-splitter 19'959 ns/iter (18'958 ns … 380 µs) 19'500 ns 24'333 ns 247 µs unicode-rs/unicode-segmentation (wasm-pack) 2'518 ns/iter (2'444 ns … 4'894 ns) 2'534 ns 2'839 ns 4'894 ns @formatjs/intl-segmenter 17'272 ns/iter (16'708 ns … 231 µs) 17'375 ns 18'541 ns 39'000 ns summary for Demonic characters unicode-segmenter/grapheme 1.48x faster than unicode-rs/unicode-segmentation (wasm-pack) 2.99x faster than Intl.Segmenter 10.16x faster than @formatjs/intl-segmenter 11.74x faster than grapheme-splitter 16.11x faster than graphemer • Tweet text (combined) ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 8'025 ns/iter (7'867 ns … 8'619 ns) 8'168 ns 8'614 ns 8'619 ns Intl.Segmenter 70'021 ns/iter (63'667 ns … 562 µs) 69'875 ns 79'458 ns 519 µs graphemer 69'922 ns/iter (66'583 ns … 320 µs) 69'708 ns 92'875 ns 271 µs grapheme-splitter 152 µs/iter (147 µs … 467 µs) 153 µs 165 µs 429 µs unicode-rs/unicode-segmentation (wasm-pack) 24'428 ns/iter (23'583 ns … 302 µs) 24'084 ns 27'334 ns 157 µs @formatjs/intl-segmenter 64'112 ns/iter (61'333 ns … 338 µs) 63'083 ns 88'625 ns 272 µs summary for Tweet text (combined) unicode-segmenter/grapheme 3.04x faster than unicode-rs/unicode-segmentation (wasm-pack) 7.99x faster than @formatjs/intl-segmenter 8.71x faster than graphemer 8.72x faster than Intl.Segmenter 18.91x faster than grapheme-splitter • Code snippet (combined) ----------------------------------------------------------------------------------- ----------------------------- unicode-segmenter/grapheme 19'661 ns/iter (18'125 ns … 350 µs) 19'458 ns 24'708 ns 185 µs Intl.Segmenter 158 µs/iter (148 µs … 443 µs) 158 µs 323 µs 428 µs graphemer 163 µs/iter (159 µs … 401 µs) 161 µs 284 µs 390 µs grapheme-splitter 350 µs/iter (343 µs … 712 µs) 348 µs 424 µs 705 µs unicode-rs/unicode-segmentation (wasm-pack) 57'376 ns/iter (55'917 ns … 300 µs) 56'667 ns 67'959 ns 209 µs @formatjs/intl-segmenter 150 µs/iter (142 µs … 579 µs) 150 µs 310 µs 475 µs summary for Code snippet (combined) unicode-segmenter/grapheme 2.92x faster than unicode-rs/unicode-segmentation (wasm-pack) 7.65x faster than @formatjs/intl-segmenter 8.03x faster than Intl.Segmenter 8.3x faster than graphemer 17.79x faster than grapheme-splitter ```

LICENSE

MIT

[!NOTE] The initial implementation was ported manually from Rust's unicode-segmentation library, which is licenced under the MIT license.

cometkim / unicode-segmenter

readme

unicode-segmenter

Features

Unicode® version

Runtime compatibility

Usage

Using TypeScript

Export `unicode-segmenter/emoji`

Example: Use Unicode emoji property matches

Export `unicode-segmenter/general`

Example: Use Unicode general property matchers

Export `unicode-segmenter/grapheme`

Example: Count graphemes

Example: Get grapheme segments

Example: Build an advanced grapheme matcher

Export `unicode-segmenter/intl-adapter`

Export `unicode-segmenter/intl-polyfill`

Export `unicode-segmenter/utils`

Example: Handle UTF-16 surrogate pairs

Example: Determine the length of a character

Benchmarks

`unicode-segmenter/emoji` vs

Package stats

Runtime performance

`unicode-segmenter/general` vs

Package stats

Runtime performance

`unicode-segmenter/grapheme` vs

Package stats

Runtime performance

LICENSE

cometkim / unicode-segmenter

readme

unicode-segmenter

Features

Unicode® version

Runtime compatibility

Usage

Using TypeScript

Export unicode-segmenter/emoji

Example: Use Unicode emoji property matches

Export unicode-segmenter/general

Example: Use Unicode general property matchers

Export unicode-segmenter/grapheme

Example: Count graphemes

Example: Get grapheme segments

Example: Build an advanced grapheme matcher

Export unicode-segmenter/intl-adapter

Export unicode-segmenter/intl-polyfill

Export unicode-segmenter/utils

Example: Handle UTF-16 surrogate pairs

Example: Determine the length of a character

Benchmarks

unicode-segmenter/emoji vs

Package stats

Runtime performance

unicode-segmenter/general vs

Package stats

Runtime performance

unicode-segmenter/grapheme vs

Package stats

Runtime performance

LICENSE

Export `unicode-segmenter/emoji`

Export `unicode-segmenter/general`

Export `unicode-segmenter/grapheme`

Export `unicode-segmenter/intl-adapter`

Export `unicode-segmenter/intl-polyfill`

Export `unicode-segmenter/utils`

`unicode-segmenter/emoji` vs

`unicode-segmenter/general` vs

`unicode-segmenter/grapheme` vs