`text/slugify` gives empty results for non-Latin alphabets

denoland / std

The Deno Standard Library

https://jsr.io/@std

MIT License

`text/slugify` gives empty results for non-Latin alphabets #5830

Open lionel-rowe opened 3 weeks ago

lionel-rowe commented 3 weeks ago

Describe the bug

slugify gives readable results for only a subset of languages, namely those that use the Latin alphabet (English, Spanish, Vietnamese...). For all other languages (Chinese, Arabic, Russian...), it just returns an empty string.

Steps to Reproduce

slugify('三人行，必有我师焉') // returns ""

Expected behavior

slugify to return non-Latin-alphabet words unchanged (but still stripping start/end punctuation and replacing medial punctuation with dashes).

Something like the following:

function intlFriendlySlugify(str: string) {
    return str
        .toLowerCase()
        .normalize()
        .replaceAll(/[^\p{L}\p{M}\p{N}]+/gu, '-')
        .replaceAll(/^-|-$/g, '')
}

IMO the slugify function should not handle percent encoding, instead leaving that up to the calling code (e.g. passing to URL constructor or URLSearchParams):

import { join } from "@std/path/posix";

const slug = intlFriendlySlugify('三人行，必有我师焉') // "三人行-必有我师焉"
const origin = 'https://example.com'
const url1 = new URL(origin)
url1.pathname = join('blog', slug)
const url2 = new URL(origin)
url2.searchParams.set('q', slug)
const url3 = new URL(origin)
url3.hostname = `${slug}.example.com`

While it's true they're unreadable in the percent-encoded/punycode form, modern browsers display the human-readable form automatically on hover, or in the address bar when you open them. At some point it might also be worth adding a prettifyUrl function to allow similar conversion in userland, but I'm leaving that out of scope for this issue.
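To make the percent-encoding point concrete, here is a minimal sketch that uses only the standard `URL` API. The slug and the `example.com` origin come from the snippet above; the variable names are illustrative and nothing here is part of `@std/text`.

```typescript
// The slug from the example above, assigned through the standard URL API.
const exampleSlug = "三人行-必有我师焉";

const pathUrl = new URL("https://example.com");
pathUrl.pathname = `/blog/${exampleSlug}`;

// The serialized path is pure ASCII: every UTF-8 byte of a non-ASCII
// character is written as a %XX escape.
const encodedPath = pathUrl.pathname;

// Decoding recovers the readable form, which is roughly what a browser
// does when it displays the URL.
const decodedPath = decodeURIComponent(encodedPath);

console.log(encodedPath);
console.log(decodedPath);
```

The calling code never has to encode anything by hand; the slug stays readable in application code and the URL object takes care of the wire format.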

Environment

OS: Ubuntu 20.04, WSL
deno version: 1.46.0
std version: text@1.0.4

lionel-rowe commented 2 weeks ago

Here's a rundown of how various platforms handle non-ASCII text in slugs:

Site	Diacritics	Non-Latin	Notes
`stackoverflow.com`	cartão de credito → `cartão-de-credito` Unchanged	Python で特定の文字 → `python-で特定の文字` Unchanged
`wikipedia.org`	Maria Angélica Beraldo → `Maria_Angélica_Beraldo` Unchanged	佩通坦·钦那瓦 → `佩通坦·钦那瓦` Unchanged	Wikipedia uses `_` instead of `-` for spaces, but it's still a slug of sorts
`tumblr.com`	wüst → `wüst` Unchanged	Word of the Day: 久违 (Chinese) → `word-of-the-day-久违-chinese` Unchanged
`wordpress.org`	Actualización de mantenimiento → `actualizacion-de-mantenimiento` Stripped	WordPress 6.6「Dorsey」发布 → `wordpress-6-6dorsey发布` Unchanged
`github.com`	Introducción → `introducción` Unchanged	7. 兩岸詞典 /c/ → `7-兩岸詞典-c` Unchanged	Slugified section titles in README files in URL hash
`medium.com`	Cómo salirse → `como-salirse` Stripped	跟我学中文! → `跟我学中文` Unchanged
`dev.to`	2 años como Front-End Developer → `2-anos-como-front-end-developer` Stripped	データ・ストリーミング技術の概要 → `detasutoriminguji-shu-nogai-yao` Transliterated	This illustrates some of the problems with stripping diacritics and transliterating — 2 años means "2 years" whereas `2-anos` means `2-anuses`; meanwhile 技術 should be gijutsu in Japanese, not `ji-shu`

The transliteration option has one big advantage, namely that the URL remains legible in any context: plaintext files, IM platforms with limited rich-text features, etc. It also typically leads to shorter URLs compared to the percent-encoded version. Still, it's strictly worse when viewed in a browser address bar, adds a massive amount of complexity, including mappings for thousands of CJK characters, and often still leads to suboptimal results (as seen in the dev.to examples). That's probably why only dev.to uses it out of the 7 platforms I looked at.

As for diacritics, 3 of the 7 platforms strip them from Latin-script text, while the other 4 keep them. As with transliteration, stripping leads to more plaintext-friendly URLs; however, diacritics can be semantically important, also illustrated by the dev.to example.

timreichen commented 2 weeks ago

When slugify was first implemented, the discussion was about porting npm:slugify, so slugify('三人行，必有我师焉') === "" is actually expected.

The behavior you describe is probably better handled with @std/text/to-kebab-case

import { toKebabCase } from "@std/text/to-kebab-case";
console.log(toKebabCase("三人行-必有我师焉")); // "三人行-必有我师焉"

lionel-rowe commented 2 weeks ago

@timreichen In the state it was merged, that PR isn't a port of npm:slugify, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have slugify be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? npm:slugify has multiple open issues about its lack of support for various languages.

And if toKebabCase already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate slugify that only works for a subset of use cases?

timreichen commented 2 weeks ago

> @timreichen In the state it was merged, that PR isn't a port of npm:slugify, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have slugify be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? npm:slugify has multiple open issues about its lack of support for various languages.

We removed the char mapping because the list was random. The problem is as you pointed out that there is no standard and the slugify functionality varies depending on the implementation.

> And if toKebabCase already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate slugify that only works for a subset of use cases?

Every slugify function will only work for a subset of use cases. That is why, for example, npm:slugify has so many options and lets you add custom replacements etc. I think if there is a clean way to support other languages, that should be added. However, the slug must match [a-zA-Z0-9-]*.

lionel-rowe commented 2 weeks ago

> However, the slug must match [a-zA-Z0-9-]*.

Why? Again, massive platforms like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr don't obey that rule, and browsers and web APIs handle non-ASCII URL components perfectly fine. Allowing "" as a slug is far more risky sanitization-wise than allowing arbitrary non-ASCII text, because the path /a//b normalizes to /a/b (and additionally, /a/ is often normalized to /a). Meanwhile, non-ASCII text can never clash with reserved characters, which always fall within the printable ASCII range.
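A small sketch of the empty-slug risk described above. Both helpers are hypothetical: `collapseSlashes` stands in for a server or router that merges repeated slashes, and `userPostsPath` is an invented route shape used only for illustration.

```typescript
// Hypothetical stand-in for a server/router that collapses repeated slashes.
function collapseSlashes(path: string): string {
  return path.replace(/\/{2,}/g, "/");
}

// Hypothetical route builder: /users/<userSlug>/posts
function userPostsPath(userSlug: string): string {
  return `/users/${userSlug}/posts`;
}

// An empty slug silently turns into a different route after normalization.
const emptySlugPath = collapseSlashes(userPostsPath(""));

// A non-ASCII slug keeps the path structure intact.
const nonAsciiPath = collapseSlashes(userPostsPath("三人行"));

// The reserved characters of RFC 3986 are all ASCII, so non-ASCII text
// cannot be mistaken for a delimiter.
const reservedChars = ":/?#[]@!$&'()*+,;=";
const reservedAreAscii = Array.from(reservedChars).every(
  (ch) => ch.charCodeAt(0) < 0x80,
);

console.log(emptySlugPath, nonAsciiPath, reservedAreAscii);
```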

lowlighter commented 2 weeks ago

Maybe we could just change the signature of slugify so users can provide their own strip regex?

function slugify(input: string, strip = /[^a-zA-Z0-9\s-]/g): string

This way it doesn't really add much more complexity, while offering a bit more liberty to end users (who would know best which charset they'd like to support).

slugify("déjà-vu", /[^a-zA-Z0-9\sÀ-ÖØ-öø-ÿ-]/g) // "déjà-vu" (literal hyphen moved to the end so that À-Ö parses as a range)

lionel-rowe commented 2 weeks ago

@lowlighter That seems to me like it's simultaneously too granular and not customizable enough. Too granular because I can't see any good reason why you'd want to allow some non-ASCII but not others; not customizable enough because it still doesn't provide any way of mapping.

Something like this could work:

// slugify.ts

export type SlugifyOptions = {
    /** @default {undefined} */
    charMap: Record<string, string> | undefined,
    /** @default {Boolean(options.charMap)} */
    stripUnknown: boolean,
    /** @default {Boolean(options.charMap || options.stripUnknown)} */
    stripDiacritics: boolean,
}

export function slugify(input: string, options?: Partial<SlugifyOptions>): string

// slugify_char_map.ts

// A comprehensive char mapping (transliteration) from some decently authoritative source
export const charMap = {
    // ...
    я: "ya",
    // ...
    鼎: "ding",
    // ...
}

If you really want to opt-in to the "nuke everything other than Basic Latin" option for some reason, you could still do that with slugify(..., { stripUnknown: true }) or slugify(..., { charMap: {} }).

As for "decently authoritative source" for the char map, I'm not sure what that would be. https://unicode-org.github.io/icu/userguide/transforms/general/ provides some notes on transliteration, which suggest that a simple charMap isn't really sufficient, but it looks like implementing proper transliteration is pretty complicated, so a char map could end up being the least-worst option (other than the actual least-worst option, which is just relying on percent-encoding to do its thing 😜)
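For illustration, here is one way the proposed options could resolve. This is a hypothetical sketch, not the `@std/text` implementation: `slugifySketch` is an invented name, the defaults follow the JSDoc comments in the type above, and the two-entry char map reuses the examples from this comment.

```typescript
type SlugifyOptions = {
  charMap: Record<string, string> | undefined;
  stripUnknown: boolean;
  stripDiacritics: boolean;
};

// Hypothetical sketch only; not the @std/text implementation.
function slugifySketch(
  input: string,
  options: Partial<SlugifyOptions> = {},
): string {
  const charMap = options.charMap;
  // Defaults mirror the JSDoc comments in the proposed type.
  const stripUnknown = options.stripUnknown ?? Boolean(charMap);
  const stripDiacritics = options.stripDiacritics ??
    Boolean(charMap || stripUnknown);

  let s = input.trim().toLowerCase().normalize("NFC");

  if (charMap) {
    // Map code point by code point; unmapped characters pass through here
    // and are dealt with by the later steps.
    s = Array.from(
      s,
      (ch) =>
        Object.prototype.hasOwnProperty.call(charMap, ch) ? charMap[ch] : ch,
    ).join("").toLowerCase();
  }
  if (stripDiacritics) {
    // Decompose, then drop combining marks.
    s = s.normalize("NFD").replace(/\p{M}+/gu, "");
  }
  // Either keep only Basic Latin letters and digits, or keep any
  // letter, mark, or number; everything else becomes a separator.
  const separators = stripUnknown ? /[^a-z0-9]+/g : /[^\p{L}\p{M}\p{N}]+/gu;
  return s.replace(separators, "-").replace(/^-+|-+$/g, "").normalize("NFC");
}

const tinyCharMap = { "я": "ya", "鼎": "ding" };
console.log(slugifySketch("三人行，必有我师焉"));
console.log(slugifySketch("Я 鼎", { charMap: tinyCharMap }));
console.log(slugifySketch("三人行", { charMap: {} }));
```

With no options the text passes through; passing an empty `charMap` or `stripUnknown: true` gives the "Basic Latin only" behavior described above.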

iuioiua commented 2 weeks ago

I'm happy to defer to the consensus of others with this issue, but if we go down the route of having a character map, best it be a Map<string, string>.

lionel-rowe commented 1 week ago

Looks like the requisite ICU data for transliteration is here: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/translit/. With some truly disgusting regex-based """parsing""" of the data files, Intl.Segmenter-based word segmentation, and some gibberish strings of words in various languages, I can get decent-ish results (this uses ~500kb of very un-optimized mapping data for the char map):
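The word-segmentation step mentioned above can be sketched with the built-in `Intl.Segmenter`. The helper name `segmentWords` is made up for this example, and the ICU-derived mapping data is not included. Segmentation of scripts without spaces (Chinese, Japanese, Thai) depends on the runtime's ICU data, so no output is shown for those.

```typescript
// Split text into word-like segments using the built-in Intl.Segmenter.
// Punctuation and whitespace segments are skipped via `isWordLike`.
function segmentWords(text: string, locale: string): string[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words: string[] = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Space-delimited text splits on the obvious boundaries.
console.log(segmentWords("Ging loslassen Steuerzeichen in auf übersetzt.", "de"));

// For Chinese, the boundaries come from the runtime's dictionary data.
console.log(segmentWords("解决始终这些统一", "zh"));
```

Each segment could then be transliterated separately and the results joined with `-`, which is how spacing like `jiejue-shizhong` can arise from unspaced input.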

[vi]
Bằng khác byte phần bảng ký sun hợp của tự.
bang-khac-byte-phan-bang-ky-sun-hop-cua-tu

[zh]
解决始终这些统一，大部分，既成事实节字体编码的。
jiejue-shizhong-zhexie-tongyi-dabufen-chengshishi-jie-ziti-bianma-de

[de]
Ging loslassen Steuerzeichen in auf übersetzt, Phaistos Pau und Ugaritisch.
ging-loslassen-steuerzeichen-in-auf-ubersetzt-phaistos-pau-und-ugaritisch

[es]
Modificaciones equivalencia que esquemas alquímicos ha bits vez de una.
modificaciones-equivalencia-que-esquemas-alquimicos-ha-bits-vez-de-una

[ru]
Бит данных частично характерный текст, на в с клавиатуры из.
bit-dannyx-chastichno-xarakternyi-tekst-na-v-s-klaviatury-iz

[ar]
رمز والأسلوب العالم، متناسقة يتكون الراغبون، التفريق كترميزات في النصوص.
rmz-walaslwb-alalm-mtnasqh-ytkwn-alraghbwn-altfryq-ktrmyzat-fy-alnsws

[ja]
韓国利用運用文字のをれドキュメントマッピングと。
hanguo-liyong-yunyong-wenzi-no-o-re-tokyumentomahin-ku-to

[el]
Υπολογιστή ωστόσο αφήνει όλες βασίζονται προβλήματα στο κωδικοποίησης την για.
ypologiste-ostoso-apheni-oles-basizondai-problemata-sto-kodikopoieses-ten-gia

Limitations: Japanese will always give bad results for Kanji; Arabic lacks most vowels (I think that's due to the vowels not being indicated in the first place, so no way round that); the Greek is currently based on Ancient Greek transliteration, but I think that can be fixed by ignoring certain input files.

lionel-rowe commented 1 week ago

OK, switched to using a custom peggy grammar to parse the ICU mappings, and I'm now getting what seem to be decent results for all the languages I've tested (as good as or better than results from other general-purpose transliteration libraries I've looked at).

Comparison:

Language	Example	@std/slugify@next (with `charMap` option)	npm:slugify	npm:transliterate
Amharic	በርበሬን ከላመ ከሞተ አግኝተሽው ዋጥ ስልቅጥ አድርገሽ ከምኔው ጨረሽው	beribereni-kelame-kemote-ginyiteshiwi-wati-silikiti-dirigeshi-keminewi-chereshiwi		barebareene-kalaama-kamota-agenyetashewe-waathe-seleqethe-aderegashe-kameneewe-charashewe
Arabic	الحركة الدولية للدفاع عن الأطفال الفلسطينين ضد بايدن	alhrkh-aldwlyh-lldfa-n-alatfal-alflstynyn-zd-baydn	alhrkh-aldwlyh-lldfaa-an-alatfal-alflstynyn-dhd-baydn	lhrk-ldwly-lldfaa-aan-l-tfl-lflstynyn-dd-bydn
German	Leichtathletik-Weltmeisterschaften 2007/Teilnehmer (Liechtenstein)	leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein	Leichtathletik-Weltmeisterschaften-2007Teilnehmer-(Liechtenstein)	leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein
Greek	Βραβείο Καλύτερου Διευθυντή Φωτογραφίας της Ένωσης Διαδικτυακών Κριτικών Κινηματογράφου	vravio-kaliterou-dhievthindi-fotografias-tis-enosis-dhiadhiktiakon-kritikon-kinimatografou	Brabeio-Kalyteroy-Diey8ynth-Fwtografias-ths-Enwshs-Diadiktyakwn-Kritikwn-Kinhmatografoy	vraveio-kalyteroy-dieythynti-fotografias-tis-enosis-diadiktyakon-kritikon-kinimatografoy
Spanish	Temporada 2018 del Campeonato Brasileño de Motovelocidade	temporada-2018-del-campeonato-brasileno-de-motovelocidade	Temporada-2018-del-Campeonato-Brasileno-de-Motovelocidade	temporada-2018-del-campeonato-brasileno-de-motovelocidade
Hindi	संयुक्त अरब अमीरात क्रिकेट टीम का स्कॉटलैंड दौरा 2016	samyaukata-araba-amairaata-karaikaeta-taima-kaa-sakaotalaaimda-daauraa-2016	2016	snyukt-arb-amiiraat-krikett-ttiim-kaa-skonttlaindd-dauraa-2016
Icelandic	Alfreð Clausen syngur lög eftir Jenna Jónsson	alfred-clausen-syngur-log-eftir-jenna-jonsson	Alfred-Clausen-syngur-log-eftir-Jenna-Jonsson	alfred-clausen-syngur-log-eftir-jenna-jonsson
Japanese	コンティニュイング・ケア・リタイアメント・コミュニティ	konte-ni-ingu-kea-ritaiamento-komyunite		konteiniyuingukearitaiamentokomiyunitei
Russian	500 величайших альбомов всех времён по версии журнала Rolling Stone	500-velichayshix-albomov-vsex-vremyon-po-versii-jurnala-rolling-stone	500-velichajshih-albomov-vseh-vremyon-po-versii-zhurnala-Rolling-Stone	500-velichayshih-albomov-vseh-vremyon-po-versii-zhurnala-rolling-stone
Thai	จังหวัดมุกดาหารในการเลือกตั้งสมาชิกสภาผู้แทนราษฎรไทยเป็นการทั่วไป พ.ศ. 2562	canghwad-mukdahar-in-kar-eluxk-tang-smachik-spha-phu-aethn-radr-ithy-epnkar-thawip-ph-s-2562	..-2562	cchanghwadmukdaahaarainkaareluue-ktangsmaachiksphaaphuuaethnraasdraithyepnkaarthawaip-ph.s.-2562
Vietnamese	Cục Phát thanh, truyền hình và thông tin điện tử (Việt Nam)	cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam	Cuc-Phat-thanh-truyen-hinh-va-thong-tin-djien-tu-(Viet-Nam)	cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam
Chinese	2020年夏季奧林匹克運動會輕艇女子500公尺單人愛斯基摩艇比賽	2020-nian-xiaji-aolinpike-yundonghui-qing-ting-nuzi-500-gongchi-danren-aisijimo-ting-bisai	2020500	2020nian-xia-ji-ao-lin-pi-ke-yun-dong-hui-qing-ting-nu-zi-500gong-chi-dan-ren-ai-si-ji-mo-ting-bi-sai

npm:slugify gives empty results for Amharic, Hindi, Thai, Chinese, and Japanese, and also has some extremely questionable choices for Greek (θ → 8, ω → w, η → h). npm:transliterate gives more concise (possibly better?) results for Hindi but gives even less vowel-ey results for Arabic, lacks spacing for Japanese/Thai (I added ZWSPs to stop the table breaking the layout), and has suboptimal spacing for Chinese.

Char map is ~213KB (un-minified, un-gzipped).

IMO those are "good enough" results at this stage (given that the default will be not to transliterate), but it'd be good to get some input from speakers of a few more of these languages. You can also try it out with other languages here: https://dash.deno.com/playground/slugify

lionel-rowe commented 1 week ago

Upon testing more languages (list taken from npm:any-ascii's examples), we're still missing Braille (which I think is safe to omit, as web content written in Braille must be vanishingly rare; someone please correct me if I'm wrong) and at least 3 South-East Asian languages (Burmese/Myanmar, Khmer, Lao). The Korean example also looks a bit sus compared to the other versions. Finally, npm:any-ascii now seems to be best-in-class for JS transliteration, at least among the ones I've found.

kt3k commented 1 week ago

While I personally found this research interesting, I think it's difficult to do these transliterations in an unopinionated way. They also seem difficult to maintain, as the maintainers are not knowledgeable about many of these languages. I'd consider the handling of non-Latin-alphabet languages to be out of scope for this API.

lionel-rowe commented 1 week ago

> I'd consider the handling of non-Latin-alphabet languages to be out of scope for this API.

@kt3k My main concern isn't that non-Latin script should have special handling, rather that it should be passed through rather than removed (and especially that it shouldn't be removed as a default option). It's worth mentioning that in its current state, slugify doesn't even handle fully-Latin-alphabet text properly — for example, various alphabetic chars like [ßĐæø] are removed (Blöße becomes bloe, Trần Hưng Đạo becomes tran-hung-ao, Nærøy becomes nry).
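The examples above can be reproduced with a short sketch. It assumes the current behavior amounts to NFD normalization followed by dropping everything outside `[a-zA-Z0-9\s-]`; `asciiOnlySlug` is a hypothetical name and the code is not copied from `@std/text`.

```typescript
// Assumption: approximates the behavior described above (NFD normalization,
// then removal of everything outside [a-zA-Z0-9\s-]). Not the library source.
function asciiOnlySlug(input: string): string {
  return input
    .trim()
    .normalize("NFD") // "ö" becomes "o" + combining diaeresis
    .replace(/[^a-zA-Z0-9\s-]/g, "") // drops the marks and all non-ASCII letters
    .replace(/\s+|-+/g, "-")
    .replace(/^-+|-+$/g, "")
    .toLowerCase();
}

// Letters with a canonical decomposition survive as their base letter,
// but ß, Đ, æ and ø have none, so they vanish entirely.
console.log(asciiOnlySlug("Blöße"));
console.log(asciiOnlySlug("Trần Hưng Đạo"));
console.log(asciiOnlySlug("Nærøy"));
```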

I only started looking into transliteration, which IMO is a less-good option compared to pass-through (not to mention significantly less common in-the-wild), as an alternative.

With all that said... I'm inclined to think you're probably right. Further, @lowlighter's suggestion of a strip regex is a useful option after all, but with suggested regexes exported from the package itself (roll-your-own is probably less useful).

With that option, you can easily implement pass-through (default), strip, strip-diacritics, or even strip-only-ascii-diacritics behavior. The regex would be run against the NFD form so it could easily deal with diacritics:

export const NON_WORD = /[^\p{L}\p{M}\p{N}\-]+/gu;
export const DIACRITICS = /[^\p{L}\p{N}\-]+/gu;
export const ASCII_DIACRITICS = /(?<=[a-zA-Z])\p{M}+|[^\p{L}\p{M}\p{N}\-]+/gu;
export const NON_ASCII = /[^0-9a-zA-Z\-]/g;

// NON_WORD
assertEquals(slugify("déjà-vu"), "déjà-vu");
assertEquals(slugify("Συστημάτων Γραφής"), "συστημάτων-γραφής");

assertEquals(slugify("déjà-vu", { strip: DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: DIACRITICS }), "συστηματων-γραφης");

assertEquals(slugify("déjà-vu", { strip: ASCII_DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: ASCII_DIACRITICS }), "συστημάτων-γραφής");

assertEquals(slugify("déjà-vu", { strip: NON_ASCII }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: NON_ASCII }), "-");

Further, you could easily use a third-party transliteration library along with strip: NON_ASCII:

import transliterate from 'npm:any-ascii'

assertEquals(slugify(transliterate("Συστημάτων Γραφής"), { strip: NON_ASCII }), "systimaton-grafis");
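One possible shape for a `strip`-based implementation that satisfies the assertions above is sketched below. It is a guess at the step order, not the code that was eventually merged: `slugifyWithStrip` is a hypothetical name, and punctuation is simply removed here, which the assertions do not pin down.

```typescript
// The four suggested regexes, copied from the proposal above.
const NON_WORD = /[^\p{L}\p{M}\p{N}\-]+/gu;
const DIACRITICS = /[^\p{L}\p{N}\-]+/gu;
const ASCII_DIACRITICS = /(?<=[a-zA-Z])\p{M}+|[^\p{L}\p{M}\p{N}\-]+/gu;
const NON_ASCII = /[^0-9a-zA-Z\-]/g;

// Hypothetical sketch: whitespace becomes "-" first, then `strip` is applied
// to the NFD form so combining marks can be matched on their own.
// `strip` is expected to carry the global flag.
function slugifyWithStrip(
  input: string,
  options: { strip?: RegExp } = {},
): string {
  const strip = options.strip ?? NON_WORD;
  return input
    .trim()
    .normalize("NFD")
    .replace(/\s+/g, "-")
    .replace(strip, "")
    .replace(/-{2,}/g, "-")
    .normalize("NFC") // recompose whatever marks were kept
    .toLowerCase();
}

console.log(slugifyWithStrip("déjà-vu"));
console.log(slugifyWithStrip("déjà-vu", { strip: DIACRITICS }));
console.log(slugifyWithStrip("Συστημάτων Γραφής", { strip: ASCII_DIACRITICS }));
console.log(slugifyWithStrip("Συστημάτων Γραφής", { strip: NON_ASCII }));
```

Running `strip` against the decomposed form is what lets `DIACRITICS` and `ASCII_DIACRITICS` remove accents without touching the base letters.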

kt3k commented 1 week ago

Ah ok. strip option sounds good to me. Looks like a balanced solution between added complexity and practicality.