denoland / std

The Deno Standard Library
https://jsr.io/@std
MIT License
3k stars 602 forks source link

`text/slugify` gives empty results for non-Latin alphabets #5830

Open lionel-rowe opened 3 weeks ago

lionel-rowe commented 3 weeks ago

Describe the bug

slugify gives readable results for only a subset of languages, namely those that use the Latin alphabet (English, Spanish, Vietnamese...). For all other languages (Chinese, Arabic, Russian...), it just returns an empty string.

Steps to Reproduce

slugify('三人行,必有我师焉') // returns ""

Expected behavior

slugify to return non-Latin-alphabet words unchanged (but still stripping start/end punctuation and replacing medial punctuation with dashes).

Something like the following:

function intlFriendlySlugify(str: string) {
    return str
        .toLowerCase()
        .normalize()
        .replaceAll(/[^\p{L}\p{M}\p{N}]+/gu, '-')
        .replaceAll(/^-|-$/g, '')
}

IMO the slugify function should not handle percent encoding, instead leaving that up to the calling code (e.g. passing to URL constructor or URLSearchParams):

const slug = intlFriendlySlugify('三人行,必有我师焉') // "三人行-必有我师焉"
const origin = 'https://example.com'
const url1 = new URL(origin)
url1.pathname = join('blog', slug)
const url2 = new URL(origin)
url2.searchParams.set('q', slug)
const url3 = new URL(origin)
url3.hostname = `${slug}.example.com`

While it's true they're unreadable in the percent-encoded/punycode form, modern browsers display the human-readable form automatically on hover, or in the address bar when you open them. At some point it might also be worth adding a prettifyUrl function to allow similar conversion in userland, but I'm leaving that out of scope for this issue.

Environment

OS: Ubuntu 20.04, WSL deno version: 1.46.0 std version: text@1.0.4

lionel-rowe commented 2 weeks ago

Here's a rundown of how various platforms handle non-ASCII text in slugs:

Site Diacritics Non-Latin Notes
stackoverflow.com cartão de credito → cartão-de-credito
Unchanged
Python で特定の文字 → python-で特定の文字
Unchanged
wikipedia.org Maria Angélica Beraldo → Maria_​Angélica_​Beraldo
Unchanged
佩通坦·钦那瓦 → 佩通坦·钦那瓦
Unchanged
Wikipedia uses _ instead of - for spaces, but it's still a slug of sorts
tumblr.com wüst → wüst
Unchanged
Word of the Day: 久违 (Chinese) → word-of-the-day-久违-chinese
Unchanged
wordpress.org Actualización de mantenimiento → actualizacion-de-mantenimiento
Stripped
WordPress 6.6「Dorsey」发布 → wordpress-6-6dorsey发布
Unchanged
github.com Introducción → introducción
Unchanged
7. 兩岸詞典 /c/ → 7-兩岸詞典-c
Unchanged
Slugified section titles in README files in URL hash
medium.com Cómo salirse -> como-salirse
Stripped
跟我学中文! -> 跟我学中文
Unchanged
dev.to 2 años como Front-End Developer → 2-anos-como-front-end-developer
Stripped
データ・ストリーミング技術の概要 → deta​sutorimingu​ji-shu-no​gai-yao
Transliterated
This illustrates some of the problems with stripping diacritics and transliterating — 2 años means "2 years" whereas 2-anos means 2-anuses; meanwhile 技術 should be gijutsu in Japanese, not ji-shu

The transliteration option has one big advantage, namely that the URL remains legible in any context: plaintext files, IM platforms with limited rich-text features, etc. It also typically leads to shorter URLs compared to the percent-encoded version. Still, it's strictly worse when viewed in a browser address bar, adds a massive amount of complexity, including mappings for thousands of CJK characters, and often still leads to suboptimal results (as seen in the dev.to examples). That's probably why only dev.to uses it out of the 7 platforms I looked at.

As for diacritics, 3 of the 7 platforms strip them from Latin-script text, while the other 4 keep them. As with transliteration, stripping leads to more plaintext-friendly URLs; however, diacritics can be semantically important, also illustrated by the dev.to example.

timreichen commented 2 weeks ago

In the initial implementation it was discussed to port npm:slugify, so slugify('三人行,必有我师焉') === "" is actually expected.

The behavior you describe is probably better handled with @std/text/to-kebab-case

import { toKebabCase } from "@std/text/to-kebab-case";
console.log(toKebabCase("三人行-必有我师焉")); // "三人行-必有我师焉"
lionel-rowe commented 2 weeks ago

@timreichen In the state it was merged, that PR isn't a port of npm:slugify, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have slugify be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? npm:slugify has multiple open issues about its lack of support for various languages.

And if toKebabCase already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate slugify that only works for a subset of use cases?

timreichen commented 2 weeks ago

@timreichen In the state it was merged, that PR isn't a port of npm:slugify, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have slugify be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? npm:slugify has multiple open issues about its lack of support for various languages.

We removed the char mapping because the list was random. The problem is as you pointed out that there is no standard and the slugify functionality varies depending on the implementation.

And if toKebabCase already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate slugify that only works for a subset of use cases?

Every slugify function will be only work on a subset of use cases. That is why for example npm:slugify has so many options and one can add custom replacements etc. I think if there is a clean way to support other languages, that should be added. However, the slug must match [a-zA-Z0-9-]*.

lionel-rowe commented 2 weeks ago

However, the slug must match [a-zA-Z0-9-]*.

Why? Again, massive platforms like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr don't obey that rule, and browsers and web APIs handle non-ASCII URL components perfectly fine. Allowing "" as a slug is far more risky sanitization-wise than allowing arbitrary non-ASCII text, because the path /a//b normalizes to /a/b (and additionally, /a/ is often normalized to /a). Meanwhile, non-ASCII text can never clash with reserved characters, which always fall within the printable ASCII range.

lowlighter commented 2 weeks ago

Maybe we could just change the signature of slugify so users can provide their own strip regex ?

function slugify(input: string, strip = /[^a-zA-Z0-9\s-]/g): string

This way it doesn't really add much more complexity while offering a bit more liberty to end users (which would know best which charset they'd like to support) ?

slugify("déjà-vu", /[^a-zA-Z0-9\s-À-ÖØ-öø-ÿ]/g) // "déjà-vu" 
lionel-rowe commented 2 weeks ago

@lowlighter That seems to me like it's simultaneously too granular and not customizable enough. Too granular because I can't see any good reason why you'd want to allow some non-ASCII but not others; not customizable enough because it still doesn't provide any way of mapping.

Something like this could work:

// slugify.ts

export type SlugifyOptions = {
    /** @default {undefined} */
    charMap: Record<string, string> | undefined,
    /** @default {Boolean(options.charMap)} */
    stripUnknown: boolean,
    /** @default {Boolean(options.charMap || options.stripUnknown)} */
    stripDiacritics: boolean,
}

export function slugify(input: string, options?: Partial<SlugifyOptions>): string

// slugify_char_map.ts

// A comprehensive char mapping (transliteration) from some decently authoritative source
export const charMap = {
    // ...
    я: "ya",
    // ...
    鼎: "ding",
    // ...
}

If you really want to opt-in to the "nuke everything other than Basic Latin" option for some reason, you could still do that with slugify(..., { stripUnknown: true }) or slugify(..., { charMap: {} }).

As for "decently authoritative source" for the char map, I'm not sure what that would be. https://unicode-org.github.io/icu/userguide/transforms/general/ provides some notes on transliteration, which suggest that a simple charMap isn't really sufficient, but it looks like implementing proper transliteration is pretty complicated, so a char map could end up being the least-worst option (other than the actual least-worst option, which is just relying on percent-encoding to do its thing 😜)

iuioiua commented 2 weeks ago

I'm happy to defer to the consensus of others with this issue, but if we go down the route of having a character map, best it be a Map<string, string>.

lionel-rowe commented 1 week ago

Looks like the requisite ICU data for transliteration is here: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/translit/. With some truly disgusting regex-based """parsing""" of the data files, Intl.Segmenter-based word segmentation, and some gibberish strings of words in various languages, I can get decent-ish results (this uses ~500kb of very un-optimized mapping data for the char map):

[vi]
Bằng khác byte phần bảng ký sun hợp của tự.
bang-khac-byte-phan-bang-ky-sun-hop-cua-tu

[zh]
解决始终这些统一,大部分,既成事实节字体编码的。
jiejue-shizhong-zhexie-tongyi-dabufen-chengshishi-jie-ziti-bianma-de

[de]
Ging loslassen Steuerzeichen in auf übersetzt, Phaistos Pau und Ugaritisch.
ging-loslassen-steuerzeichen-in-auf-ubersetzt-phaistos-pau-und-ugaritisch

[es]
Modificaciones equivalencia que esquemas alquímicos ha bits vez de una.
modificaciones-equivalencia-que-esquemas-alquimicos-ha-bits-vez-de-una

[ru]
Бит данных частично характерный текст, на в с клавиатуры из.
bit-dannyx-chastichno-xarakternyi-tekst-na-v-s-klaviatury-iz

[ar]
رمز والأسلوب العالم، متناسقة يتكون الراغبون، التفريق كترميزات في النصوص.
rmz-walaslwb-alalm-mtnasqh-ytkwn-alraghbwn-altfryq-ktrmyzat-fy-alnsws

[ja]
韓国利用運用文字のをれドキュメントマッピングと。
hanguo-liyong-yunyong-wenzi-no-o-re-tokyumentomahin-ku-to

[el]
Υπολογιστή ωστόσο αφήνει όλες βασίζονται προβλήματα στο κωδικοποίησης την για.
ypologiste-ostoso-apheni-oles-basizondai-problemata-sto-kodikopoieses-ten-gia

Limitations: Japanese will always give bad results for Kanji; Arabic lacks most vowels (I think that's due to the vowels not being indicated in the first place, so no way round that); the Greek is currently based on Ancient Greek transliteration, but I think that can be fixed by ignoring certain input files.

lionel-rowe commented 1 week ago

OK, switched to using a custom peggy grammar to parse the ICU mappings, and now getting what seem to be decent results now for all the languages I've tested (as good/better than results from other general-purpose transliteration libraries I've looked at).

Comparison:

Language Example @std/slugify@next (with charMap option) npm:slugify npm:transliterate
Amharic በርበሬን ከላመ ከሞተ አግኝተሽው ዋጥ ስልቅጥ አድርገሽ ከምኔው ጨረሽው beribereni-kelame-kemote-ginyiteshiwi-wati-silikiti-dirigeshi-keminewi-chereshiwi barebareene-kalaama-kamota-agenyetashewe-waathe-seleqethe-aderegashe-kameneewe-charashewe
Arabic الحركة الدولية للدفاع عن الأطفال الفلسطينين ضد بايدن alhrkh-aldwlyh-lldfa-n-alatfal-alflstynyn-zd-baydn alhrkh-aldwlyh-lldfaa-an-alatfal-alflstynyn-dhd-baydn lhrk-ldwly-lldfaa-aan-l-tfl-lflstynyn-dd-bydn
German Leichtathletik-Weltmeisterschaften 2007/Teilnehmer (Liechtenstein) leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein Leichtathletik-Weltmeisterschaften-2007Teilnehmer-(Liechtenstein) leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein
Greek Βραβείο Καλύτερου Διευθυντή Φωτογραφίας της Ένωσης Διαδικτυακών Κριτικών Κινηματογράφου vravio-kaliterou-dhievthindi-fotografias-tis-enosis-dhiadhiktiakon-kritikon-kinimatografou Brabeio-Kalyteroy-Diey8ynth-Fwtografias-ths-Enwshs-Diadiktyakwn-Kritikwn-Kinhmatografoy vraveio-kalyteroy-dieythynti-fotografias-tis-enosis-diadiktyakon-kritikon-kinimatografoy
Spanish Temporada 2018 del Campeonato Brasileño de Motovelocidade temporada-2018-del-campeonato-brasileno-de-motovelocidade Temporada-2018-del-Campeonato-Brasileno-de-Motovelocidade temporada-2018-del-campeonato-brasileno-de-motovelocidade
Hindi संयुक्त अरब अमीरात क्रिकेट टीम का स्कॉटलैंड दौरा 2016 samyaukata-araba-amairaata-karaikaeta-taima-kaa-sakaotalaaimda-daauraa-2016 2016 snyukt-arb-amiiraat-krikett-ttiim-kaa-skonttlaindd-dauraa-2016
Icelandic Alfreð Clausen syngur lög eftir Jenna Jónsson alfred-clausen-syngur-log-eftir-jenna-jonsson Alfred-Clausen-syngur-log-eftir-Jenna-Jonsson alfred-clausen-syngur-log-eftir-jenna-jonsson
Japanese コンティニュイング・ケア・リタイアメント・コミュニティ konte-ni-ingu-kea-ritaiamento-komyunite konteiniyuingukeari​taiamentokomiyunitei
Russian 500 величайших альбомов всех времён по версии журнала Rolling Stone 500-velichayshix-albomov-vsex-vremyon-po-versii-jurnala-rolling-stone 500-velichajshih-albomov-vseh-vremyon-po-versii-zhurnala-Rolling-Stone 500-velichayshih-albomov-vseh-vremyon-po-versii-zhurnala-rolling-stone
Thai จังหวัดมุกดาหารในการเลือกตั้งสมาชิกสภาผู้แทนราษฎรไทยเป็นการทั่วไป พ.ศ. 2562 canghwad-mukdahar-in-kar-eluxk-tang-smachik-spha-phu-aethn-radr-ithy-epnkar-thawip-ph-s-2562 ..-2562 cchanghwadmukdaa​haarainkaareluue-ktangsmaachiksphaa​phuuaethnraasdrai​thyepnkaarthawaip-ph.s.-2562
Vietnamese Cục Phát thanh, truyền hình và thông tin điện tử (Việt Nam) cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam Cuc-Phat-thanh-truyen-hinh-va-thong-tin-djien-tu-(Viet-Nam) cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam
Chinese 2020年夏季奧林匹克運動會輕艇女子500公尺單人愛斯基摩艇比賽 2020-nian-xiaji-aolinpike-yundonghui-qing-ting-nuzi-500-gongchi-danren-aisijimo-ting-bisai 2020500 2020nian-xia-ji-ao-lin-pi-ke-yun-dong-hui-qing-ting-nu-zi-500gong-chi-dan-ren-ai-si-ji-mo-ting-bi-sai

npm:slugify gives empty results for Amharic, Hindi, Thai, Chinese, and Japanese, and also has some extremely questionable choices for Greek (θ → 8, ω → w, η → h). npm:transliterate gives more concise (possibly better?) results for Hindi but gives even less vowel-ey results for Arabic, lacks spacing for Japanese/Thai (I added ZWSPs to stop the table breaking the layout), and has suboptimal spacing for Chinese.

Char map is ~213KB (un-minified, un-gzipped).

IMO those are "good enough" results at this stage (given that the default will be not to transliterate), but it'd be good to get some input from speakers of a few more of these languages. You can also try it out with other languages here: https://dash.deno.com/playground/slugify

lionel-rowe commented 1 week ago

Upon testing more languages (list taken from npm:any-ascii's examples), we're still missing Braile (which I think is safe to omit as I think web content written in Braile must be vanishingly rare? Someone please correct if I'm wrong) and at least 3 South-East Asian languages (Burmese/Myanmar, Khmer, Lao). Also the Korean example looks a bit sus compared to the other versions. Also npm:any-ascii now seems to be best-in-class for JS transliteration, at least from the ones I've found.

kt3k commented 1 week ago

While I personally found these researches interesting, I think it's difficult to do these transliterations in an unopinionated way. Also it seems difficult to maintain them as the maintainers are not knowledgeable about many of these languages. I'd consider the handlings of non-latin alphabet languages are out of scope of this API.

lionel-rowe commented 1 week ago

I'd consider the handlings of non-latin alphabet languages are out of scope of this API.

@kt3k My main concern isn't that non-Latin script should have special handling, rather that it should be passed through rather than removed (and especially that it shouldn't be removed as a default option). It's worth mentioning that in its current state, slugify doesn't even handle fully-Latin-alphabet text properly — for example, various alphabetic chars like [ßĐæø] are removed (Blöße becomes bloe, Trần Hưng Đạo becomes tran-hung-ao, Nærøy becomes nry).

I only started looking into transliteration, which IMO is a less-good option compared to pass-through (not to mention significantly less common in-the-wild), as an alternative.

With all that said... I'm inclined to think you're probably right. Further, @lowlighter 's suggestion of a strip regex is a useful option after all, but with suggested regexes being exported from the package itself (roll-your-own is probably less useful).

With that option, you can easily implement pass-through (default), strip, strip-diacritics, or even strip-only-ascii-diacritics behavior. The regex would be run against the NFD form so it could easily deal with diacritics:

export const NON_WORD = /[^\p{L}\p{M}\p{N}\-]+/gu;
export const DIACRITICS = /[^\p{L}\p{N}\-]+/gu;
export const ASCII_DIACRITICS = /(?<=[a-zA-Z])\p{M}+|[^\p{L}\p{M}\p{N}\-]+/gu;
export const NON_ASCII = /[^0-9a-zA-Z\-]/g;

// NON_WORD
assertEquals(slugify("déjà-vu"), "déjà-vu");
assertEquals(slugify("Συστημάτων Γραφής"), "συστημάτων-γραφής");

assertEquals(slugify("déjà-vu", { strip: DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: DIACRITICS }), "συστηματων-γραφης");

assertEquals(slugify("déjà-vu", { strip: ASCII_DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: ASCII_DIACRITICS }), "συστημάτων-γραφής");

assertEquals(slugify("déjà-vu", { strip: NON_ASCII }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: NON_ASCII }), "-");

Further, you could easily use a third-party transliteration library along with strip: NON_ASCII:

import transliterate from 'npm:any-ascii'

assertEquals(slugify(transliterate("Συστημάτων Γραφής"), { strip: NON_ASCII }), "systimaton-grafis");
kt3k commented 1 week ago

Ah ok. strip option sounds good to me. Looks like a balanced solution between added complexity and practicality.