CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Wrong segmentation in Japanese #591

Closed: hamano closed this issue 1 month ago

hamano commented 6 months ago

Thank you for the release of v1.1.0. I was looking forward to the ranking customization feature. However, it does not seem to work in Japanese: when I set the ranking options and perform a search on content with lang="ja", the following error occurs and the ranking options are not applied.

Uncaught (in promise) TypeError: wasm.set_ranking_weights is not a function
    at __exports.set_ranking_weights (pagefind.js:1:2087)
    at PagefindInstance.set_ranking (pagefind.js:1:19537)
    at async PagefindInstance.init (pagefind.js:1:20135)
    at async Pagefind.init (pagefind.js:9:1384)
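
For reference, this is roughly how the ranking options are being set on my side (a sketch only: the weight values are placeholders, and the option names are those I understand the v1.1.0 ranking customization documentation to use):

// Load the Pagefind bundle from its default output path and set the
// ranking weights before searching. The values below are only examples.
const pagefind = await import("/pagefind/pagefind.js");
await pagefind.options({
    ranking: {
        termFrequency: 1.0,
        pageLength: 0.75,
        termSaturation: 1.4,
    },
});
const search = await pagefind.search("keyword");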

No error is output when --force-language en is specified, and the ranking behaves as expected. Any advice would be greatly appreciated. Kind regards.

bglw commented 6 months ago

👋 This isn't an error I would expect to see, due to the way these WebAssembly modules are bundled, and the location of that function.

My first guess would be that you have the prior version of the Japanese WebAssembly module cached. Could you try a hard reload of the page (or clearing your browser cache for the site)?

Let me know if you're still seeing the issue after that.

hamano commented 6 months ago

Thank you for your response. As you pointed out, the error was resolved with a hard reload of the browser. However, the issue with the ranking persists, so I would appreciate any advice you can provide.

Creating an index for English provides the expected ranking.

$ pagefind_extended --force-language en
  1. Pages that are longer and contain more keywords. score: 25.417072, words: (29) [1, 3, 13, 16, 20, 24, 33, 92, 101, 113, 118, 133, 158, 160, 232, 275, 354, 363, 374, 378, 393, 401, 409, 415, 419, 421, 423, 425, 427], word_count: 428
  2. Pages that are shorter and contain fewer keywords. score: 19.279419, words: (13) [0, 7, 11, 24, 25, 30, 36, 40, 45, 46, 47, 51, 53], word_count: 55

However, when creating an index for Japanese, the order does not match the expected one.

$ pagefind_extended --force-language ja
  1. Pages that are shorter and contain fewer keywords. score: 7.1849093, words: (8) [287, 290, 312, 331, 351, 405, 413, 419], word_count: 469
  2. Pages that are longer and contain more keywords. score: 2.3622553, words: [1633], word_count: 2485

Since the number of word hits is noticeably low, this might not be an issue with ranking customization, but rather with the Japanese word segmentation. Is there a way to debug the results of word segmentation in detail? Any advice on this would be greatly appreciated.

bglw commented 5 months ago

Interesting! If you have a test page to share I'm happy to help look into it :)

Is there a way to debug the results of word segmentation in detail?

Currently you can look at the zero-width space characters in the raw_content field returned with the Pagefind fragment. For extended languages, Pagefind doesn't split on standard whitespace, and instead splits on these zero-width spaces that it inserts after segmentation.

For a quick example, you can replace the \u200B zero-width space character and log the result, e.g.:

result.raw_content.replace(/\u200B/g, '🍕')

Which will output something like (testing on https://starlight.astro.build/ja/):

Starlight🍕ショーケース🍕. 🍕自分🍕の🍕もの🍕を🍕追加🍕しよ🍕う🍕！ 🍕Starlight🍕で🍕サイト🍕を🍕作成🍕し🍕まし🍕た🍕か🍕？🍕この🍕ページ🍕に🍕リンク🍕を🍕追加🍕する🍕PR🍕を🍕作成🍕し🍕ましょ🍕う🍕！ 🍕サイト🍕. 🍕Starlight🍕は🍕すでに🍕本番🍕環境🍕で🍕使用🍕さ🍕れ🍕て🍕い🍕ます🍕。🍕以下🍕は🍕、🍕ウェブ🍕上🍕の🍕いくつ🍕か🍕の🍕サイト🍕です🍕。 🍕Athena 🍕OS🍕. 🍕PubIndexAPI 🍕Docs🍕. 🍕pls🍕. 🍕capo.js🍕. 🍕Web 🍕Monetization 🍕API🍕. 🍕QBCore 🍕Docs🍕. 🍕har.fyi🍕. 🍕xs🍕-🍕dev 🍕docs🍕. 🍕Felicity🍕. 🍕NgxEditor🍕. 🍕Astro 🍕Error 🍕Pages🍕. 🍕Terrateam 🍕Docs🍕. 🍕simple🍕-🍕fm🍕. 🍕Obytes 🍕Starter🍕. 🍕Kanri🍕. 🍕VRCFR 🍕Creator🍕. 🍕Refact🍕. 🍕Some 🍕drops 🍕of 🍕PHP 🍕Book🍕. 🍕Nostalgist.js🍕. 🍕AI 🍕Prompt 🍕Snippets🍕. 🍕Folks 🍕Router🍕. 🍕React 🍕Awesome 🍕Reveal🍕. 🍕Ethereum 🍕Follow 🍕Protocol🍕. 🍕Knip🍕. 🍕secco🍕. 🍕SiteOne 🍕Crawler🍕. 🍕csmos🍕. 🍕TanaFlows 🍕Docs🍕. 🍕Concepto 🍕AI🍕. 🍕Mr🍕. 🍕Robøt🍕. 🍕Open 🍕SaaS 🍕Docs🍕. 🍕Astro 🍕Snipcart🍕. 🍕Astro🍕-🍕GhostCMS🍕. 🍕oneRepo🍕. 🍕Flojoy🍕. 🍕AstroNvim🍕. 🍕ScreenshotOne 🍕Docs🍕. 🍕DipSway🍕. 🍕RunsOn🍕. 🍕SudoVanilla🍕. 🍕SST 🍕Ion🍕. 🍕Font 🍕Awesome🍕. 🍕Starlight🍕を🍕使用🍕し🍕て🍕いる🍕パブリック🍕な🍕プロジェクト🍕の🍕GitHub🍕リポジトリ🍕を🍕確認🍕し🍕て🍕み🍕て🍕ください🍕。

With that you can see how the words were segmented. Note that this only works for the "extended" languages such as ja / zh.
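
Putting that together, a minimal sketch for checking segmentation from the browser console (assuming the bundle is served from the default /pagefind/ path, and using an arbitrary search term that matches the page):

// Run a search, load the first result's fragment, and make the
// zero-width-space word boundaries visible.
const pagefind = await import("/pagefind/pagefind.js");
const search = await pagefind.search("サイト");
const fragment = await search.results[0].data();
console.log(fragment.raw_content.replace(/\u200B/g, "|"));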

hamano commented 5 months ago

It seems that there is indeed an issue with the segmentation of Japanese. Here is an example of such content:

<span>OpenSSL</span>
<span>OpenSsl</span>

Creating an index for this content with --force-language en and searching for "ssl" yields the expected 2 hits.

words: (2) [0, 1]
word_count: 2
raw_content: "OpenSSL OpenSsl"

However, with --force-language ja, it is segmented as follows, and only "OpenSsl" matches a search for "ssl".

words: [5]
word_count: 6
raw_content.replace(/\u200B/g, '|'): "Open|S|S|L |Open|Ssl"

It appears that words like "OpenSSL" are not being correctly segmented.

hamano commented 5 months ago

@bglw I noticed something odd in the segmented words you provided as an example.

๐Ÿ•Astro๐Ÿ•-๐Ÿ•GhostCMS๐Ÿ•
<span>Astro-GhostCMS</span>

This content is segmented in my environment as follows:

|Astro|-|Ghost|C|M|S|

What differences between our environments could explain this?

hamano commented 5 months ago

It seems this is an issue caused by charabia.

main.rs:

use std::env;
use charabia::Segment;

fn main() {
    // Take the text to segment from the first command-line argument.
    let arg = env::args().nth(1).unwrap();
    // Segment it with charabia and join the tokens with "|" so the
    // word boundaries are visible.
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments);
}

$ cargo run "OpenSSL: Cryptography and SSL/TLS Toolkit"
Open|S|S|L|:| |Cryptography| |and| |SSL|/|TLS| |Toolkit
$ cargo run "OpenSSLใฏๆš—ๅทๅŒ–ใจSSL/TLSใฎ็‚บใฎใƒ„ใƒผใƒซใ‚ญใƒƒใƒˆใงใ™ใ€‚"
Open|S|S|L|ใฏ|ๆš—ๅท|ๅŒ–|ใจ|SSL|/|TLS|ใฎ|็‚บ|ใฎ|ใƒ„ใƒผใƒซ|ใ‚ญใƒƒใƒˆ|ใงใ™|ใ€‚
hamano commented 5 months ago

I've done some investigation into charabia. The latin-camelcase feature appears to be the culprit.

With the default features:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl

With all features disabled:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

I think camelCase words in a Japanese sentence are proper nouns, so there is no reason to split them. Therefore, I propose disabling the default features and enabling only the Japanese and Chinese ones (see the Cargo.toml sketch at the end of this comment).

With only the chinese and japanese features enabled:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

$ cargo run "OpenSSLใฏๆš—ๅทๅŒ–ใจSSL/TLSใฎ็‚บใฎใƒ„ใƒผใƒซใ‚ญใƒƒใƒˆใงใ™ใ€‚"
OpenSSL|ใฏ|ๆš—ๅท|ๅŒ–|ใจ|SSL|/|TLS|ใฎ|็‚บ|ใฎ|ใƒ„ใƒผใƒซ|ใ‚ญใƒƒใƒˆ|ใงใ™|ใ€‚
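
Concretely, the dependency change I have in mind would be roughly the following (a sketch only; the version number and the exact feature names should be double-checked against charabia's Cargo.toml):

[dependencies]
# Opt out of charabia's default features (which include latin-camelcase)
# and re-enable only the CJK segmentation support needed here.
charabia = { version = "0.8", default-features = false, features = ["chinese", "japanese"] }
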
bglw commented 1 month ago

Released in v1.1.1 🙂