CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Wrong segmentation in Japanese #591

Closed: hamano closed this issue 1 month ago

hamano commented 6 months ago

Thank you for the release of v1.1.0. I was looking forward to the ranking customization feature. However, it does not seem to work in Japanese: when I set the ranking options and perform a search on content with lang="ja", the following error occurs and the ranking options are not applied.

Uncaught (in promise) TypeError: wasm.set_ranking_weights is not a function
    at __exports.set_ranking_weights (pagefind.js:1:2087)
    at PagefindInstance.set_ranking (pagefind.js:1:19537)
    at async PagefindInstance.init (pagefind.js:1:20135)
    at async Pagefind.init (pagefind.js:9:1384)
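
For reference, this is roughly how the ranking options are being set on my side (a sketch only: the weight values are placeholders, and the option names are those I understand the v1.1.0 ranking customization documentation to use):

// Load the Pagefind bundle from its default output path and set the
// ranking weights before searching. The values below are only examples.
const pagefind = await import("/pagefind/pagefind.js");
await pagefind.options({
    ranking: {
        termFrequency: 1.0,
        pageLength: 0.75,
        termSaturation: 1.4,
    },
});
const search = await pagefind.search("keyword");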

No error is output when --force-language en is specified, and the ranking behaves as expected. Any advice would be greatly appreciated. Kind regards.

bglw commented 6 months ago

👋 This isn't an error I would expect to see, due to the way these WebAssembly modules are bundled, and the location of that function.

My first guess would be that you have the prior version of the Japanese WebAssembly module cached. Could you try a hard reload of the page (or clearing your browser cache for the site)?

Let me know if you're still seeing the issue after that.

hamano commented 6 months ago

Thank you for your response. As you pointed out, the error was resolved with a hard reload of the browser. However, the issue with the ranking persists, so I would appreciate any advice you can provide.

Creating an index for English provides the expected ranking.

$ pagefind_extended --force-language en
  1. Pages that are longer and contain more keywords. score: 25.417072, words: (29) [1, 3, 13, 16, 20, 24, 33, 92, 101, 113, 118, 133, 158, 160, 232, 275, 354, 363, 374, 378, 393, 401, 409, 415, 419, 421, 423, 425, 427], word_count: 428
  2. Pages that are shorter and contain fewer keywords. score: 19.279419, words: (13) [0, 7, 11, 24, 25, 30, 36, 40, 45, 46, 47, 51, 53], word_count: 55

However, when creating an index for Japanese, the order does not match the expected one.

$ pagefind_extended --force-language ja
  1. Pages that are shorter and contain fewer keywords. score: 7.1849093, words: (8) [287, 290, 312, 331, 351, 405, 413, 419], word_count: 469
  2. Pages that are longer and contain more keywords. score: 2.3622553, words: [1633], word_count: 2485

Since the number of word hits is noticeably low, this might not be an issue with ranking customization, but rather with the Japanese word segmentation. Is there a way to debug the results of word segmentation in detail? Any advice on this would be greatly appreciated.

bglw commented 5 months ago

Interesting! If you have a test page to share I'm happy to help look into it :)

Is there a way to debug the results of word segmentation in detail?

Currently you can look at the zero-width space characters in the raw_content field returned with the Pagefind fragment. For extended languages, Pagefind doesn't split on standard whitespace, and instead splits on these zero-width spaces that it inserts after segmentation.

For a quick example, you can replace the \u200B zero-width space character and log the result, e.g.:

result.raw_content.replace(/\u200B/g, '🍕')

Which will output something like (testing on https://starlight.astro.build/ja/):

Starlight🍕ショーケース🍕. 🍕自分🍕の🍕もの🍕を🍕追加🍕しよ🍕う🍕！ 🍕Starlight🍕で🍕サイト🍕を🍕作成🍕し🍕まし🍕た🍕か🍕？🍕この🍕ページ🍕に🍕リンク🍕を🍕追加🍕する🍕PR🍕を🍕作成🍕し🍕ましょ🍕う🍕！ 🍕サイト🍕. 🍕Starlight🍕は🍕すでに🍕本番🍕環境🍕で🍕使用🍕さ🍕れ🍕て🍕い🍕ます🍕。🍕以下🍕は🍕、🍕ウェブ🍕上🍕の🍕いくつ🍕か🍕の🍕サイト🍕です🍕。 🍕Athena 🍕OS🍕. 🍕PubIndexAPI 🍕Docs🍕. 🍕pls🍕. 🍕capo.js🍕. 🍕Web 🍕Monetization 🍕API🍕. 🍕QBCore 🍕Docs🍕. 🍕har.fyi🍕. 🍕xs🍕-🍕dev 🍕docs🍕. 🍕Felicity🍕. 🍕NgxEditor🍕. 🍕Astro 🍕Error 🍕Pages🍕. 🍕Terrateam 🍕Docs🍕. 🍕simple🍕-🍕fm🍕. 🍕Obytes 🍕Starter🍕. 🍕Kanri🍕. 🍕VRCFR 🍕Creator🍕. 🍕Refact🍕. 🍕Some 🍕drops 🍕of 🍕PHP 🍕Book🍕. 🍕Nostalgist.js🍕. 🍕AI 🍕Prompt 🍕Snippets🍕. 🍕Folks 🍕Router🍕. 🍕React 🍕Awesome 🍕Reveal🍕. 🍕Ethereum 🍕Follow 🍕Protocol🍕. 🍕Knip🍕. 🍕secco🍕. 🍕SiteOne 🍕Crawler🍕. 🍕csmos🍕. 🍕TanaFlows 🍕Docs🍕. 🍕Concepto 🍕AI🍕. 🍕Mr🍕. 🍕Robøt🍕. 🍕Open 🍕SaaS 🍕Docs🍕. 🍕Astro 🍕Snipcart🍕. 🍕Astro🍕-🍕GhostCMS🍕. 🍕oneRepo🍕. 🍕Flojoy🍕. 🍕AstroNvim🍕. 🍕ScreenshotOne 🍕Docs🍕. 🍕DipSway🍕. 🍕RunsOn🍕. 🍕SudoVanilla🍕. 🍕SST 🍕Ion🍕. 🍕Font 🍕Awesome🍕. 🍕Starlight🍕を🍕使用🍕し🍕て🍕いる🍕パブリック🍕な🍕プロジェクト🍕の🍕GitHub🍕リポジトリ🍕を🍕確認🍕し🍕て🍕み🍕て🍕ください🍕。

With that you can see how the words were segmented. Note that this only works for the "extended" languages such as ja / zh.
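
Putting that together, a minimal sketch for checking segmentation from the browser console (assuming the bundle is served from the default /pagefind/ path, and using an arbitrary search term that matches the page):

// Run a search, load the first result's fragment, and make the
// zero-width-space word boundaries visible.
const pagefind = await import("/pagefind/pagefind.js");
const search = await pagefind.search("サイト");
const fragment = await search.results[0].data();
console.log(fragment.raw_content.replace(/\u200B/g, "|"));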

hamano commented 5 months ago

It seems that there is indeed an issue with the segmentation of Japanese. Here is an example of such content:

<span>OpenSSL</span>
<span>OpenSsl</span>

Creating an index for this content with --force-language en and searching for "ssl" yields the expected 2 hits.

words: (2) [0, 1]
word_count: 2
raw_content: "OpenSSL OpenSsl"

However, with --force-language ja, it is segmented as follows, and only "OpenSsl" matches a search for "ssl".

words: [5]
word_count: 6
raw_content.replace(/\u200B/g, '|'): "Open|S|S|L |Open|Ssl"

It appears that words like "OpenSSL" are not being correctly segmented.

hamano commented 5 months ago

@bglw I noticed something odd in the segmented words you provided as an example.

๐Ÿ•Astro๐Ÿ•-๐Ÿ•GhostCMS๐Ÿ•
<span>Astro-GhostCMS</span>

This content is segmented in my environment as follows:

|Astro|-|Ghost|C|M|S|

What differences between our environments could explain this?

hamano commented 5 months ago

It seems this is an issue caused by charabia.

main.rs:

use std::env;
use charabia::Segment;

fn main() {
    // Take the text to segment from the first command-line argument.
    let arg = env::args().nth(1).unwrap();
    // Segment it with charabia and join the tokens with "|" so the
    // word boundaries are visible.
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments);
}

$ cargo run "OpenSSL: Cryptography and SSL/TLS Toolkit"
Open|S|S|L|:| |Cryptography| |and| |SSL|/|TLS| |Toolkit
$ cargo run "OpenSSLใฏๆš—ๅทๅŒ–ใจSSL/TLSใฎ็‚บใฎใƒ„ใƒผใƒซใ‚ญใƒƒใƒˆใงใ™ใ€‚"
Open|S|S|L|ใฏ|ๆš—ๅท|ๅŒ–|ใจ|SSL|/|TLS|ใฎ|็‚บ|ใฎ|ใƒ„ใƒผใƒซ|ใ‚ญใƒƒใƒˆ|ใงใ™|ใ€‚
hamano commented 5 months ago

I've done some investigation into charabia. The latin-camelcase feature appears to be the culprit.

With the default features:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl

With all features disabled:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

I think camelCase words in a Japanese sentence are proper nouns, so there is no reason to split them. Therefore, I propose disabling the default features and enabling only the Japanese and Chinese ones (see the Cargo.toml sketch at the end of this comment).

With only the chinese and japanese features enabled:

$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl

$ cargo run "OpenSSLใฏๆš—ๅทๅŒ–ใจSSL/TLSใฎ็‚บใฎใƒ„ใƒผใƒซใ‚ญใƒƒใƒˆใงใ™ใ€‚"
OpenSSL|ใฏ|ๆš—ๅท|ๅŒ–|ใจ|SSL|/|TLS|ใฎ|็‚บ|ใฎ|ใƒ„ใƒผใƒซ|ใ‚ญใƒƒใƒˆ|ใงใ™|ใ€‚
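
Concretely, the dependency change I have in mind would be roughly the following (a sketch only; the version number and the exact feature names should be double-checked against charabia's Cargo.toml):

[dependencies]
# Opt out of charabia's default features (which include latin-camelcase)
# and re-enable only the CJK segmentation support needed here.
charabia = { version = "0.8", default-features = false, features = ["chinese", "japanese"] }
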
bglw commented 1 month ago

Released in v1.1.1 🙂