Closed hamano closed 1 month ago
This isn't an error I would expect to see, given the way these WebAssembly modules are bundled and the location of that function.
My first guess would be that you have the prior version of the Japanese WebAssembly cached. Could you try the following?

- Deleting the /pagefind/ directory from your site

Let me know if you're still seeing the issue after those steps.
Thank you for your response. As you pointed out, the error was resolved with a hard reload of the browser. However, the issue with the ranking persists, so I would appreciate any advice you can provide.
Creating an index for English provides the expected ranking.
$ pagefind_extended --force-language en
However, when creating an index for Japanese, the results do not come back in the expected order.
$ pagefind_extended --force-language ja
Since the number of word hits is noticeably low, this might not be an issue with the ranking customization but rather with the Japanese word segmentation. Is there a way to debug the segmentation results in detail? Any advice on this would be greatly appreciated.
Interesting! If you have a test page to share I'm happy to help look into it :)
Is there a way to debug the results of word segmentation in detail?
Currently you can look at the zero-width space characters in the raw_content field returned with the Pagefind fragment. For extended languages, Pagefind doesn't split on standard whitespace; instead it splits on these zero-width spaces, which it inserts after segmentation.
For a quick example, you can replace the \u200B zero-width space character with a visible marker and log the result, e.g.:

result.raw_content.replace(/\u200B/g, '|')
Which will output something like (testing on https://starlight.astro.build/ja/):
Starlight|ショーケース|. |自分|の|もの|を|追加|しよ|う|！ |Starlight|で|サイト|を|作成|し|まし|た|か|？|この|ページ|に|リンク|を|追加|する|PR|を|作成|し|ましょ|う|！ |サイト|. |Starlight|は|すでに|本番|環境|で|使用|さ|れ|て|い|ます|。|以下|は|、|ウェブ|上|の|いくつ|か|の|サイト|です|。 |Athena |OS|. |PubIndexAPI |Docs|. |pls|. |capo.js|. |Web |Monetization |API|. |QBCore |Docs|. |har.fyi|. |xs|-|dev |docs|. |Felicity|. |NgxEditor|. |Astro |Error |Pages|. |Terrateam |Docs|. |simple|-|fm|. |Obytes |Starter|. |Kanri|. |VRCFR |Creator|. |Refact|. |Some |drops |of |PHP |Book|. |Nostalgist.js|. |AI |Prompt |Snippets|. |Folks |Router|. |React |Awesome |Reveal|. |Ethereum |Follow |Protocol|. |Knip|. |secco|. |SiteOne |Crawler|. |csmos|. |TanaFlows |Docs|. |Concepto |AI|. |Mr|. |Robøt|. |Open |SaaS |Docs|. |Astro |Snipcart|. |Astro|-|GhostCMS|. |oneRepo|. |Flojoy|. |AstroNvim|. |ScreenshotOne |Docs|. |DipSway|. |RunsOn|. |SudoVanilla|. |SST |Ion|. |Font |Awesome|. |Starlight|を|使用|し|て|いる|パブリック|な|プロジェクト|の|GitHub|リポジトリ|を|確認|し|て|み|て|ください|。
With that you can see how the words were segmented. Note that this only works for the "extended" languages such as ja / zh.
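To illustrate how that splitting behaves, here is a small standalone sketch. The rawContent string is a made-up sample in the shape Pagefind returns for extended languages, not real Pagefind output:

```javascript
// Hypothetical raw_content for an extended language: Pagefind joins
// segmented words with U+200B zero-width spaces rather than plain spaces.
const rawContent = "Open\u200BS\u200BS\u200BL \u200BOpen\u200BSsl";

// Make the segmentation visible by swapping each zero-width space for a pipe.
const visible = rawContent.replace(/\u200B/g, "|");
console.log(visible); // "Open|S|S|L |Open|Ssl"

// Count words the way the extended splitter sees them: split on runs of
// zero-width spaces and ordinary whitespace, dropping empty pieces.
const words = rawContent.split(/[\u200B\s]+/).filter(Boolean);
console.log(words.length); // 6
```

This matches the word_count of 6 reported later in this thread for the same content.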
It seems that there is indeed an issue with the segmentation of Japanese. Here is an example of such content:
<span>OpenSSL</span>
<span>OpenSsl</span>
Creating an index for this content with --force-language en and searching for "ssl" yields the expected 2 hits.
words: (2) [0, 1]
word_count: 2
raw_content: "OpenSSL OpenSsl"
However, with --force-language ja, it is segmented as follows, and only "OpenSsl" hits for "ssl".
words: [5]
word_count: 6
raw_content.replace(/\u200B/g, '|'): "Open|S|S|L |Open|Ssl"
It appears that words like "OpenSSL" are not being correctly segmented.
@bglw I noticed something odd about one of the example words you provided.
|Astro|-|GhostCMS|
<span>Astro-GhostCMS</span>
This content is segmented in my environment as follows:
|Astro|-|Ghost|C|M|S|
What differences between our environments could explain this?
It seems this is an issue caused by charabia.
main.rs:
use std::env;
use charabia::Segment;

fn main() {
    // Segment the first CLI argument and join the tokens with "|".
    let arg = env::args().nth(1).unwrap();
    let segments = arg.as_str().segment_str().collect::<Vec<&str>>().join("|");
    println!("{}", segments);
}
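For anyone wanting to reproduce this, the example above only needs charabia as a dependency. A minimal Cargo.toml sketch (the version number is illustrative; check crates.io for the current release):

```toml
[package]
name = "charabia-segment-demo"
version = "0.1.0"
edition = "2021"

[dependencies]
# Default features enabled, matching the behaviour described below.
charabia = "0.8"
```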
$ cargo run "OpenSSL: Cryptography and SSL/TLS Toolkit"
Open|S|S|L|:| |Cryptography| |and| |SSL|/|TLS| |Toolkit
$ cargo run "OpenSSLは暗号化とSSL/TLSの為のツールキットです。"
Open|S|S|L|は|暗号|化|と|SSL|/|TLS|の|為|の|ツール|キット|です|。
I've done some digging in charabia. The latin-camelcase feature appears to be the culprit.
Default features:
$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
Open|S|S|L| |Open|Ssl| |open|Ssl| |open|_|ssl
All features disabled:
$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl
Camel-case words appearing in Japanese sentences are usually proper nouns, so there is no reason to split them. Therefore, I propose disabling the default features and enabling only the Japanese and Chinese ones.
Only the chinese and japanese features enabled:
$ cargo run "OpenSSL OpenSsl openSsl open_ssl"
OpenSSL| |OpenSsl| |openSsl| |open|_|ssl
$ cargo run "OpenSSLは暗号化とSSL/TLSの為のツールキットです。"
OpenSSL|は|暗号|化|と|SSL|/|TLS|の|為|の|ツール|キット|です|。
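If this proposal were adopted, the dependency declaration would change to something like the following (feature names as of charabia 0.8; worth double-checking against the version Pagefind actually pins):

```toml
[dependencies]
# Opt out of latin-camelcase (and the other default segmenters), keeping
# only the CJK segmentation that Pagefind's extended build needs.
charabia = { version = "0.8", default-features = false, features = ["chinese", "japanese"] }
```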
Released in v1.1.1
Thank you for the release of v1.1.0. I was looking forward to the ranking customization feature. However, it does not seem to work for Japanese. When I set the ranking options and perform a search on content with lang="ja", the following error occurs and the ranking options are not applied.
No error is output when --force-language en is specified, and the ranking behaves as expected. Any advice would be greatly appreciated. Kind regards.