Issue with building a search index that contains Chinese content #1102

Closed · liushuyu closed this 4 years ago

liushuyu commented 4 years ago

Bug Report

Environment

Zola version: 0.11.0 (from crates.io)

Expected Behavior

The search index builds correctly with no errors, since the underlying crate (elasticlunr-rs) supports Chinese through the de facto standard Jieba segmentation library.

Current Behavior

It does not work when the website uses multiple Chinese variants (Simplified and Traditional); the build fails with

Tried to build search index for language zh-cn which is not supported

The root cause is that the upstream implementation treats Chinese as a single language (its segmenter handles the common variants), so the only language code it accepts is zh.

Steps to reproduce

Put the following into config.toml:

# The URL the site will be built for
base_url = "https://example.com"

# Whether to automatically compile all Sass files in the sass directory
compile_sass = true

# Whether to do syntax highlighting
# Theme can be customized by setting the `highlight_theme` variable to a theme supported by Zola
highlight_code = false

# Whether to build a search index to be used later on by a JavaScript library
build_search_index = true

theme = "book"

languages = [
  {code = "zh-cn", search = true},
  {code = "zh-tw", search = true},
]

[extra]
# Put all your custom variables here

Extra notes

I didn't open a PR since this may need some discussion, as I can see Zola strives for clean implementations everywhere.

My suggested solution would be having extra handling in https://github.com/getzola/zola/blob/97e772868d8892874cfb825c1acd789d9ad725f3/components/search/src/lib.rs#L32-L48.

The extra handling could simply strip the variant suffix, e.g. zh-cn => zh.

If you think the extra string manipulation would hurt performance, it could be gated behind a feature switch; and if you think this is better resolved upstream, I can open an issue on the upstream repository as well.
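
A minimal sketch of what that extra handling could look like, assuming the search component resolves languages via elasticlunr's `Language::from_code` (the helper name `resolve_language` below is hypothetical, not Zola's actual API):

```rust
use elasticlunr::Language;

/// Map a site language code to an elasticlunr `Language`, falling back to the
/// bare primary subtag when the full code (e.g. "zh-cn") is not recognised.
fn resolve_language(code: &str) -> Option<Language> {
    Language::from_code(code).or_else(|| {
        // Strip a regional/script suffix: "zh-cn" -> "zh", "zh_TW" -> "zh"
        let primary = code.split(|c| c == '-' || c == '_').next()?;
        Language::from_code(primary)
    })
}
```

With that in place, both `zh-cn` and `zh-tw` would resolve to the same Chinese tokenizer (assuming elasticlunr-rs is built with its Chinese support enabled), while genuinely unsupported codes still return `None` and trigger the existing error.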

Keats commented 4 years ago

Sadly, I actually had to remove Chinese & Japanese support for search index generation in the next branch, as it was inflating the binary size a lot (or failing to build at all due to requiring too much RAM). I'm happy to turn them back on if we find a way to not end up with an 80 MB+ binary, though.

liushuyu commented 4 years ago

> Sadly, I actually had to remove Chinese & Japanese support for search index generation in the next branch, as it was inflating the binary size a lot (or failing to build at all due to requiring too much RAM). I'm happy to turn them back on if we find a way to not end up with an 80 MB+ binary, though.

Sorry for the late reply; I have done some testing just now. It seems the lindera crate used by elasticlunr-rs embeds a ~70 MB data file (I guess it is either a trained model or a vocabulary), while the Chinese segmentation library Jieba only takes up ~5 MB in a release build.

I think in this case you can gate them behind a feature switch, so that anyone who wants search indexing for those languages can easily build a version with that support enabled.
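
For illustration, a rough sketch of how such a gate could look on the Rust side; the cargo feature name `heavy-lang` is made up here, and it would also need to forward the corresponding language features of elasticlunr-rs in Cargo.toml:

```rust
use elasticlunr::Language;

// Hypothetical "heavy-lang" cargo feature: only builds that opt in pull in the
// large CJK dictionaries (lindera for Japanese, jieba for Chinese).
#[cfg(feature = "heavy-lang")]
fn cjk_supported() -> bool {
    true
}

#[cfg(not(feature = "heavy-lang"))]
fn cjk_supported() -> bool {
    false
}

fn language_for(code: &str) -> Result<Language, String> {
    if (code.starts_with("zh") || code.starts_with("ja")) && !cjk_supported() {
        return Err(format!(
            "Building a search index for `{}` requires a Zola build with the `heavy-lang` feature",
            code
        ));
    }
    Language::from_code(code).ok_or_else(|| {
        format!("Tried to build search index for language {} which is not supported", code)
    })
}
```

Users who need Chinese or Japanese indexing could then build with `cargo build --release --features heavy-lang`, while the default binary stays small.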

Keats commented 4 years ago

> I think in this case you can gate them behind a feature switch, so that anyone who wants search indexing for those languages can easily build a version with that support enabled.

Can you do a PR, by any chance? I'd happily take one.