Search improvement - Githubissues

Keats commented 4 years ago

See https://zola.discourse.group/t/search-improvement/344/2

We still need to discuss which improvements make sense, but imagine you have a site with thousands of pages: adding all of that to the search index will result in a huge JS file that is not usable. Being able to select which fields are indexed would help, for example.
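
To make the size concern concrete, here is a minimal Python sketch of how restricting the indexed fields shrinks the payload a browser has to download. It is not Zola's implementation; the page records and the field names (`title`, `description`, `body`) are made up for illustration.

```python
import json

# Hypothetical page records; a real site would load these from its content.
PAGES = [
    {"url": "/a/", "title": "First post", "description": "Intro", "body": "long text " * 200},
    {"url": "/b/", "title": "Second post", "description": "Follow-up", "body": "more text " * 200},
]


def build_documents(pages, fields):
    """Keep only the URL plus the selected fields of each page."""
    return [{"url": p["url"], **{f: p[f] for f in fields}} for p in pages]


def serialized_size(docs):
    """Size in bytes of the JSON payload a browser would have to download."""
    return len(json.dumps(docs).encode("utf-8"))


full = build_documents(PAGES, ["title", "description", "body"])
slim = build_documents(PAGES, ["title", "description"])
print(serialized_size(full), serialized_size(slim))
```

Dropping the body is the extreme case; the same idea applies to any subset of fields a site owner decides is worth searching.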

jasikpark commented 4 years ago

might look at https://github.com/jameslittle230/stork

i only recommend it because it looks cool and #rust

but it also seems promising!!


TheSirC commented 4 years ago

Would Tantivy be a possibility for the search?

It is what powers the search of lib.rs. It is written in Rust and might allow easy integration and, it seems, pretty good performance.

As far as my search goes, the website seems to implement its search logic here, if that can help for reference.

jasikpark commented 4 years ago

It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

— Tantivy

Stork is built with Rust, and the Javascript library uses WebAssembly behind the scenes. It’s built with content creators in mind, in that it requires little-to-no code to get started and can be extended deeply. It’s perfect for JAMstack sites and personal blogs, but can be used wherever you need a search interface.

— Stork

Based on the goals of the projects, I feel like stork might be more user-friendly for static sites, though Tantivy might offer better opportunity to integrate with Zola +Tera templates?

Idk what people really need in terms of customization though.


Keats commented 4 years ago

I've received an email from someone who has a fork with Tantivy for search with Zola, but I don't have their GH handle, if they have one :(

To decide which search engine/lib to use, the main thing to test is how good the results are. I don't really care whether it's in Rust or JS like right now, as long as the results are decent and you can do a usable search input with a little bit of JS.

Th3Whit3Wolf commented 4 years ago

@Keats I found them here.

Edit: Here's his article if anyone was interested in it. Another Edit: update first link from his github to self hosted gitea which has more commits.

Th3Whit3Wolf commented 4 years ago

Does there have to be only one option? Could a user choose which one through a flag, config.toml, or even a build feature?

Tantivy and stork look like they have two different but very valid use cases for a workflow that includes zola.

Keats commented 4 years ago

Tantivy requires a backend, so the output becomes something different from a static site. Zola is an SSG so it won't get an official option that is not completely static.

Th3Whit3Wolf commented 4 years ago

Could zola support generic index building? With Tantivy you can load a schema from JSON, so if zola could take a config file or option where a user describes what information should be indexed, and output a JSON file with that information, that would be super helpful. Similarly, stork asks you to store that information in a TOML file and builds an index from it. It would be nice if you could also optionally ask zola to run some command after it indexes on build.

Stork asks that you run . . .

stork --build index.toml

to build the index. Obviously, since stork is written in Rust, you could pull in some of its source code to build it as part of zola, but I think having the ability to run a single shell command would be more flexible for a wider array of search engines.
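
A rough Python sketch of that idea, assuming a hypothetical list of page records and a hypothetical post-index hook: export the chosen fields to JSON, then run one external command. Neither function corresponds to an existing zola option. The hook here runs the Python interpreter so the example works anywhere; with stork installed it could be the command quoted above.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def export_index(pages, fields, out_path):
    """Write only the selected fields of every page to a JSON file."""
    docs = [{f: page.get(f, "") for f in fields} for page in pages]
    Path(out_path).write_text(json.dumps(docs, indent=2), encoding="utf-8")
    return docs


def run_post_index_command(args):
    """Run a user-supplied command (list of arguments); raise if it fails."""
    return subprocess.run(args, check=True, capture_output=True, text=True)


# Hypothetical page data standing in for a rendered site.
pages = [{"title": "Hello", "url": "/hello/", "body": "Hello world", "draft": False}]
out_file = Path(tempfile.mkdtemp()) / "search_documents.json"
docs = export_index(pages, ["title", "url", "body"], out_file)

# Stand-in for an external indexer such as ["stork", "--build", "index.toml"].
result = run_post_index_command([sys.executable, "-c", "print('index built')"])
print(result.stdout.strip())
```

Passing the command as a list of arguments avoids shell quoting issues, and `check=True` makes a failing indexer fail the build instead of passing silently.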

Keats commented 4 years ago

Another request: https://github.com/getzola/zola/issues/975

Th3Whit3Wolf commented 4 years ago

I made a fork that is based off of zola 10.1 with the changes made by Jonathon Strong here that allow zola to build a tantivy index.

It adds a new subcommand index that takes an index type -t that can be either elasticlunr or tantivy and optionally an output directory for the index (defaults to public).

Here is the output of zola index -h

Create a search index as a stand-alone task, and with additional options

USAGE:
    zola-tantivy index [FLAGS] [OPTIONS] --index-type <index_type>

FLAGS:
        --drafts     Include drafts when loading the site
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -t, --index-type <index_type>    what kind of search index to build [possible values: elasticlunr, tantivy]
    -o, --output-dir <output_dir>    Outputs the generated search index files into the provided dir. Note: Tantivy
                                     indexing produces a directory instead of a file, which will be located at output-
                                     dir/tantivy-index [default: public]

I know @Keats said

Tantivy requires a backend, so the output becomes something different from a static site. Zola is an SSG so it won't get an official option that is not completely static.

So I was unsure if this pull request would be wanted, but I have been using Jonathon Strong's fork for a while and it feels rude to not send a pull request.

Please let me know what you think.

As a side note I have also made a server, you can check it out here, that can:

- query a tantivy index
- send the user to a random page by reading the sitemap zola produces

It uses Actix-web for the web server and Tera for templates so it's pretty easy to make templates based off your zola templates.

For anyone that is self hosting their zola website and wants to use tantivy for their search.
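
For illustration, the random-page part can be sketched in a few lines of Python. This is not the code of the server mentioned above (which uses Actix-web); it only assumes a sitemap in the standard sitemaps.org format, and the example URLs are placeholders.

```python
import random
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def random_page(xml_text, rng=random):
    """Pick one URL at random; a server would answer with a redirect to it."""
    urls = sitemap_urls(xml_text)
    if not urls:
        raise ValueError("sitemap contains no URLs")
    return rng.choice(urls)


# Placeholder sitemap; a real one would be read from the site's output directory.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>"""

print(random_page(EXAMPLE))
```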

Keats commented 4 years ago

It can be mentioned in the docs but won't be accepted as a PR

maxdobeck commented 4 years ago

Lunr has worked well for me in the past. https://lunrjs.com/ Seems to be maintained still but hard to tell

Keats commented 4 years ago

Zola uses elasticlunr which is based on lunr

thomasantony commented 1 year ago

Any chance this could be used to build a search system for zola?

https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/

Jieiku commented 1 year ago

I am not sure, but the preface of that article is that you either maintain a server or download the WHOLE dataset.

In abridge I implemented:

elasticlunr(zola default): https://abridge.netlify.app/

stork: https://jieiku.github.io/abridge-stork/

tinysearch: https://jieiku.github.io/abridge-tinysearch/

Which was discussed here: https://github.com/Jieiku/abridge/issues/41

One user pointed out pagefind. I found the way this one works interesting because it does NOT download the entire dataset; rather, it downloads a chunk relevant to your search term, which of course could save bandwidth on an enormous dataset.

I bookmarked the sqlite database article you linked because it is interesting, but my blog's dataset is nowhere near large enough to worry about it yet.
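
A toy Python sketch of the "download only the relevant chunk" idea, to show why it saves bandwidth. This is not Pagefind's actual index format: the two-letter prefix sharding, the `ShardedSearch` class, and the documents are all assumptions made up for the example.

```python
from collections import defaultdict

PREFIX_LEN = 2  # Assumed shard granularity: terms sharing two letters share a shard.


def shard_key(term):
    return term[:PREFIX_LEN].lower()


def build_shards(documents):
    """Map shard key -> {term -> sorted list of document ids}."""
    shards = defaultdict(lambda: defaultdict(set))
    for doc_id, text in documents.items():
        for term in text.lower().split():
            shards[shard_key(term)][term].add(doc_id)
    return {k: {t: sorted(ids) for t, ids in terms.items()} for k, terms in shards.items()}


class ShardedSearch:
    """Client that loads a shard only when a query needs it."""

    def __init__(self, shards):
        self._remote = shards  # Stands in for shard files on a static host.
        self.loaded = {}       # Shards "downloaded" so far.

    def search(self, term):
        key = shard_key(term)
        if key not in self.loaded:
            self.loaded[key] = self._remote.get(key, {})
        return self.loaded[key].get(term.lower(), [])


documents = {
    "/zola/": "zola static site generator",
    "/search/": "search index for zola",
    "/stork/": "stork search",
}
shards = build_shards(documents)
client = ShardedSearch(shards)
print(client.search("zola"), len(client.loaded), "of", len(shards), "shards loaded")
```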

Jieiku commented 1 year ago

I almost forgot: I also made a JS facade for the search. It stops the index from being downloaded until a user clicks into the search box. It was discussed here: https://github.com/Jieiku/abridge/issues/81

thomasantony commented 1 year ago

I am not sure, but the preface of that article is that you either maintain a server or download the WHOLE dataset.

That's what they are trying to avoid in the article. I believe the method that they use involves directly querying a SQLite database hosted on a static host from the browser. They use a WASM implementation of SQLite with some kind of chunked data access over HTTP to do it. They have live examples in that article where you can query a ~600 MB database with just kilobytes of data transfer.

To quote the article:

The total amount of data in the indicator_search FTS table is around 8 MByte. The above query should only fetch around 70 KiB. You can see how it is constructed here.
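
The general mechanism can be sketched in Python as a reader that fetches fixed-size pages on demand and caches them. This only illustrates chunked access and is not the article's implementation; the `RangeReader` class, the 4096-byte page size, and the in-memory stand-in for the remote file are assumptions. Over HTTP, `fetch_range` would send a `Range: bytes=start-end` request header.

```python
class RangeReader:
    """Reads fixed-size pages on demand and caches them."""

    def __init__(self, fetch_range, page_size=4096):
        self.fetch_range = fetch_range  # callable(start, end_inclusive) -> bytes
        self.page_size = page_size
        self.cache = {}
        self.bytes_fetched = 0

    def _page(self, number):
        if number not in self.cache:
            start = number * self.page_size
            data = self.fetch_range(start, start + self.page_size - 1)
            self.bytes_fetched += len(data)
            self.cache[number] = data
        return self.cache[number]

    def read(self, offset, length):
        """Return `length` bytes at `offset`, fetching only the pages that cover them."""
        if length <= 0:
            return b""
        first = offset // self.page_size
        last = (offset + length - 1) // self.page_size
        blob = b"".join(self._page(n) for n in range(first, last + 1))
        skip = offset - first * self.page_size
        return blob[skip:skip + length]


# In-memory stand-in for a 1 MiB file on a static host.
REMOTE = bytes(range(256)) * 4096


def fake_fetch(start, end):
    return REMOTE[start:end + 1]


reader = RangeReader(fake_fetch)
chunk = reader.read(500_000, 100)
print(len(chunk), reader.bytes_fetched, len(REMOTE))
```

A database engine that reads its file page by page can sit on top of a reader like this, which is why a query touching a few pages transfers only a small fraction of the file.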

Jieiku commented 1 year ago

Pagefind also does some kind of chunked data access.

https://pagefind.app/ demo: https://xkcd.pagefind.app/

edit, abridge pagefind integration demo: https://abridge-pagefind.pages.dev/

I have no idea whether Pagefind or the sqlite method you linked makes more efficient use of bandwidth. If somebody implemented both and compared them, I would find that a very interesting read.

gedw99 commented 6 months ago

SurrealDB could be a good option because it can run as wasm in the browser or on a server.

https://github.com/surrealdb/surrealdb.wasm

It has full text search built in. It's a fork of tatty I believe. https://docs.surrealdb.com/docs/reference-guide/full-text-search/

Jieiku commented 6 months ago

@gedw99 that is seriously cool!

gedw99 commented 6 months ago

@gedw99 that is seriously cool!

thanks. It's not bad is it. It's got lots more too. AI Vector engine, real time sync, rpc, etc.

getzola / zola

Search improvement #961