Jieiku / abridge

Fast & Lightweight Zola Theme
https://abridge.pages.dev/
MIT License

Investigate additional Search Engines. #178

Open Jieiku opened 4 days ago

Jieiku commented 4 days ago

elasticlunr was the first implemented in abridge because index generation was directly supported by Zola.

elasticlunr also supports CJK, stemmers, and stop words, so it is a good solution for a wide range of people.

I then implemented both tinysearch and stork; the demos are here:

https://jieiku.github.io/abridge-tinysearch/

https://jieiku.github.io/abridge-stork/

Those demos are static builds from an older version of abridge. I lost interest in stork because it actually used more bandwidth than elasticlunr. I am, however, interested in getting tinysearch working again.

Zola now supports building a json based index:

https://github.com/getzola/zola/pull/2507

https://www.getzola.org/documentation/content/search/#fuse

I think I may have looked at flexsearch, but it has been a while and I cannot remember all the details. Another one I am interested in is pagefind: https://github.com/CloudCannon/pagefind/issues/277

I opened a new issue at tinysearch, as I don't have time to work on it at the moment: https://github.com/tinysearch/tinysearch/issues/178

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,FlexSearch,Elasticlunr,Fuse&search=super%20ma

After looking here, it looks like flexsearch supports stemmers and CJK: https://github.com/nextapps-de/flexsearch/issues/51

I am not sure how automatic that support is, but it looks like flexsearch is worth looking into.

Hysterelius commented 4 days ago

I have been having a look at FlexSearch especially due to its faster speeds and lower bundle size, but I am still working out how the index is constructed and whether Zola could support it.

If that doesn't work, I am happy to try and fix tinysearch, but I have limited time currently so it might take me a while either way.

Jieiku commented 4 days ago

flexsearch is a smaller script, 5.87 kB instead of 18.05 kB for elasticlunr.

However, elasticlunr loads faster and uses less memory, which might be important on mobile devices.

The author of uFuzzy set up the benchmarks, and he admits that he may not have the other search engines tuned perfectly, but says he made a best effort when documentation was available. You can see the full table at the bottom of this GitHub page: https://github.com/leeoniya/uFuzzy

On the flip side, flexsearch performs a search in 3ms and elasticlunr takes 14ms, but I think this difference is inconsequential compared to the enormous difference in load speed and memory usage.

elasticlunr uses 89 MB and takes 978ms to load.

flexsearch uses 336 MB and takes 3,088ms to load.
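To put those figures side by side, here is a small sketch that computes the ratios from the numbers quoted above. Nothing here is measured anew, and the helper name `ratio` is only for illustration.

```javascript
// Figures quoted above from the uFuzzy comparison table.
const elasticlunr = { memoryMB: 89, loadMs: 978, searchMs: 14 };
const flexsearch = { memoryMB: 336, loadMs: 3088, searchMs: 3 };

// Hypothetical helper: how many times larger a is than b, to one decimal place.
function ratio(a, b) {
  return (a / b).toFixed(1);
}

const memoryRatio = ratio(flexsearch.memoryMB, elasticlunr.memoryMB);
const loadRatio = ratio(flexsearch.loadMs, elasticlunr.loadMs);
const searchRatio = ratio(elasticlunr.searchMs, flexsearch.searchMs);

console.log(`flexsearch memory: ${memoryRatio}x elasticlunr`);
console.log(`flexsearch load time: ${loadRatio}x elasticlunr`);
console.log(`elasticlunr search time: ${searchRatio}x flexsearch`);
```

In that benchmark, the roughly 11 ms saved per search comes with about 2.1 seconds of extra load time.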

here is elasticlunr: 2024-07-01_23-30-37

here is flexsearch: 2024-07-01_23-34-06

Hysterelius commented 4 days ago

Yeah, it is probably a bit pointless chasing after those extra milliseconds, especially on a static site, and even more so if it increases load times.

Jieiku commented 4 days ago

I have been on the lookout for something that is better across the board (mostly because elasticlunr is no longer maintained). pagefind is interesting because it chunks the index, so if you had a really enormous site with lots of articles, it would not download the entire index, only the chunks relevant to your search. I think that until a site is pretty big, elasticlunr would actually perform better than pagefind, but that is just my hunch without having actually tried pagefind.

Jieiku commented 4 days ago

All of that said, I like the idea of Abridge being flexible, so if anyone wants to submit a pull request to support a given search engine, I would likely add it as long as it didn't cause an unavoidable problem.

Hysterelius commented 4 days ago

I was looking for a search engine that supports fuzzy matching (elasticlunr, tinysearch and stork all don't seem to), to help provide a better user experience. That being said, pagefind looks to support this (and also supports indexing other files, like PDFs, which I have also been looking for), and I might have a go at seeing what I can do to support it.

I'm not sure how well I'll go with chunking and manipulating the index - but I'll give it a try :)

Hysterelius commented 4 days ago

Just checking what you think of this flow for the search process.

I think it would have to pass through a node script during the build step, which I guess is already done in the process for elasticlunr.

[image: diagram of the proposed search flow]

I don't know quite yet how to support multilingual, but at least pagefind seems to support it out of the box.

Jieiku commented 3 days ago

Yes, that is exactly what I was thinking. I believe that if we configure Zola to output a json index, we should be able to send it to pagefind through their node API using this method:

https://pagefind.app/docs/node-api/#indexaddcustomrecord

If the json format that zola outputs is missing some things we need, then maybe we can look at the recently added fuse json output and create a pull request to zola adding an additional json index format that is compatible with pagefind.

Once the pagefind index is built I am guessing we would use this to save it:

https://pagefind.app/docs/node-api/#indexwritefiles

Hysterelius commented 3 days ago

It is pretty simple to construct an index; however, it does seem to have a heavy reliance on node (I tried to package it using esbuild, and it didn't like it).

I wrote this script based on the information in https://github.com/CloudCannon/pagefind/issues/277 and I was just wondering: where does zola put this intermediate data?

import * as pagefind from 'pagefind';

async function createIndex() {
    // Create a new Pagefind index
    const { index } = await pagefind.createIndex({
        forceLanguage: 'en', // Force the language to English
    });

    // Define your data
    const data = [{
        "title": "Abridge Zola Theme",
        "url": "https://abridge.netlify.app/overview-abridge/",
        "meta": "Abridge is a fast and lightweight Zola theme using semantic html, only ~6kb css before svg icons and syntax highlighting, no mandatory JS, and perfect…",
        "body": "Abridge is a fast and lightweight Zola theme using semantic html..."
    }, {
        "title": "Code Blocks and Themes",
        "url": "https://abridge.netlify.app/overview-code-blocks/",
        "meta": "This article shows various Code Blocks allowing to easily compare sublime themes.\n",
        "body": "This article shows various Code Blocks allowing to easily compare sublime themes..."
    }, {
        "title": "Markdown and Style Guide",
        "url": "https://abridge.netlify.app/overview-markdown-and-style/",
        "meta": "This article offers a sample of basic Markdown syntax that can be used in Zola content files, also it shows if basic HTML elements are decorated with …",
        "body": "This article offers a sample of basic Markdown syntax that can be used in Zola content files, also it shows if basic HTML elements are decorated with CSS in a Zola theme..."
    }, {
        "title": "Image Shortcodes",
        "url": "https://abridge.netlify.app/overview-images/",
        "meta": "This post covers the imghover and img shortcodes. Images can also be embedded directly using markdown ![Ferris](ferris.svg), but it is better to use a…",
        "body": "This post covers the imghover and img shortcodes. Images can also be embedded directly using markdown..."
    }, {
        "title": "Rich Content",
        "url": "https://abridge.netlify.app/overview-rich-content/",
        "meta": "Several custom shortcodes are included to augment CommonMark (courtesy of d3c3nt theme), in addition to those already provided by Zola. video, image, …",
        "body": "Several custom shortcodes are included to augment CommonMark (courtesy of d3c3nt theme), in addition to those already provided by Zola. video, image, gif,..."
    }, {
        "title": "Embedded Youtube Videos",
        "url": "https://abridge.netlify.app/overview-embedded-youtube/",
        "meta": "Zola has many shortcodes, and new are easily added, this example shows youtube.\n",
        "body": "Zola has many shortcodes, and new are easily added, this example shows youtube.\nYoutube\nwith yt(id=\"the_id_here\")\n\nid: the video id (mandatory)\nplaylist: the playlist id (optional)\nclass: a class to add to the <div> surrounding the iframe (optional)\nautoplay: when set to \"true\", the video autoplays on load (optional)\ntitle - set alt title for the iframe (optional, defaults to \"Youtube\")\ncookie - set to \"true\" if you want tracking cookies, otherwise it defaults to false.\n\n\n\t\n\n"
    }, {
        "title": "Embedded Vimeo Videos",
        "url": "https://abridge.netlify.app/overview-embedded-vimeo/",
        "meta": "Zola has many shortcodes, and new are easily added, this example shows vimeo.\n",
        "body": "Zola has many shortcodes, and new are easily added, this example shows vimeo.\nVimeo\nwith vm(id=\"id_here\")\n\nid: the video id (mandatory)\nclass: a class to add to the <div> surrounding the iframe (optional)\nautoplay: when set to \"true\", the video autoplays on load (optional)\nloop: when set to \"true\", the video plays on a loop (optional)\nnoautopause: when set to \"true\", the video will not autopause (optional)\ntitle - set alt title for the iframe (optional, defaults to \"Vimeo\")\ncookie - set to \"true\" if you want tracking cookies, otherwise it defaults to false.\n\n\n\t\n\n"
    }, {
        "title": "Mathematical Notations",
        "url": "https://abridge.netlify.app/overview-math/",
        "meta": "You can use KaTeX to render mathematical notations.\nYou can enable the $\\KaTeX$ support globally, per-section or per-page basis.\n",
        "body": "You can use KaTeX to render mathematical notations.\nYou can enable the $\\KaTeX$ support globally, per-section or per-page basis.\nEnable..."
    }];

    // Add each record to the index
    for (const record of data) {
        await index.addCustomRecord({
            url: record.url,
            content: record.body,
            language: 'en',
            meta: {
                title: record.title,
                description: record.meta,
            }
        });
    }

    // Write the index files to disk
    await index.writeFiles({
        outputPath: 'public/pagefind'
    });

    console.log('Index created successfully!');
}

createIndex().catch(console.error);

Then search is even easier (it handles the chunking for you!). It could easily be put in a function like this:

  async function search() {
    const pagefind = await import("./public/pagefind/pagefind.js");
    pagefind.init();
    const search = await pagefind.search("zola");
    const oneResult = await search.results[0].data();
    console.log(oneResult);
  }

  search();
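Building on the snippet above, a rough sketch of loading the first few results and reducing them to what a results list needs might look like this. The helper names `summarizeResult` and `topResults` are made up for illustration, and the fields used (`url`, `excerpt`, `meta.title`) are assumed from the pagefind result data shape, so check them against the pagefind docs.

```javascript
// Turn one loaded pagefind result into a plain object for rendering.
// Falling back to the URL when there is no title is an illustrative choice.
function summarizeResult(data) {
  return {
    title: (data.meta && data.meta.title) || data.url,
    url: data.url,
    excerpt: data.excerpt || "",
  };
}

// Load at most `limit` results for a query.
// `pagefind` is the module returned by the dynamic import shown above;
// each result is only fetched when its data() method is called.
async function topResults(pagefind, query, limit = 5) {
  const search = await pagefind.search(query);
  const loaded = await Promise.all(
    search.results.slice(0, limit).map((result) => result.data())
  );
  return loaded.map(summarizeResult);
}
```

Limiting the number of `data()` calls keeps the number of fetched chunks small, which is the point of pagefind's chunked index.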

The index it spits out is pretty small, only 127B for that test data, yet pagefind itself is pretty big: ~32kB not including the WASM, giving a whole bundle size of around 100kB, which is massive compared to the elasticlunr bundle size (18kB, from what I could tell). But the searches are very fast :)

I am just a bit confused, as this would require users to rerun the index build process on every zola build to enable search, but is that what abridge is already designed to do?

Jieiku commented 3 days ago

That is what I was thinking, yet pagefind will still pull ahead for sites with a ton of content, because as you add content the elasticlunr index gets bigger and bigger.

Yes, every time you do a zola build, it generates a new index. That much is true of elasticlunr, tinysearch, stork, etc.

Hysterelius commented 3 days ago

> That is what I was thinking, yet pagefind will still pull ahead for sites with a ton of content, because as you add content the elasticlunr index gets bigger and bigger.

Does elasticlunr include all those lunr files I see in public/js? As that would lead to a dramatically different size.

And pagefind wasn't bundled, so I am guessing you could get some savings off that.

> yes, every time you do a zola build, it generates a new index, that much is true of elasticlunr, tinysearch, stork, etc.

Cool! I am just not quite sure how to hook into that.

Jieiku commented 3 days ago

I would configure your config.toml to output the json format:

https://www.getzola.org/documentation/content/search/#fuse

# config.toml
[search]
index_format = "fuse_json"

After you do that, just issue a zola build and take a look in the public folder for the json index. (Look it over to see if it will be compatible.)

Then, for the script that you wrote, you could wrap it up in a function within package_abridge.js and call your function after the first zola build. (zola build gets called twice within package_abridge.js: once to generate the index, and then, after minifying the js files, a second time to update the integrity hashes for the newly minified files.)

EDIT: Currently your function feeds static data (the const data = array) into pagefind for testing purposes. Instead, you would use a node module to parse the json index that zola outputs when you configure it as index_format = "fuse_json", and feed that to pagefind.

If you look in package_abridge.js, you will see that I check the values of many things in config.toml to handle logic. In config.toml you will find search_library = 'elasticlunr' under config.extra. We can update the readme for that section to also include search_library = 'pagefind', and when it is set that way we would call your function within package_abridge.js.
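The ordering described above could be sketched roughly as follows. This is not taken from package_abridge.js; `planBuild` and the step labels are hypothetical and only show where a pagefind step would slot in between the two zola build passes.

```javascript
// Hypothetical planner: returns the ordered build steps for a search library.
// The step names are labels for illustration, not real commands or functions.
function planBuild(searchLibrary) {
  const steps = ["zola build"]; // first pass: Zola writes the site and its search index

  if (searchLibrary === "pagefind") {
    // Extra step: feed Zola's json index to pagefind through its node API.
    steps.push("build pagefind index");
  }

  steps.push("minify js");  // minify/bundle the js files
  steps.push("zola build"); // second pass: refresh integrity hashes
  return steps;
}

console.log(planBuild("pagefind").join(" -> "));
```

One thing to check with this ordering: zola build deletes and recreates the output directory, so index files written into public during the first pass would need to be written somewhere that survives the second pass (for example under static, which is an assumption here), or be regenerated afterwards.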
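A minimal sketch of that conversion, assuming each entry in Zola's fuse_json output is an object with `url`, `title`, `description` and `body` fields (an assumption to verify against the generated file). The helper names are hypothetical; `index` is the object returned by `pagefind.createIndex()` in the script above.

```javascript
// Convert one entry of the (assumed) Zola fuse_json index into the record
// shape passed to index.addCustomRecord() in the script above.
function toPagefindRecord(item, language = "en") {
  const meta = { title: item.title || item.url };
  if (item.description) {
    meta.description = item.description;
  }
  return {
    url: item.url,
    content: item.body || "",
    language,
    meta,
  };
}

// Parse the json text of the index and convert every entry.
function recordsFromIndexJson(jsonText, language = "en") {
  const items = JSON.parse(jsonText);
  if (!Array.isArray(items)) {
    throw new TypeError("expected the index to be a JSON array of pages");
  }
  return items.map((item) => toPagefindRecord(item, language));
}

// Feed the converted records to a pagefind index, one at a time.
async function addRecords(index, records) {
  for (const record of records) {
    await index.addCustomRecord(record);
  }
}
```

The json text itself would come from reading the index file in the public folder, for example with `fs.readFileSync`; the exact file name depends on what Zola writes, so it is left out here.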
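As a rough illustration of that check, here is a regex-based lookup of `search_library` in the text of config.toml. It is only a sketch: `getSearchLibrary` is a hypothetical name, package_abridge.js may read the config differently, and a real TOML parser would be more robust.

```javascript
// Look up search_library in config.toml text.
// Lines starting with "#" are not matched, so commented-out values are ignored.
// The default of "elasticlunr" is an assumption for this sketch.
function getSearchLibrary(tomlText, fallback = "elasticlunr") {
  const match = tomlText.match(/^\s*search_library\s*=\s*['"]([^'"]+)['"]/m);
  return match ? match[1] : fallback;
}

const exampleConfig = [
  "[extra]",
  "# search_library = 'elasticlunr'",
  "search_library = 'pagefind'",
].join("\n");

if (getSearchLibrary(exampleConfig) === "pagefind") {
  console.log("pagefind selected: the pagefind index function would be called here");
}
```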

EDIT: elasticlunr does NOT load all those js files, those files are for other languages, so they only get used on pages that use other languages. (in google chrome you can press ctrl+shift+i and load the abridge demo and search for something in the searchbox and see exactly which files get loaded for elasticlunr)

EDIT: ah yes package_abridge.js can be used to minify and bundle any of the js files that pagefind uses and it should save some space.

2024-07-02_22-03-26

or in firefox, which I prefer:

2024-07-02_22-06-13

Hysterelius commented 1 day ago

I was just wondering if you think it is better for the users to install pagefind as a node package or do you want it bundled in static/js?

Jieiku commented 1 day ago

Because node is required to build the index, we might as well just have it as a dependency that gets installed as a node package.

Any javascript for the client side search can of course go into static/js, but anything related to building the index can just be installed as a node package.