getzola / zola

A fast static site generator in a single binary with everything built-in. https://www.getzola.org
MIT License
13.44k stars · 941 forks

Missing Cachebust when using index_format = "elasticlunr_json", language code is hard coded, and will not work from subdomain #2167

Open Jieiku opened 1 year ago

Jieiku commented 1 year ago

Bug Report

A new json index_format was added: https://github.com/getzola/zola/pull/1998

When using the new json format for the search index index_format = "elasticlunr_json" the cachebust is missing.

If you add new posts, repeat visitors may not have those posts in their index if the browser still has the old index cached.

The relevant line is 149 here: https://github.com/getzola/zola/blob/master/docs/static/search.js#L149

https://github.com/getzola/zola/blob/8ae4c623f24d3e7af14e3e94f92fcbcceb954bc5/docs/static/search.js#L147-L158

Because the index is fetched from within search.js, zola would need to write to that file to add the cachebust hash to the fetch line. I can think of some fairly simple ways to do this with a regex. Zola would need to know ahead of time which js file handles the search; for me this is always search.js at the root level.

Another issue that I thought about was that the language code is hard coded. One possible solution would be to have the search.js file check the language code from the page source <html lang="en-gb"> and then fetch the corresponding search index. I can probably submit a pull request for this later today if this sounds like a good solution.

Another issue is that this fetch line grabs the json index from the root. This will be an issue for sites that reside in a subdomain, eg: github.io/mysite, because the fetch will try to grab the resource from github.io/search_index.en.json when it should grab it from github.io/mysite/search_index.en.json

One way of resolving this is for the site to set the base meta tag and have search.js check that tag while forming the fetch url. I do exactly that for the old js index+search bundle: https://github.com/Jieiku/abridge/blob/master/static/search_facade.js (the same principle could be applied here)

It would require less js DOM access if we simply used the base_url defined in config.toml to form the fetch url. This would resolve the subdomain issue, and we could do the same with the language code (meaning don't do this in js; just have zola handle these values in addition to the cachebust).
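Putting the three pieces together (base url, language code, cachebust hash), the fetch url could be assembled like this. A minimal sketch: the helper name and all input values are placeholders for illustration, not real zola output.

```javascript
// Sketch: assemble the search-index URL from values zola would inject at
// build time. buildIndexUrl and its inputs are illustrative placeholders.
function buildIndexUrl(baseUrl, lang, hash) {
  return baseUrl + '/search_index.' + lang + '.json?h=' + hash;
}

console.log(buildIndexUrl('https://example.github.io/mysite', 'en', 'a6e47a0d'));
// → https://example.github.io/mysite/search_index.en.json?h=a6e47a0d
```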

Environment

Zola version: 0.17.1

Expected Behavior

A cachebust hash is added, and there is a way to support more than one language code.

Current Behavior

No cachebust, and a hard-coded language code.

Steps to reproduce

The search here can be used to reproduce: https://www.getzola.org/documentation/getting-started/overview/ I am also currently refactoring abridge and have it implemented there (the refactor branch is messy, still a work in progress): https://github.com/Jieiku/abridge/tree/refactor

Jieiku commented 1 year ago

I am motivated to work on this; just waiting for a little free time, hopefully within the next week or two.

Keats commented 1 year ago

For the cachebust, how do you see it working? I don't really want people to have to manually update the hash of the imported filename in the template every time they add/touch a post.

Jieiku commented 1 year ago

First add a couple variables to search.js:

var base_url="";
var sha256="";

These variables would be used as part of the fetch.

so instead of this:

index = fetch("/search_index.en.json")

we would have this:

index = fetch( base_url + '/search_index.' + lang + '.json?h=' + sha256 )

Next add a field to config.toml

add a field under the [search] section so zola knows which path/file to update the hash and base_url in:

[search]
search_js = "search.js"

Then, whenever we do a zola build or zola serve, zola can automatically update the two variables in the search.js file. This should be easy enough with a regex-based search/replace (implemented in rust as part of zola) that looks for the base_url and sha256 variables in the search_js file and updates them.
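The search/replace step described above can be sketched as follows. This is written in JavaScript purely for illustration; zola itself would implement the same idea in Rust (for example with the regex crate), and the variable layout matches the `var base_url=""; var sha256="";` proposal above.

```javascript
// Sketch of the build-time search/replace. Zola would do this in Rust;
// JavaScript is used here only to illustrate the regex substitution.
function injectBuildValues(source, baseUrl, hash) {
  return source
    .replace(/var base_url\s*=\s*"[^"]*";/, 'var base_url="' + baseUrl + '";')
    .replace(/var sha256\s*=\s*"[^"]*";/, 'var sha256="' + hash + '";');
}

const before = 'var base_url="";\nvar sha256="";';
console.log(injectBuildValues(before, 'https://www.getzola.org', 'a6e47a0d'));
// → var base_url="https://www.getzola.org";
//   var sha256="a6e47a0d";
```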

Technically there would be a different hash per language index, but I do not think that is important; we can just use the hash of the primary language index, because they all get generated at the same time anyway.

This solution would address both the base_url as well as the hash. As you may have noticed it does not address the language code.

I was thinking about this, and I think it might be better to grab the language code dynamically in js instead of at build time, because sites can be multilingual. This would not be difficult so long as the site in question has the language code in the opening html tag, as you're supposed to anyway.

<html lang="en-US">

The search.js would access the dom and look at the html tag for the language code, and just use a slice to grab the first 2 characters of the lang value, and use that value to load the appropriate index. (en, fr, etc...)

Otherwise, if a user comes in on the English version of the site and search.js gets loaded, then when they switch languages we would have to load a different search.js for each language. By accessing the DOM for the language code we can stick to a single search.js file.
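The DOM-based lookup described above amounts to a one-line slice. A minimal sketch, with the helper name invented for illustration:

```javascript
// Sketch: derive the index language from the value of <html lang="...">.
// The two-character slice turns "en-US" or "en-gb" into "en"; fall back
// to "en" when the attribute is missing or empty.
function indexLang(htmlLang) {
  return (htmlLang || 'en').slice(0, 2);
}

// In the browser this would be fed the document's lang attribute:
//   var lang = indexLang(document.documentElement.lang);
console.log(indexLang('en-US')); // → en
console.log(indexLang('fr'));    // → fr
console.log(indexLang(''));      // → en
```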

I can update the Docs, etc as well when it is implemented.

I have become a lot more comfortable with rust, but I still have to refer to the rust docs from time to time, especially for code that I did not write myself, because it often uses features I am less familiar with since I am still new to rust.

Questions

I could definitely use feedback on the language code.

Do you think it's better to have a single search.js that determines the language code from the DOM and loads the index accordingly?

Or would it be better to have a separate search.js file per language, where each search.js file loads its own language's json index and hash?

Keats commented 1 year ago

Looking at the proposed solution, I think it's almost better to have people copy the hash manually, to be honest. An alternative would be to always cachebust by appending a random string as a query param to the url loading the index, but that's not super efficient. In practice the index is big enough that you probably only want to use it for sites small enough that the index would not be too big.

Jieiku commented 1 year ago

What about the proposed solution do you not like? Is it that I am proposing we update values in an existing js file?

(I don't like the idea of using a random string at all, waste of bandwidth, I would never use it if it did that.)

Also, I was considering eventually implementing pagefind; I think it might be a good solution for larger sites.

demo: https://pagefind.app/ video: https://www.youtube.com/watch?v=74lsEXqRQys

edit: Apparently pagefind runs after the SSG builds the site, so it would be handled independently of zola. I am not sure what this will entail in practice, but I am interested in trying it with abridge; once I do I will document the steps.

I think for any site under 1,000 posts it's probably better to use elasticlunr or tinysearch, because then the entire index is loaded and search is instantaneous, but once a site gets to a certain size I think pagefind would make a lot of sense.

Jieiku commented 1 year ago

I thought of another solution; it would involve being able to create a template for a javascript file, and having zola write out the javascript file.

So basically you would create ~/zola_theme/templates/search.js. The bulk of your code would be in there as essentially static code; the only thing zola would need to do is update the fetch line, which should be easy (use the built-in get_url function).

so in the search.js template file, instead of this:

index = fetch("/search_index.en.json")

we would have this:

index = fetch('{{ get_url(path="search_index.en.json", trailing_slash=false, cachebust=true) | safe }}')

This is assuming zola can create a javascript file from a template as easily as it's able to create an html file from a template...

One obvious catch with this is that Zola would need to build the html pages and json index first, then the js files. (at least this is what I am assuming, in order to get the hash of the json index)

edit1: This would have a secondary benefit as well... Some sites will use a facade to load js on demand, when it is actually needed. which is what I do here: https://github.com/Jieiku/abridge/blob/master/static/search_facade.js

If Zola was able to create js files from templates then you could actually get the hash of Other js files for doing on demand loading like this. (currently I have been using npm to update these hashes, which is less than ideal)

edit2: Oh my.... This would have a third really NEAT benefit as well... you could literally structure a bundle js file, AS A JS TEMPLATE... (Remember I asked about bundling js files in the past?)

In a JS template we could check config.toml for feature flags to determine whether or not to include a block of js code in the output js file..... It would not be as compact as what you get from uglifyjs but it would work.

Keats commented 1 year ago

Bundling JS is out of scope; there's a lot involved. We can easily template a JS file though. It's a bit weird to run search on the generated output of an SSG the way pagefind does; you might want to fine-tune what you're including rather than just looking at the HTML.

I'm still thinking about how to handle that; it's not easy!

bglw commented 1 year ago

👋 Popping in to drop a few thoughts!

edit: Apparently pagefind runs after the SSG builds the site... so it would be handled independently of zola, I am not sure what this will entail in practice, but I am interested in trying it with abridge, once I do I will document the steps.

One thing I'm working on right now is a Node.js API which can take in raw content or files and build an index, which allows Pagefind to be integrated into the development server of SSGs. It also allows you to pass direct records in, rather than indexing HTML. Since Pagefind is a binary under the hood, this is actually a generic stdio/out communication system that could be re-implemented fairly trivially from any language.

But also, Pagefind is Rust-based, so it's totally within reason to expose a lib interface for other Rust packages 👀

I think for any site under 1,000 posts its probably better to use elasticlunr or tinysearch, because then the entire index is loaded, and so search is instantaneous, but once a site gets to a certain point in size I think pagefind would make a lot of sense.

Pagefind should ideally expose a pagefind.loadAll() function that you can call if you think you have a reasonable amount of content to load up front, at which point search would be instant 🤔

Jieiku commented 1 year ago

We can easily template a JS file though.

That is Awesome! I want to try it out! (I am going to work on this in a couple days, it would simplify Abridge so very much)

But also, Pagefind is Rust-based, so it's totally within reason to expose a lib interface for other Rust packages

That sounds Awesome!

Pagefind should ideally expose a pagefind.loadAll() function that you can call if you think you have a reasonable amount of content to load up front, at which point search would be instant

Sounds like a great idea! I hope pagefind would also have a buildAll, so that it builds a single chunk. The reason is that if a site is still reasonably small and you have decided to fetch the entire index, then in my opinion it is best for the index to be a single chunk; this reduces the number of server requests to 1. Basically, if my index is 400kb or less I would rather fetch a single index file; if it ever grows beyond that I would rather chunk it.

Keats commented 1 year ago

We can template it but not right now, since Zola is only loading .html files.

Jieiku commented 1 year ago

Yes, I understood that; I am planning to add the ability to load js files, or at least try to, unless you beat me to it.

Jieiku commented 1 year ago

I just tested this using a .html and a .md file; the output was exactly correct other than having a .html file extension.

I performed a diff on the original search.js and the one that zola generated from a template, and it was a perfect match except that it now has the cachebust hash, which was generated without issue.

I am going to look at the Zola code now and try to find the spot where it loads .html file templates and see if I can add js as a valid template extension.

Keats commented 1 year ago

That's an easy change, but we need to think about templated files like that and see how other SSGs handle it. Right now the templates dir doesn't map to an output file, so it shouldn't be in the same folder, but even then I'm still wondering if that's the right approach. We could, e.g., generate a hash for the current content and tell users to assign it to window in their base template; then they can use that variable for cachebusting in JS without adding any new concepts to Zola.
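The window-variable approach suggested above could look something like the following sketch. The variable name `window.search_cachebust` and the helper are invented for illustration; the base template would emit something like `<script>window.search_cachebust = "...";</script>` and search.js would read it when building the index url.

```javascript
// Hypothetical sketch of the window-variable approach. Nothing here is a
// real zola API; the hash would be assigned to window by the base template.
function cachebustedUrl(url, hash) {
  // Append the hash as a query param only when one was provided.
  return hash ? url + '?h=' + hash : url;
}

// In the browser: cachebustedUrl('/search_index.en.json', window.search_cachebust)
console.log(cachebustedUrl('/search_index.en.json', 'a6e47a0d'));
// → /search_index.en.json?h=a6e47a0d
console.log(cachebustedUrl('/search_index.en.json', ''));
// → /search_index.en.json
```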

Jieiku commented 1 year ago

It seems Hugo has a feature called js.Build... but it seems a lot more complicated than it needs to be; possibly they are trying to do more than simply creating a file from a template: https://gohugo.io/hugo-pipes/js/

Hugo can create a js file from a template using ExecuteAsTemplate: https://gohugo.io/hugo-pipes/resource-from-template/ (I have never tried it; it's been years since I used Hugo, and back then I did not write much javascript code)

It looks like Eleventy/11ty can template a js file. I have never used this tool before, so the documentation is not very clear to me, but here is the page: https://www.11ty.dev/docs/languages/javascript/

It appears they name their templates *.11ty.js (maybe this is a way to differentiate between a regular js file and one that is meant to be a template?)

Jieiku commented 1 year ago

I tried implementing this: https://github.com/Jieiku/zola/commit/995a0d39ac96b7a93ea2a8f52862062c2c8775ce

but now I am trying to figure out how to handle the chicken-and-egg dilemma.

I setup a couple test sites on this repo, for more info check the README

but basically the site builds and the search.js is created properly with the cachebust... unless I try to reference search.js in a template, eg in the head of my index with get_url: https://github.com/Jieiku/zola-test-sites/blob/18ccc95248fe68bb6b329d14a06e29bb6f4de80f/fails/templates/base.html#L6

Error: Failed to build the site
Error: Failed to render section '/home/jieiku/zdev/zola-test-sites/fails/content/_index.md'
Error: Reason: Failed to render 'index.html' (error happened in 'base.html').
Error: Reason: Function call 'get_url' failed
Error: Reason: `get_url`: Could not find or open file search.js

I am going to give this more thought, I am not yet sure how best to resolve this.

(This is only the second time I have worked with rust code, the first time was when I made a small change to the tinysearch library, help from anyone is appreciated if you think you have an idea.)

Keats commented 1 year ago

You'll need to update the paths where get_url looks: grep for search_for_file. I'm still not sure it's the right way though.

Jieiku commented 1 year ago

I would have multiple uses for this feature beyond the cachebust for search.js, so if you decide to do something different I could at least continue to use my own fork.

Appreciate the tip on what to look for, going to have time to work on this again tomorrow :)

Keats commented 1 year ago

I would have multiple uses for this feature beyond the cachebust for the search.js

Can you describe them?

Jieiku commented 1 year ago

These are just the ones I could immediately make use of; there could also be other benefits to being able to use Tera templates for JS or JSON files that I am not yet thinking of...

First reason is to be able to add the cachebust, baseurl, and language when loading the json search index

https://github.com/Jieiku/zola-test-sites/blob/main/works/file_templates/search.js#L132

Second reason is to be able to generate json data that can later be consumed by other search engine tools to create their index (tinysearch, stork, etc.).

I can allow not only js, but also json. Tinysearch builds its search index from a json file.

Currently, because Zola can output html, I basically dump the json data into an html file:

https://github.com/Jieiku/abridge/blob/master/templates/tinysearch_json.html

Stork is similar but uses a toml file instead of json, I do the same here, dump it into an html file:

https://github.com/Jieiku/abridge/blob/master/templates/stork_toml.html

This works, but because it's an html file it ends up in the site index:

https://github.com/tinysearch/tinysearch/issues/166

Third reason is that you could then use a facade to delay the loading of certain js features.

Let's say you have a fairly heavy javascript feature; it might be a search engine or some other js tool/script. Normally I would not want to load those up front, especially if they only get used by 10% of my visitors. One way to handle this is to delay loading the js: you make it so that the js does not load until the control or feature is clicked or focused.

Similar to this: https://github.com/Jieiku/abridge/blob/master/static/search_facade.js

Currently to do anything like that you would have to use 3rd party tools such as npm to build the hashes.

Fourth reason is I can consolidate some of my javascript code to a single file or fewer files.

I can include blocks of javascript based on config values in the theme's config.toml:

mytheme.js:

{%- if config.extra.themeswitcher %}

document.getElementById('mode').addEventListener('click', () => {
  document.documentElement.classList.toggle('light');
  localStorage.setItem('theme', document.documentElement.classList.contains('light') ? 'light' : 'dark');
});

{%- endif %}

{%- if config.extra.protect_email %}

(function() {
    // Find all the elements on the page that use class="m-protected"
    var allElements = document.getElementsByClassName('m-protected');

    // Loop through all the elements, and update them
    for (var i = 0; i < allElements.length; i++) {
        // fetch the hex-encoded string from the href property
        var encoded = allElements[i].getAttribute('href');

        // decode the email address
        var decoded = atob(encoded.substring(1));

        // Set the link to be a "mailto:" link
        allElements[i].href = 'mailto:' + decoded;
    }
})();

{%- endif %}

Jieiku commented 1 year ago

You'll need to update the paths where get_url is looking at: grep for search_for_file.

You NAILED it! With that small change it's now building. (I added the file_templates path to search_for_file)

It is now saying the hash does not match, and I see why.

[screenshot: 2023-04-24_17-33-37]

What happens is that zola creates the hashes for search.js before it modifies the file to update the fetch line.

It creates a hash while the file still looks like this:

var fetchURL = '{{ get_url(path="search_index." ~ lang ~ ".json", trailing_slash=false, cachebust=true) | safe }}';

I verified this by checking the hash of the template before it is processed:

[screenshot: 2023-04-24_17-41-24]

After Zola processes the template, it becomes this:

var fetchURL = 'https://abridge.netlify.app/search_index.en.json?h=a6e47a0d153131488e74';

it should have had this hash:

[screenshot: 2023-04-24_17-43-23]

It is as if the hash processing related to search.js (the hash gets inserted into the HEAD of base.html) needs to be delayed until search.js is in the output folder, at which point it would be fully built.

Another solution would be that whenever get_url or a hash function is used on files in the file_templates directory, zola should process them with Tera before creating the hash...

I am not sure which would be best; I think parsing them with Tera and grabbing the hash from the result would be simplest?

I live in WA, USA (LOTS OF RAIN!) I have been waiting for dry weather for some work that I need to get done outdoors, so I will probably only be able to work on it for a couple hours at the end of the day for the next week. (finally dry weather)

Jieiku commented 1 year ago

I am just about finished with adding multi-lingual support to abridge.

Need to find somebody fluent in French to check my translations here: https://github.com/Jieiku/abridge/issues/108

It occurred to me that we should absolutely grab the language code from the DOM instead of during site creation.

Otherwise there is no way to use a single js file for the search, you would need one js file per language.

So that is what I did here:

https://github.com/Jieiku/abridge/blob/08852030541c7a4c0b666a7fa99ca421d9c46bcb/static/search.js#L168

It now fetches the correct index, search_index.en.json for english and search_index.fr.json for french.

I also found a bug with the new json index feature and opened an issue for it: https://github.com/getzola/zola/issues/2193