adding content hash to js/css filenames

ffigiel commented 3 years ago

I think it would be useful if the js/css filenames had a content hash in them, so that when a new version of my page is deployed, browsers load the new assets immediately. Is this sort of feature within the scope of elm-pages?

For now, here's a quick script I wrote to achieve this. It should be run after running elm-pages build.

EDIT: updated the script to support both Linux and macOS.

#!/usr/bin/env bash

set -e

hash() {
  # there is no md5sum command on macOS; there we need to use `md5 -r` instead
  if command -v md5sum > /dev/null; then
    md5sum "$@"
  else
    md5 -r "$@"
  fi
}

addContentHash() {
  # update a filename to include its content hash
  NAME=$1
  EXT=$2
  OLDNAME="${NAME}.${EXT}"
  NEWNAME="${NAME}-$(hash "${OLDNAME}" | cut -c 1-6).${EXT}"
  mv "${OLDNAME}" "${NEWNAME}"
  # update html files to use the new name (g: replace every occurrence on a line)
  # we specify the backup suffix to be compatible with macOS
  find . -name '*.html' -exec sed -i'.bak' -e "s|\"/$OLDNAME\"|\"/$NEWNAME\"|g" {} +
  # remove backup files
  find . -name '*.html.bak' -delete
}

cd dist

addContentHash index js
addContentHash elm js
addContentHash style css

ffigiel commented 3 years ago

The script above doesn't modify elm-pages.js, but it seems to be unused anyway (#219)

$ grep -rP \(index.js\|elm.js\|style.css\) .
./elm-pages.js:import userInit from "/index.js";

dillonkearns commented 3 years ago

It's an interesting idea. I wonder what the best practice is these days with CDN hosting. I think a CDN service like Netlify or Vercel would take care of using ETags to ensure that these can be cached, right?

I'd be curious to hear how modern tools, like Vite or others are handling this.

ffigiel commented 3 years ago

I'm not an expert on that part of the web ecosystem, but I suppose CDNs do caching well enough. Still, relying on e-tags means you need to send a request to the server anyway to know if the asset is up-to-date.

Having a hash in the filename lets you skip that request by taking advantage of the browser cache. It just occurred to me that for an Elm application the benefit may not be that big, since it's just one JS bundle, but it could be useful for the generated content.json files.

...it also lets your manager see the latest version without having you tell them to hit ctrl+f5 :sweat_smile:

ffigiel commented 3 years ago

I'd be curious to hear how modern tools, like Vite or others are handling this.

Here's one data point: Nuxt 2 does content hashing out of the box:

dist
├── 200.html
├── index.html
├── jobs
│   └── index.html
└── _nuxt
    ├── 2aade51.js
    ├── 3682170.js
    ├── 45bb05f.js
    ├── b5e44ac.js
    ├── e1bb572.js
    ├── edcbabc.js
    └── LICENSES

dillonkearns commented 3 years ago

Still, relying on e-tags means you need to send a request to the server anyway to know if the asset is up-to-date.

Yeah, that is a nice benefit. content.json is a bit different because elm-pages needs to be able to find those files at a predictable path. At least it depends on that right now, and I don't think tools like Next.js or Gatsby hash their equivalent of content.json, for similar reasons.

What I'm still not clear on, though, is how to take advantage of those hashes. Doesn't that rely on the server setting a very long time to live (TTL) in its response headers, so the files are marked as immutable and cached as long as possible? So it's not in the hands of the elm-pages output alone, but requires the server response headers as well. Any ideas on what other frameworks do to coordinate with hosting providers?

ffigiel commented 3 years ago

Yes, you would need to set up the web server accordingly. The main part is configuring the Cache-Control header for html files so that they are kept only for a short period of time. Without that header, the browser's default caching algorithm kicks in and keeps the file for several days (it depends on current cache size, the user's browsing habits, etc.). For the same reason, configuring Cache-Control with a long caching policy on the "immutable" assets is less important.
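To make the split between html files and hashed assets concrete, here is a minimal sketch in bash. The function name cache_policy and the exact header values are assumptions for illustration, not something elm-pages or any hosting provider ships; it also assumes the NAME-<6 hex chars>.EXT naming produced by the script earlier in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print a Cache-Control value for a built file.
# The values are one common choice, not the only valid one.
cache_policy() {
  local h='[0-9a-f]'
  case "$1" in
    *-$h$h$h$h$h$h.js | *-$h$h$h$h$h$h.css)
      # the name changes whenever the content changes, so cache "forever"
      echo "public, max-age=31536000, immutable" ;;
    *)
      # html (and anything without a hash) must be revalidated, so that
      # a new deploy's references to the new hashed names are picked up
      echo "no-cache" ;;
  esac
}
```

For example, cache_policy index.html prints no-cache, while cache_policy elm-a1b2c3.js prints the long-lived policy. How these values get attached to responses depends on the web server or hosting provider in use.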

Sorry, I don't have input on frameworks/hosting providers for this. I'm only familiar with Nuxt, which comes with its own web server for SSR.

The deeper we go into this conversation, the more I think this pattern is useful for web apps, but perhaps not so much for web pages/blogs, since they don't change or get visited as often.

dillonkearns commented 2 years ago

Hey @megapctr, so I'm still trying to figure out the best practice here for different hosting providers.

I found a few resources that reference file fingerprinting (i.e. hashing) in Netlify, but there seem to be some conflicting ideas. Some say that Netlify takes care of this automatically and that using a hash can break the built-in CDN caching, while others talk about using hashes and configuring Netlify to send Cache-Control headers that mark the hashed files as immutable. Here are a few resources I came across that mention hashing assets in Netlify:

It turns out I'm working on a ViteJS integration for elm-pages 3.0, so it will be easy to configure this hashing. I think I can let the user configure Vite however they want, which puts the ball in the user's court to decide what strategy they want to use for bundling, hashing, and hosting their assets. That said, I am still very interested in understanding the best practices here better, and I'd like to build some sane defaults into at least the elm-pages init template and maybe describe them in the elm-pages.com docs as well.

For reference, here's the Discussion thread about the elm-pages Vite integration: https://github.com/dillonkearns/elm-pages/discussions/277.

I'll leave this thread open for now and if anyone has any insights about the best practices for hashing with CDN hosting providers (or wants to do some research), I would love to hear about it.

dillonkearns commented 2 years ago

Oh yes, and there is one more thing to consider for the Vite integration here. With the current Vite setup I'm working on, Vite doesn't deal with the Elm project at all. I think this is good in a lot of ways because the user can freely configure Vite, since it is entirely decoupled from how elm-pages compiles the Elm code for the project.

Since it is using a separate process, though, the generated elm.js file isn't processed by Vite. I'm trying to decide whether it would make sense to have Vite process that file or not. One possibility would be to have it process it like a normal JS asset, but this would mean I would have to transform the output into ESM. It could also have other side-effects like minifying the elm.js file an extra time, and running whatever custom Vite config for JavaScript files the user applies on the elm.js file (which could be both good and bad potentially).

I'm not quite sure what to make of all of that. My initial inclination is to just generate elm.js as a special file in the top-level dist/ folder, and generate the Vite-processed files in dist/assets (which is Vite's default setup). That seems like a good starting point, and then people can give feedback if they have some concrete use cases where it would be valuable to have it set up otherwise.

ffigiel commented 2 years ago

Hey Dillon, the Vite integration is exciting news! I'm sure complex websites will be much easier to build vite it ✨

Netlify is a fantastic service and I would trust them to follow the latest best caching practices (it's strictly in their business domain, it saves them money and brings customer satisfaction) but it's not the only use case for elm pages. For example if someone already had a web server and just wanted to host their website/blog, they might be better off with having fingerprinted assets rather than setting up an advanced caching system like the one Netlify has.

Vite does fingerprinting out of the box. To me it would be weird to have all the assets fingerprinted except for the elm app for some reason.

I read the resources you linked but I couldn't find how a hash/fingerprint would break the CDN cache.

One potential issue that comes to mind is someone loading an old html file while a new version is deployed, causing the old fingerprinted assets to disappear and the website not to load properly. However, I would expect the CDN to keep the old assets around for at least a few seconds after the deployment, unless these assets had not been loaded in a while and the CDN cache had expired. It's such an improbable scenario that I generally wouldn't consider it.

Now I might sound like some fingerprinting acolyte 😛 I just don't see any harm in it.

About other vite topics, I replied in the GH discussion.

dillonkearns commented 2 years ago

I went ahead and added a fingerprint using the same algorithm that Vite uses:

https://github.com/dillonkearns/elm-pages/blob/723f0b22a1901ece38fd9cb177664a5037454e56/generator/src/build.js#L314-L323

https://github.com/dillonkearns/elm-pages/blob/723f0b22a1901ece38fd9cb177664a5037454e56/generator/src/build.js#L389-L391

So although it's not running through Vite directly (which is nice because the user doesn't have to worry about accidentally breaking the way that elm-pages builds things), it gets the same fingerprinting that any of their other cacheable assets do. So I think this is in good shape! Thanks for the conversation. I'll close this as this will be included in the 3.0 release now 👍
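For readers who want a feel for what such a fingerprint looks like, here is a rough bash sketch. It assumes the fingerprint is the first 8 hex characters of a SHA-256 digest of the file contents and that it is placed before the extension; both are assumptions for illustration, and the linked build.js lines are the authoritative source. The function name fingerprint_name is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print a fingerprinted name for a file, e.g.
# dist/elm.js -> dist/elm.<8 hex chars>.js (the naming is an assumption).
fingerprint_name() {
  local file=$1
  local base=${file%.*}   # path without the final extension
  local ext=${file##*.}   # final extension
  local digest
  # sha256sum is common on Linux; shasum ships with macOS
  if command -v sha256sum > /dev/null; then
    digest=$(sha256sum "$file" | cut -c 1-8)
  else
    digest=$(shasum -a 256 "$file" | cut -c 1-8)
  fi
  printf '%s.%s.%s\n' "$base" "$digest" "$ext"
}
```

Because the digest depends only on the file contents, rebuilding without changes yields the same name, and any change to the compiled output yields a new one.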

dillonkearns / elm-pages

adding content hash to js/css filenames #220