SRCF / projects

A collection of projects that the SRCF is looking for help with

Rethink how SRCF projects share web assets #15

Open CHTJonas opened 3 years ago

CHTJonas commented 3 years ago

Project/idea summary

Provide a way for SRCF web assets to be included in a versioned manner, preferably at unique URLs that can be cached indefinitely. This will need to account for the fact that some assets will be custom and likely stored in one of the SRCF git repos, while others will be vendored copies of well-known open-source libraries. Additionally, some assets will be static files like images or videos.

For example, the current SRCF stylesheet applied on top of Bootstrap might be found at https://assets.srcf.net/:git_abbrev_commit/srcf-bs.css while a vendored copy of jQuery might be https://assets.srcf.net/vendor/jquery/3.4.1/jquery.min.js.

Motivation

Our current setup stores assets as files accessible at URLs underneath https://www.srcf.net/_srcf/. This works acceptably, with the caveat that any change to those files becomes immediately visible everywhere that includes them. Browser caching behaviour also comes into play here, as changes might take a while to 'go live' in users' browsers.

Whilst our existing method has served us well for a while, my opinion is that the lack of versioning will become increasingly annoying as more and more projects start to include the same set of shared assets.

Alternatives considered

A git repo containing the assets, used as a submodule by any project that wants to include them. This potentially has a major caching disadvantage for users' browsers, since each project would then serve its own local copy of the assets rather than referencing a shared location. There are also storage implications to versioning large files in git.

matiasilva commented 3 years ago

This is a great idea! The most salient point for me is that as we grow and work on more things, we want to maintain a global style sheet and set of assets that would benefit from being in proper version control, which this solves. Let's give this a few more days for folks to comment, then I'll approve it.

CHTJonas commented 3 years ago

I wrote a very basic prototype in Go last night, mostly as an academic exercise. It achieves what's listed above but could likely be improved further, e.g. by selectively caching the output of git show abbrev_commit:file_path. It does have the disadvantage that it is an entirely custom application server which, whilst very simple, will require some upkeep and maintenance going forwards. asset-server.zip
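(Purely to illustrate the lookup involved rather than the actual Go code: serving /:commit/:path essentially boils down to asking git for the file's contents at that commit, along the lines of the following, with the commit and filename made up.)

# Illustrative only: commit and filename are taken from the examples above.
cd /path/to/asset/git/repo
git show "abc1230:srcf-bs.css"    # prints the file as it existed at that commit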

Instead, @doismellburning suggested in #hackday that the future git repo containing our custom (non-vendored) assets could include a Makefile that runs something like cp -r * /public/societies/srcf-web/public_html/assets/$(git rev-parse --short HEAD)/ inside the repo. This is probably a much simpler and more accessible solution, and could likely be extended in the future were we to start using something like Webpack to bundle our assets.
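For illustration, the deploy step such a Makefile target would wrap might look roughly like this (untested sketch; paths as above):

#!/bin/bash
# Hypothetical sketch of the suggested deploy step: publish the repo's working
# tree under a directory named after the abbreviated commit hash.
set -euo pipefail
cd /path/to/asset/git/repo
DEST="/public/societies/srcf-web/public_html/assets/$(git rev-parse --short HEAD)"
mkdir -p "$DEST"
cp -r ./* "$DEST/"    # like the Makefile's cp -r *, this skips dotfiles such as .git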

bmillwood commented 3 years ago

The drawback of those approaches is that your URLs change whenever you make a git commit, even if the file in question doesn't change. Maybe you could do some content-addressable storage thing and use a hash of the file instead? It's a bit trickier to keep all the hashes you care about in order, but hopefully it's still at the complexity of "bash script" rather than "custom application server".

(I can't help but also wonder whether either of the downsides of the existing approach is really hurting in practice... are we being stung by caching behaviour? Is "changes immediately visible everywhere" actually bad?)

CHTJonas commented 3 years ago

My thinking was that content would be preserved approximately forever (disk space is cheap) and so old URLs would 'never' expire or have their content removed. As a worked example, if srcf.js is introduced in commit abc1230 and a child commit xyz7890 is made which doesn't affect srcf.js then there would be two duplicate copies of it, for example https://assets.srcf.net/assets/abc1230/srcf.js and https://assets.srcf.net/assets/xyz7890/srcf.js. This is obviously non-ideal but not a deal-breaker IMO and could likely be resolved with some crafty use of rsync in hardlink/differential mode, or some bash scripting.
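For example, something along these lines might handle the deduplication (untested sketch; all paths assumed):

#!/bin/bash
# Hypothetical sketch: publish the working tree under the current abbreviated
# commit hash, hard-linking any file that is unchanged from the previously
# published commit (assumed here to be the parent commit) so that duplicates
# take up essentially no extra disk space.
set -euo pipefail
STORE="/public/societies/srcf-web/public_html/assets"
cd /path/to/asset/git/repo
NEW="$(git rev-parse --short HEAD)"
PREV="$(git rev-parse --short HEAD~1)"
rsync -a --exclude='.git' --link-dest="$STORE/$PREV" ./ "$STORE/$NEW/"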

That all being said, I'm not at all opposed to using content-addressable storage. Are you envisaging something like the following?

#!/bin/bash

cd /path/to/asset/git/repo
STORE="/path/to/htdocs"

find . -type f -not -path './.git/*' | while IFS= read -r FULL_PATH; do
  DIR_NAME="$(dirname "$FULL_PATH")"
  BASE_NAME="$(basename "$FULL_PATH")"
  EXTENSION="${BASE_NAME##*.}"
  # Store each version in a directory named after the original file,
  # with the content hash as the filename.
  DIR_PATH="$STORE/$DIR_NAME/$BASE_NAME"
  HASH="$(sha256sum "$FULL_PATH" | awk '{ print $1 }')"
  mkdir -p "$DIR_PATH"
  cp "$FULL_PATH" "$DIR_PATH/$HASH.$EXTENSION"
done

Browser caching behaviour is less important (although currently we don't send a Cache-Control header for resources underneath https://www.srcf.net/_srcf/ so bets are somewhat off). On the other hand, we have recently been bitten by changes to the main www site adversely affecting the control panel. As Timeout and LBT grow in complexity, and new projects spring up, I expect more sharing of common assets, at which point this becomes more important.

bmillwood commented 3 years ago

Yeah, something like that (although I was thinking preserve the original filename and use the hash as a directory name). But you also might want a way to mass-update your hrefs and srcs when there's a new file that you want to opt into using.
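Very roughly, that mass-update could be a pass like the following (untested sketch; the URL layout, template location and all paths are assumed):

#!/bin/bash
# Hypothetical sketch: for each asset in the repo, compute its current content
# hash and point any existing hashed reference in the templates at it.
set -euo pipefail
TEMPLATES="/path/to/templates"   # wherever the hrefs and srcs live
cd /path/to/asset/git/repo
find . -type f -not -path './.git/*' | while IFS= read -r FILE; do
  NAME="${FILE#./}"
  HASH="$(sha256sum "$FILE" | awk '{ print $1 }')"
  # Rough regex: assumes hex hash directories and no regex-special
  # characters in the filename.
  grep -rl "assets.srcf.net/" "$TEMPLATES" \
    | xargs -r sed -i -E "s|assets\.srcf\.net/[0-9a-f]+/$NAME|assets.srcf.net/$HASH/$NAME|g"
done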

Craftiness with rsync hardlinks also seems like a good solution, but at that stage I don't know which approach involves the least craft :)

(Of course I don't have a horse in this race per se, just suggesting ideas).

CHTJonas commented 3 years ago

This might be relevant to the discussion on caching: https://www.stefanjudis.com/notes/say-goodbye-to-resource-caching-across-sites-and-domains/. In summary, Chrome and Safari now partition the HTTP cache: the eTLD+1 (or, in some cases, more of the hostname) of the top-level site is combined with the asset URL to determine the asset's cache key. That is to say, if www.facebook.com sources https://code.jquery.com/jquery-3.5.1.slim.min.js then that will be cached separately from www.srcf.net sourcing the same file at the same URL.