veganstraightedge opened this issue 4 years ago
As a proof of concept I ran a crawler (wget with mirror settings; a rough sketch of the kind of invocation is at the end of this comment), followed by a script to sanitise the data.
I was able to get a partial static read-only copy of the production website.
For brevity:
- Not all pages are included (I ran it rate-limited so as not to impact the site, and did not let it complete).
- Restricted the crawl to the crimethinc domain so it does not ingest other domains (e.g. podcast episodes); this could be rethought (lite., cloudfront., etc.).
- No changes to the website app.
- It can be scheduled and maintained as a separate tool.
- We can keep historical copies.
- We can keep copies with and without large external media.
- We can run it closer to the production server for faster backups.
There are still pain points around CDN/caching/mirroring.
POC AWS: http://ctmirror.s3-website.eu-west-2.amazonaws.com/
POC Netlify: https://eloquent-swirles-d1e8f8.netlify.app/
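For reference, this is roughly the shape of the wget invocation (a sketch, not the exact script that was run; the rate-limiting and domain-restriction flags correspond to the caveats above, and the output directory is an example):

```sh
# Hypothetical mirror command; output directory and rate limits are examples.
#   --mirror             recursive crawl with timestamping, infinite depth
#   --page-requisites    also fetch the CSS/JS/images needed to render each page
#   --adjust-extension   save pages as .html so a static host serves them as-is
#   --convert-links      rewrite links so the copy works when browsed locally
#   --wait/--random-wait/--limit-rate   throttle the crawl so prod isn't hammered
#   --domains            stay on crimethinc.com (skip lite., cloudfront., etc.)
wget --mirror --page-requisites --adjust-extension --convert-links \
     --wait=1 --random-wait --limit-rate=200k \
     --domains=crimethinc.com --no-parent \
     --directory-prefix=./mirror \
     https://crimethinc.com/
```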
This is awesome! @goncalopereira
Great start! What're the next steps? What are the open questions to consider?
I think a second opinion would be great. I can create a PR with the ongoing scripts; I still need to figure out the project structure for it.
I think the questions are:
- How do we run it against prod without affecting it or being blocked? Or can we get a prod DB?
- Which mirrors are we supporting?
- Which subdomains or external websites need caching?
- Fixing headers on prod would make the crawl more efficient (as would caching on prod itself).
- Fixing mixed content on prod, if possible.
i don't think crawling is the best way to think about this, because then you have to recrawl everything (or parts? or what? hard to decide!) whenever content changes.
what some dynamic sites do is internalize the "crawler", or more accurately, the static generation: each page rendering is stored on disk, which doubles as a fast cache and helps under denial-of-service conditions. i worked on Drupal sites in the past which used the "boost.module" to do this, but that didn't work well for creating a static site copy. i think there's something better for Drupal now, but that's irrelevant since you don't use Drupal. :p (Django can similarly drive static sites.)
So I guess the question, IMHO, is how to do this caching thing but with Rails as a backend. I frankly have close to zero experience coding in Rails, but a few searches gave me this documentation, where "page cache" certainly looks interesting.
Note that you'd still have to have something that crawls the entire site (maybe? or maybe rails is magic and will do that on its own?) but the difference is that then you have a server-side archive that you can more easily distribute, and that's a trusted copy that you don't necessarily need to refresh all the time. Whenever you post something new, as soon as someone reads it, it gets cached and added to the pile.
This beats recrawling everything all the time...
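For anyone curious what that would look like concretely: page caching was extracted from Rails core into the actionpack-page_caching gem, so a minimal sketch could be something like the following (the controller and directory names are made up for illustration, not taken from the actual app):

```ruby
# Gemfile
gem "actionpack-page_caching"

# config/environments/production.rb
# Rendered pages are written to this directory, which can then be served
# directly by nginx, rsync'd elsewhere, or tarred up. (Name is hypothetical.)
config.action_controller.page_cache_directory =
  Rails.root.join("public", "static-mirror")

# app/controllers/articles_controller.rb (hypothetical controller)
class ArticlesController < ApplicationController
  caches_page :index, :show   # first render writes a static .html file to disk

  def show
    @article = Article.find(params[:id])
  end

  def update
    @article = Article.find(params[:id])
    @article.update!(article_params)
    expire_page action: :show, id: @article.id   # drop the stale cached file
    redirect_to @article
  end

  private

  def article_params
    params.require(:article).permit(:title, :body)
  end
end
```

As noted above, pages only land in the cache once someone requests them, so a periodic internal crawl (or a task that renders every route) would still be needed to guarantee a complete snapshot.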
Thanks @anarcat.
I agree that an internal static site generator is also a good idea. We already do a fair bit of caching in Rails land, but that's dependent on having a big Redis server running, and it isn't easy to hand off to another person/place hosting a copy of the site.
IMO, a happy medium would be if the Rails CMS generated static files/folders of the site, then shipped them off-site somewhere, both as files/folders ready to serve as a static site and as a gzipped tarball for others to download and mirror, if needed.
Being able to easily spin up a new Rails/Postgres/Redis stack would be nice to have too, but that's not as easy for many people in many situations to run as a classic static web server.
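A sketch of what that happy medium could look like as a rake task, assuming the static files live in a single directory (e.g. the page-cache directory from the sketch above); the task name, paths, and rsync destination are placeholders:

```ruby
# lib/tasks/static_mirror.rake (hypothetical task; paths and host are placeholders)
namespace :mirror do
  desc "Package the generated static site as a dated tarball and ship it off-site"
  task package: :environment do
    source  = Rails.root.join("public", "static-mirror")
    stamp   = Time.now.utc.strftime("%Y-%m-%d")
    tarball = Rails.root.join("tmp", "crimethinc-mirror-#{stamp}.tar.gz")

    # Tarball for people who want to download and mirror the whole site.
    sh "tar -czf #{tarball} -C #{source} ."

    # Plain files/folders for anyone who wants to serve it as a static site;
    # the destination below is an example, not a real host.
    sh "rsync -a --delete #{source}/ mirror@static.example.net:/srv/crimethinc-mirror/"
  end
end
```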
On 2022-02-24 14:21:25, Shane Becker wrote:
> Thanks @anarcat.
> I agree that an internal static site generator is also a good idea. We already do a fair bit of caching in Rails land, but that's dependent on having a big Redis server running, and it isn't easy to hand off to another person/place hosting a copy of the site.
> IMO, a happy medium would be if the Rails CMS generated static files/folders of the site, then shipped them off-site somewhere, both as files/folders ready to serve as a static site and as a gzipped tarball for others to download and mirror, if needed.
yeah i think that's what the page cache is supposed to do, but maybe that's what you're already doing in redis?
> Being able to easily spin up a new Rails/Postgres/Redis stack would be nice to have too, but that's not as easy for many people in many situations to run as a classic static web server.
yeah for sure, it's more of a quick disaster recovery for you i guess.
We're still going to run a Rails app for the .com and the CMS. A static snapshot of the site could serve as a read-replica mirror.
This issue is about creating a way to generate a static version of the site, which could then be hosted just about anywhere.