geolexica / geolexica-server

Generalized backend for Geolexica sites

Need faster deploys #160

Open skalee opened 3 years ago

skalee commented 3 years ago

Deploying the IEV site took over an hour, most of which (50 minutes) was spent on sending the produced files to S3. We need to speed this up.

Currently we deploy with our custom Rake task defined here: https://github.com/geolexica/geolexica-server/blob/master/lib/tasks/deploy.rake. Under the hood it uses `aws s3 sync`, part of the official AWS CLI.
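For reference, a minimal sketch of what such a task can look like. This is an assumption based on the description above, not the actual contents of deploy.rake; the environment variables and defaults are illustrative:

```ruby
desc "Deploy the generated site to S3"
task :deploy do
  site_dir = ENV.fetch("SITE_DIR", "_site")
  bucket   = ENV.fetch("DEPLOY_BUCKET") # e.g. "iev-demo-site" (hypothetical)

  # `aws s3 sync` compares local files against the bucket and uploads
  # only those it considers changed; --delete removes stale objects.
  sh "aws", "s3", "sync", "--delete", site_dir, "s3://#{bucket}"
end
```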

Some ideas on how to deal with this can be found in https://github.com/glossarist/iev-demo-site/issues/66.

skalee commented 3 years ago

@ronaldtse I've got two questions (quoted in the reply below):

ronaldtse commented 3 years ago

> If I end up creating a brand new tool (which is possible, because these slow uploads are likely caused by poor parallelism), does it matter whether it's a Node or a Ruby tool?

No, as long as you can maintain it.

> During upload the site may be inconsistent (some pages old, some new). Is that a problem? If yes, then there are two options:
>
> 1. We may upload the site to a temporary bucket and then copy it to the proper one. Copying files between buckets in the same region should be much faster than uploading, especially with the S3P tool you found.

Great idea! (A bucket-to-bucket copy sketch follows at the end of this comment.) GitHub now also supports environments, so deploys can be queued: while one job is running, subsequent jobs wait. In this case, we can use S3 Transfer Acceleration for the temporary bucket (as long as its name does not contain dots).

> 2. Alternatively, we can display some maintenance page.

This is probably necessary in either case.

The third option is to use AWS DynamoDB or MongoDB Atlas, which will be necessary for high-frequency update workloads.
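For option 1, a minimal sketch of the bucket-to-bucket copy, assuming the `aws-sdk-s3` gem (the bucket names and region are illustrative, and S3P or s3s3mirror would parallelize this far better):

```ruby
require "aws-sdk-s3"

s3      = Aws::S3::Client.new(region: "us-east-1")
staging = "geolexica-staging" # hypothetical temporary bucket
live    = "geolexica-live"    # hypothetical production bucket

# copy_object performs a server-side copy: the data never leaves AWS,
# which is why same-region copies are much faster than fresh uploads.
s3.list_objects_v2(bucket: staging).each do |page|
  page.contents.each do |object|
    s3.copy_object(
      bucket:      live,
      key:         object.key,
      copy_source: "#{staging}/#{object.key}"
    )
  end
end
```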

ronaldtse commented 3 years ago

https://github.com/cobbzilla/s3s3mirror seems to work for mirroring.

ronaldtse commented 3 years ago

I just found out that we can enable Transfer Acceleration if we rename the buckets to remove the dots. It's now possible to use an arbitrarily named S3 bucket as an origin for CloudFront, so we can use "example-com" instead of "example.com" as the bucket name. Let me see what we can do.
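If renaming works out, enabling acceleration is a one-call change. A sketch assuming the `aws-sdk-s3` gem (bucket name illustrative):

```ruby
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")

# Transfer Acceleration requires a DNS-compliant bucket name without
# dots, hence the proposed rename.
s3.put_bucket_accelerate_configuration(
  bucket: "example-com",
  accelerate_configuration: { status: "Enabled" }
)

# Uploads then opt in to the accelerate endpoint per client:
fast = Aws::S3::Client.new(region: "us-east-1", use_accelerate_endpoint: true)
fast.put_object(bucket: "example-com", key: "index.html", body: "<!doctype html>")
```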

skalee commented 3 years ago

> The third option is to use AWS DynamoDB or MongoDB Atlas, which will be necessary for high-frequency update workloads.

Is this expected? I thought glossaries would not be updated very frequently.

skalee commented 3 years ago

> I just found out that we can enable Transfer Acceleration if we rename the buckets to remove the dots. It's now possible to use an arbitrarily named S3 bucket as an origin for CloudFront, so we can use "example-com" instead of "example.com" as the bucket name. Let me see what we can do.

AWS docs say:

> You might want to use Transfer Acceleration on a bucket for various reasons:
>
>   • Your customers upload to a centralized bucket from all over the world.
>   • You transfer gigabytes to terabytes of data on a regular basis across continents.
>   • You can't use all of your available bandwidth over the internet when uploading to Amazon S3.

Doesn't sound like our case.

ronaldtse commented 3 years ago

Frequency: it's also about burst frequencies, e.g. if people make subsequent changes quickly.

I found a way to make Transfer Acceleration work with CloudFront, but it requires a separate Lambda@Edge function to return index.html in order to mimic S3 website functionality.

In this case we may not need two buckets, but let's see.

skalee commented 3 years ago

> Frequency: it's also about burst frequencies, e.g. if people make subsequent changes quickly.

Wow, that sounds like a very different thing from the deploys we have now. If burst updates can happen, then slow uploads aren't our only problem: building the full site from scratch will be too slow as well. Note that IEV has around 20k concepts. We need some kind of incremental site builds in GHA to handle burst updates, or throttling, or debouncing.

skalee commented 3 years ago

Also, we need to prevent race conditions between deploys.

skalee commented 3 years ago

I'm not sure what exactly Paneron will be responsible for when it comes to site generation, so this may be a silly idea: we could use Paneron to generate concept pages, and then use Jekyll to bind them into a site. Jekyll supports incremental site generation, so if we modify only a few files, it should finish quite fast. Then we need to upload these modified files without touching the others; maybe s3 sync will do much better in such a case.

Obviously that won't speed up full site rebuilds, which we need too.
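In GHA terms the build-and-upload step could be as small as the sketch below, assuming Jekyll's `--incremental` flag and the existing `aws s3 sync` deploy (`DEPLOY_BUCKET` is an illustrative variable):

```ruby
# Sketch only: rebuild incrementally, then sync the output directory.
# --incremental regenerates only documents whose sources changed, so
# unchanged files keep their timestamps and s3 sync has less to upload.
system("bundle", "exec", "jekyll", "build", "--incremental") or abort("jekyll build failed")
system("aws", "s3", "sync", "_site", "s3://#{ENV.fetch('DEPLOY_BUCKET')}") or abort("s3 sync failed")
```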

skalee commented 3 years ago

My new idea involves persisting the generated site across builds. This would be a separate Git repo (maybe hosted on GitHub, maybe existing just in the GHA cache; it doesn't really matter), because I don't trust file timestamps as much as commit dates. A file's modification timestamp can be updated for any reason, whereas a Git commit date reflects an actual change to the file's contents.

In steps (all done in GHA):

  1. Obtain the generated site (a Git repo) from previous builds.
  2. Rebuild the site (incrementally or not).
  3. Commit all the differences.
  4. List all files in the generated site along with their last commit timestamps.
  5. List all files in the S3 bucket along with their last modification timestamps.
  6. Send only those files which have changed since the last deploy.

This approach should greatly reduce deploy time compared to s3 sync. The latter compares MD5 hashes in order to tell which files have changed. While this is a great idea in the general case, it surely takes some time, even though files stored in S3 already have these hashes computed as ETags (unless the bucket uses KMS encryption or multipart uploads, in which case the ETag is not a plain MD5). Alternatively, s3 sync can look at file sizes only, which is much faster but not as reliable. Steps 4-6 could look like the sketch below.
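A sketch of steps 4-6, assuming the `aws-sdk-s3` gem and that the current directory is a checkout of the persisted site repo (bucket name and region are illustrative):

```ruby
require "aws-sdk-s3"
require "time"

bucket = "geolexica-live" # hypothetical bucket name
s3     = Aws::S3::Client.new(region: "us-east-1")

# Step 5: last-modified timestamps of everything currently in the bucket.
remote_mtimes = {}
s3.list_objects_v2(bucket: bucket).each do |page|
  page.contents.each { |obj| remote_mtimes[obj.key] = obj.last_modified }
end

# Steps 4 and 6: take each file's last commit time from Git and upload it
# only if it is newer than the bucket's copy (i.e. changed since last deploy).
`git ls-files -z`.split("\0").each do |path|
  committed_at = Time.parse(`git log -1 --format=%cI -- "#{path}"`.strip)
  next if remote_mtimes[path] && committed_at <= remote_mtimes[path]

  File.open(path, "rb") do |file|
    s3.put_object(bucket: bucket, key: path, body: file)
  end
end
```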

ronaldtse commented 3 years ago

@skalee I think a more comprehensive approach is needed for S3 bucket sync; syncing unchanged items is clearly not desired. A possible mechanism is to maintain a hash index at the bucket root (with hash keys of all files), updated by some cron/Lambda function, so that when we upload something we can tell which files need (or do not need) updating.
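One way the hash index could work, sketched with the `aws-sdk-s3` gem; the manifest key, bucket name, and `_site` directory are illustrative, and the cron/Lambda updater mentioned above is left out:

```ruby
require "aws-sdk-s3"
require "digest"
require "json"

bucket = "geolexica-live" # hypothetical
s3     = Aws::S3::Client.new(region: "us-east-1")

# One GET fetches the whole index: a JSON map of object key => MD5.
manifest =
  begin
    JSON.parse(s3.get_object(bucket: bucket, key: "manifest.json").body.read)
  rescue Aws::S3::Errors::NoSuchKey
    {} # first deploy: treat everything as changed
  end

Dir.glob("_site/**/*").select { |f| File.file?(f) }.each do |file|
  key = file.delete_prefix("_site/")
  md5 = Digest::MD5.file(file).hexdigest
  next if manifest[key] == md5 # hash matches the index: skip the upload

  File.open(file, "rb") { |io| s3.put_object(bucket: bucket, key: key, body: io) }
  manifest[key] = md5
end

# Persist the updated index for the next deploy.
s3.put_object(bucket: bucket, key: "manifest.json", body: JSON.dump(manifest))
```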

skalee commented 3 years ago

FYI, I've just triggered a re-deploy on iev-demo-site and it's slow again, despite the fact that nothing was changed and most files are identical.

ronaldtse commented 3 years ago

These are relevant features: