Speed was pretty reasonable from S3, but it did vary a lot, from 5 MiB/s to 20 MiB/s at different times. I'm thinking three things could explain it or speed it up:
We actually don't load `input.geojson` on the server, or anything in the tile directory. The files we're primarily concerned with for server load are actually the `.buf` files - `topo.buf` being the primary & very large one, plus a dozen or so much smaller `.buf` files.
Some other factors I think are likely to come into play are instance networking speed and potentially EBS bandwidth. It looks like the instances we use have burstable IO for both EBS and networking (that's what "up to" means in the table at https://aws.amazon.com/ec2/instance-types/r5/), and EBS and network bandwidth are listed at different speeds.
Benchmarking results will be interesting to see: if the app is bandwidth-limited, then we might be accidentally paying the networking costs twice by loading S3 -> EBS before loading into memory. On the other hand, if the limiting factor is how much bandwidth S3 is willing to push out to a single client (20 MiB/s doesn't seem all that close to 10 Gbps, or even 4.75 Gbps), then maybe we could gain some speed by downloading in multiple threads.
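As a rough illustration of the multi-threaded idea, here's a minimal sketch that splits a single S3 object into ranged GETs fetched concurrently, assuming the AWS SDK for JavaScript v3. The part count, bucket/key names, and the `downloadInParallel` helper are hypothetical, not anything in our codebase:

```typescript
import { S3Client, GetObjectCommand, HeadObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Download one object as N concurrent ranged GETs and reassemble in order.
// Assumes the object is much larger than the part count (true for topo.buf).
async function downloadInParallel(bucket: string, key: string, parts = 8): Promise<Buffer> {
  // HEAD the object to learn its size so we can compute byte ranges.
  const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  const size = head.ContentLength ?? 0;
  const chunk = Math.ceil(size / parts);

  const pieces = await Promise.all(
    Array.from({ length: parts }, async (_, i) => {
      const start = i * chunk;
      const end = Math.min(start + chunk - 1, size - 1);
      // Each ranged GET uses its own connection, so total throughput isn't
      // capped by what S3 will push down a single stream.
      const res = await s3.send(
        new GetObjectCommand({ Bucket: bucket, Key: key, Range: `bytes=${start}-${end}` })
      );
      return Buffer.from(await res.Body!.transformToByteArray());
    })
  );
  return Buffer.concat(pieces);
}

// e.g. const buf = await downloadInParallel("some-data-bucket", "TX/topo.buf");
```

If the benchmarks show S3 throttling per-connection throughput, something like this should scale close to linearly until we hit the instance's network ceiling.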
I did some investigative work as part of #1128, which revealed that downloading from S3 is not merely a large portion of application startup time - it is nearly all of it.
Even for our largest region (TX), loading the file from disk and deserializing it took only 12 seconds (!!!).
We should definitely implement file caching.
I'm thinking we want something like the following:
Cache `topo.buf` on an EBS volume (and probably the other `.buf` and `.json` files the server consumes on EBS as well for data consistency, but that's not important from a performance standpoint).
We could either configure tasks to use the most recent EBS snapshot at the time of startup (and update data using our existing procedure of cycling in new tasks), or we could cycle out which EBS volume is associated with the currently running tasks w/o restarting them. If we switch back to Fargate at some point in the future (which we should consider after implementing the various performance improvements we have planned), we'd have to replace EBS w/ EFS, but conceptually everything would remain the same.
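A minimal sketch of the caching half of this, checking the volume before falling back to S3 - the `DATA_CACHE_DIR` env var, mount point, and `getBuf` helper are all hypothetical placeholders, not existing app code:

```typescript
import { createWriteStream, promises as fs } from "fs";
import { pipeline } from "stream/promises";
import { Readable } from "stream";
import * as path from "path";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
// Assumed mount point for the data volume; not an existing app setting.
const CACHE_DIR = process.env.DATA_CACHE_DIR ?? "/data/cache";

// Return a local path for the requested file, downloading it only on a miss.
async function getBuf(bucket: string, key: string): Promise<string> {
  const cached = path.join(CACHE_DIR, key);
  try {
    await fs.access(cached); // cache hit: read straight off the EBS volume
  } catch {
    // Cache miss: stream from S3 to disk so the next startup skips S3 entirely.
    await fs.mkdir(path.dirname(cached), { recursive: true });
    const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    await pipeline(res.Body as Readable, createWriteStream(cached));
  }
  return cached;
}
```

A real implementation would want to write to a temp file and rename it into place so a crash mid-download can't leave a truncated `topo.buf` to be served on the next startup.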
As Derek pointed out, my tests were done locally using a nice NVMe SSD, so we may not be able to expect 12-second load times out of EBS: https://github.com/PublicMapping/districtbuilder/pull/1128#discussion_r803746281
It's worth benchmarking EBS performance to see what we can expect.
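One quick way to get a ballpark number is a read-throughput sketch like the one below. The file path is a placeholder, and you'd want to run it against a cold file (e.g. on a freshly attached volume) so the page cache doesn't flatter the result:

```typescript
import { createReadStream, statSync } from "fs";

// Stream a large file and report sustained read throughput in MiB/s.
async function benchmarkRead(file: string): Promise<void> {
  const bytes = statSync(file).size;
  const start = process.hrtime.bigint();
  for await (const _chunk of createReadStream(file, { highWaterMark: 1 << 20 })) {
    // Discard chunks; we only care about elapsed wall-clock time.
  }
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  console.log(`${file}: ${(bytes / 1024 / 1024 / seconds).toFixed(1)} MiB/s over ${seconds.toFixed(2)}s`);
}

// e.g. node benchmark.js /data/cache/TX/topo.buf
benchmarkRead(process.argv[2] ?? "topo.buf").catch(console.error);
```

Keep in mind that gp2/gp3 volumes have their own burst and provisioned-throughput behavior, so it's worth running this both right after attach and after sustained reads.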
Another option to consider is using local NVMe storage and pre-loading the data into the AMI. That won't give us instant load times (the docs say it can take up to 5 minutes to copy the AMI image to the machine), but it would still likely be substantially faster than loading from S3, and could perhaps offer better performance than EBS in a cache-miss scenario.
I'm hopeful that an EBS volume will be fast enough -- my recollection is that instances with local instance storage don't come with batteries included, so you have to mount the drives and format them yourself, which would add a lot of extra moving parts to AMI creation. But they are very fast once you get them up and connected to everything else.
Knowing the performance of using EBS for our data caching will help us decide how to move forward on reducing application startup times.
Closing this; we'll re-assess load times after implementing #1138 (which will end up using EBS-backed instance storage).
My testing for #960 revealed that the `deserialize` operation itself is not a significant cause of our slow startup time loading TopoJSON layers - suggesting that even in production, the time is predominantly spent loading data from S3. We should explore options for loading the assets from disk instead as a way to improve startup times (caching, a centralized EBS volume, storing the assets in the Docker container?), as well as try to estimate how much time savings we would expect to see.