Speed was pretty reasonable from S3, but it did vary a lot, from 5 MiB/s to 20 MiB/s at different times. I'm thinking three things could explain it or speed it up:
We actually don't load `input.geojson` on the server, or anything in the tile directory. The files we're primarily concerned with for server load are actually the `.buf` files - `topo.buf` being the primary & very large one, plus a dozen or so much smaller `.buf` files.
Some other factors I think are likely to come into play are instance networking speed and potentially EBS bandwidth. It looks like the instances we use have burstable IO for both EBS and networking (that's what "up to" means in the table at https://aws.amazon.com/ec2/instance-types/r5/), and EBS and network bandwidth are listed at different speeds.
Benchmarking results will be interesting to see: if the app is bandwidth-limited, then we might be accidentally paying the networking costs twice by loading S3 -> EBS before loading into memory. On the other hand, if the limiting factor is how much bandwidth S3 is willing to push out to a single client (20 MiB/s doesn't seem all that close to 10 Gbps, or even 4.75 Gbps), then maybe we could gain some speed by downloading in multiple threads.
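As a rough illustration of the multi-threaded idea, here's a minimal sketch that splits a single S3 object into ranged GETs fetched concurrently, assuming the AWS SDK for JavaScript v3. The part count, bucket/key names, and the `downloadInParallel` helper are hypothetical, not anything in our codebase:

```typescript
import { S3Client, GetObjectCommand, HeadObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Download one object as N concurrent ranged GETs and reassemble in order.
// Assumes the object is much larger than the part count (true for topo.buf).
async function downloadInParallel(bucket: string, key: string, parts = 8): Promise<Buffer> {
  // HEAD the object to learn its size so we can compute byte ranges.
  const head = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  const size = head.ContentLength ?? 0;
  const chunk = Math.ceil(size / parts);

  const pieces = await Promise.all(
    Array.from({ length: parts }, async (_, i) => {
      const start = i * chunk;
      const end = Math.min(start + chunk - 1, size - 1);
      // Each ranged GET uses its own connection, so total throughput isn't
      // capped by what S3 will push down a single stream.
      const res = await s3.send(
        new GetObjectCommand({ Bucket: bucket, Key: key, Range: `bytes=${start}-${end}` })
      );
      return Buffer.from(await res.Body!.transformToByteArray());
    })
  );
  return Buffer.concat(pieces);
}

// e.g. const buf = await downloadInParallel("some-data-bucket", "TX/topo.buf");
```

If the benchmarks show S3 throttling per-connection throughput, something like this should scale close to linearly until we hit the instance's network ceiling.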
I did some investigative work as part of #1128, which revealed that downloading from S3 is not merely a large portion of application startup time - it is nearly all of it.
Even for our largest region (TX), loading the file from disk and deserializing it took only 12 seconds (!!!).
We should definitely implement file caching.
I'm thinking we want something like the following:
Cache `topo.buf` on an EBS volume (and probably the other `.buf` and `.json` files the server consumes on EBS as well for data consistency, but that's not important from a performance standpoint).
We could either configure tasks to use the most recent EBS snapshot at the time of startup (and update data using our existing procedure of cycling in new tasks), or we could cycle out which EBS volume is associated with the currently running tasks w/o restarting them. If we switch back to Fargate at some point in the future (which we should consider after implementing the various performance improvements we have planned), we'd have to replace EBS w/ EFS, but conceptually everything would remain the same.
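A minimal sketch of the caching half of this, checking the volume before falling back to S3 - the `DATA_CACHE_DIR` env var, mount point, and `getBuf` helper are all hypothetical placeholders, not existing app code:

```typescript
import { createWriteStream, promises as fs } from "fs";
import { pipeline } from "stream/promises";
import { Readable } from "stream";
import * as path from "path";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
// Assumed mount point for the data volume; not an existing app setting.
const CACHE_DIR = process.env.DATA_CACHE_DIR ?? "/data/cache";

// Return a local path for the requested file, downloading it only on a miss.
async function getBuf(bucket: string, key: string): Promise<string> {
  const cached = path.join(CACHE_DIR, key);
  try {
    await fs.access(cached); // cache hit: read straight off the EBS volume
  } catch {
    // Cache miss: stream from S3 to disk so the next startup skips S3 entirely.
    await fs.mkdir(path.dirname(cached), { recursive: true });
    const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    await pipeline(res.Body as Readable, createWriteStream(cached));
  }
  return cached;
}
```

A real implementation would want to write to a temp file and rename it into place so a crash mid-download can't leave a truncated `topo.buf` to be served on the next startup.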
As Derek pointed out, my tests were done locally using a nice NVMe SSD, so we may not be able to expect 12-second load times out of EBS: https://github.com/PublicMapping/districtbuilder/pull/1128#discussion_r803746281
It's worth benchmarking EBS performance to see what we can expect.
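One quick way to get a ballpark number is a read-throughput sketch like the one below. The file path is a placeholder, and you'd want to run it against a cold file (e.g. on a freshly attached volume) so the page cache doesn't flatter the result:

```typescript
import { createReadStream, statSync } from "fs";

// Stream a large file and report sustained read throughput in MiB/s.
async function benchmarkRead(file: string): Promise<void> {
  const bytes = statSync(file).size;
  const start = process.hrtime.bigint();
  for await (const _chunk of createReadStream(file, { highWaterMark: 1 << 20 })) {
    // Discard chunks; we only care about elapsed wall-clock time.
  }
  const seconds = Number(process.hrtime.bigint() - start) / 1e9;
  console.log(`${file}: ${(bytes / 1024 / 1024 / seconds).toFixed(1)} MiB/s over ${seconds.toFixed(2)}s`);
}

// e.g. node benchmark.js /data/cache/TX/topo.buf
benchmarkRead(process.argv[2] ?? "topo.buf").catch(console.error);
```

Keep in mind that gp2/gp3 volumes have their own burst and provisioned-throughput behavior, so it's worth running this both right after attach and after sustained reads.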
Another option to consider is using local NVMe storage and pre-loading the data into the AMI. That won't give us instant load times (the docs say it can take up to 5 minutes to copy the AMI image to the machine), but it would still likely be substantially faster than loading from S3, and could perhaps offer better performance than EBS in a cache-miss scenario.
I'm hopeful that an EBS volume will be fast enough -- my recollection is that instances with local instance storage don't come with batteries included, so you have to mount the drives and format them yourself, which would add a lot of extra moving parts to AMI creation. But they are very fast once you get them up and connected to everything else.
Knowing the performance of using EBS for our data caching will help us decide how to move forward on reducing application startup times.
Closing this; we'll re-assess load times after implementing #1138 (which will end up using EBS-backed instance storage).
My testing for #960 revealed that the `deserialize` operation itself is not a significant cause of our slow startup time loading TopoJSON layers - suggesting that even in production, the time is predominantly spent loading data from S3. We should explore options for loading the assets from disk instead as a way to improve startup times (caching, a centralized EBS volume, storing the assets in the Docker container?), as well as try to estimate how much time savings we would expect to see.