azavea / pfb-network-connectivity

PFB Bicycle Network Connectivity

Determine resource requirements for Batch jobs #839

Closed rbreslow closed 2 years ago

rbreslow commented 3 years ago

The Batch jobs have an existing shape, but when we move to a managed CE we'll have an opportunity to choose different instance types with more flexibility around temporary storage. We should determine what the minimum vCPU, memory, and temporary storage requirements are.

The general shape of things as they are now:

Some questions to guide thinking:

KlaasH commented 2 years ago

Ok, I did some looking into this. This isn't an answer, exactly, but it narrows things down and hopefully gives us a path to an answer.

Current resources

CPUs

8 vCPUs. This was increased from 2 in PR #355 because we had to stop running multiple jobs per instance, so it made sense to fully utilize the instances with each job. (Actually it was first increased from 2 to 7, leaving one vCPU for the jobs that wrote out static tile pyramids. Then, when we switched to Tilegarden and didn't need that tiling task anymore, that last vCPU was added to the analysis job's allocation.)

Memory

Total memory: 30GB. This was increased in PR #355 from 12.3GB. The WORK_MEM, SHARED_BUFFERS, and MAINTENANCE_WORK_MEM variables were also increased (doubled).

I think the total memory a job needs is mostly a function of the size of the network. It's possible that having more CPUs available results in more working memory being used during calculations, but I think the peak memory needs are driven by how big the working tables get in the steps that calculate connectivity scores, and the configuration variables above will cap how much memory at least some operations can use. My guess is that 30GB is above the point where we would get any performance benefit from adding more, and quite possibly we could trim it a bit without seeing a performance cost.
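
If we want to confirm what a running analysis database is actually using, a quick spot check like this would do it (a hypothetical check, not something the existing scripts run; it assumes psql can reach the database from inside the analysis container):

# Show the effective memory-related settings of the analysis database.
psql -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('work_mem', 'shared_buffers', 'maintenance_work_mem');"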

Storage

The scratch space available on the instances is currently 1,900 GB, which is a lot. Issue #118 says "Production jobs will require up to ~120GB of disk space". I suspect the size of the storage included with our selected instance type might have increased over time. Or possibly we were over-allocated from the beginning because there was no option in the middle between "not enough" and "too much". We were originally running up to 4 jobs at a time on each instance, but that would still only use up a quarter of the scratch space.

I'm not sure how that "~120GB" number was measured. The actual outputs of the analysis are a few GB at most, so as with RAM, I think the storage needs would be driven by what PostgreSQL uses to store and run calculations on the blocks and ways tables.

The PFB_TEMP_FILE_LIMIT is currently set to ~26GB in production-pfb-analysis-run-job.json. The comment about that variable in setup_database.sh says "set to ~1/2 of available disk space".
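
For illustration only (this is not the actual code in setup_database.sh), an environment variable like that typically ends up applied to the server as the temp_file_limit setting, with the value interpreted in kB when no unit is given:

# Hypothetical sketch of applying the limit from the environment.
psql -c "ALTER SYSTEM SET temp_file_limit = '${PFB_TEMP_FILE_LIMIT}';"
psql -c "SELECT pg_reload_conf();"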

Conclusions & plans

So yeah, the way we're using resources right now isn't very efficient. I think it got this way by evolving in response to a few constraints--first needing more scratch space than we could get on most instances, then switching to "1 job per instance" and increasing the resources for each job rather than decreasing the size of the instances.

We know what we've allocated currently is enough, but we don't know if the storage and memory we're providing are only a little above what the biggest jobs require or if they're significantly higher than what's needed. I suspect the latter.

The only method I can think of to figure out how much we actually need at the upper limit would be to run a huge job (like NYC) and monitor the memory and storage stats throughout the process. We could also run multiple smaller jobs and try to figure out the shape of the curve between network size and resource requirements, but that seems like more work overall to get a less reliable answer.

Batch compute environment options

Managed Batch compute environments these days can use either EC2 instances or Fargate tasks. For this high-volume but very intermittent workload, Fargate seems like a good option. We would still need to know what amounts of resources to allocate to each task, but we wouldn't have to take the additional step of identifying the closest EC2 instance type to what we need, we would just set the resource parameters for the task itself.

I'm not quite sure whether Fargate is actually an option, though, due to storage needs. You can configure extra ephemeral storage, but only up to 200GiB, and I'm not positive that option is available for Batch tasks. I think it is, but Fargate for Batch doesn't support some job definition parameters, and I'm not entirely clear on whether that includes the ones needed for configuring and attaching ephemeral storage.
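
For reference, ECS task definitions for Fargate do take an ephemeral storage setting; whether a Batch job definition accepts the equivalent is the open question above. A sketch of the ECS form, with the family, image, and CPU/memory values as arbitrary placeholders:

# Placeholder task definition just to show where ephemeral storage is configured.
aws ecs register-task-definition \
    --family pfb-storage-probe \
    --requires-compatibilities FARGATE \
    --network-mode awsvpc \
    --cpu 1024 \
    --memory 2048 \
    --ephemeral-storage sizeInGiB=200 \
    --container-definitions '[{"name": "probe", "image": "busybox", "essential": true}]'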

Next steps

  1. Run a big test job:
    • Revise the staging analysis job parameters to bring them closer to the production ones (maybe not all the way, but at least up to 4 CPUs and a good bit of RAM and storage)
    • Spin up an instance in the staging Batch cluster
    • Kick off an analysis job for NYC
    • Check the monitoring statistics to make sure they'll tell us what we need to know re CPU, RAM, and storage utilization. If not, open a tmux terminal on the underlying instance and run monitoring commands to capture that info.
  2. If the maximum resources used are such that they could fit on Fargate (i.e. it doesn't run out of memory or use more than 200GiB of storage), try to define a Fargate compute environment and task.
  3. If that turns out not to be possible, identify an EC2 instance type that we can use in a managed Batch compute environment and that has the resources we need with minimal excess capacity.
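
Re item 3: if we do end up on the EC2 path, something like this could shortlist candidate instance types once we know the target resources (a sketch; the 4-vCPU / 12 GiB numbers are placeholders, not a decision):

# List instance types with a given vCPU count and at least a given amount of
# memory, along with any local instance storage they include.
aws ec2 describe-instance-types \
    --query 'InstanceTypes[?VCpuInfo.DefaultVCpus==`4` && MemoryInfo.SizeInMiB>=`12288`].[InstanceType, MemoryInfo.SizeInMiB, InstanceStorageInfo.TotalSizeInGB]' \
    --output table
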
KlaasH commented 2 years ago

I ran a NYC job on my new laptop, since it was there and has specs fairly similar to the EC2 instances we've been using. I used df and landscape-sysinfo to report disk usage and system stats every 5 seconds, writing the results to a file.

# Log system load, memory, and disk usage to a file every 5 seconds while the
# analysis runs. The unquoted echo intentionally collapses whitespace so the
# values can be pulled out by field position with `cut` afterwards.
while true;
    do echo $(date) Sysinfo: $(landscape-sysinfo --sysinfo-plugins=Load,Disk,Memory) Disk used: $(df / |grep nvme|awk '{ print $3 }') \
        | tee -a monitoring_output;
    sleep 5;
done

Memory

# Field 14 of each logged line is the memory-usage percentage from landscape-sysinfo.
cat monitoring_output | cut -d ' ' -f 14 | sort -uh

The max value was 49%. System memory is 32 GiB, so that's 16 GiB. But that's the memory state of the whole system, so not all of it is attributable to the analysis. It's hard to know exactly how much to subtract, since system memory usage probably fluctuates a bit and there would be some overhead in the Batch environment as well, but after the analysis stopped the number only went down to 17%, so it's probably fair to reduce the amount we attribute to the analysis by 5 GiB or so.

The usage peaked very early in the process, then went back down to 19% and stayed in the 19%-25% range (1-3 GiB, adjusted for baseline) for the remainder. If it would be useful to reduce how much RAM we need, we could identify the exact step that causes the spike and check whether it crashes with less RAM or adapts to use less when less is available (i.e. whether it's actually assembling something big that has to fit in RAM all at once and will fail if it can't get enough, or whether much of the consumption is just some sort of buffering).

Note: my system is configured with 2 GB of swap, which started out fully unused and slowly filled up over the course of the analysis. Since the system memory was never more than half-used, my guess is that that was the result of an automatic OS process to move things to swap if they've been sitting idle in memory for long enough, rather than anything the analysis was doing.

Storage

# Field 26 is the "Disk used" value from df (used 1K blocks); take the minimum
# and maximum seen over the run.
cat monitoring_output | cut -d ' ' -f 26 | sort -uh | head -1
cat monitoring_output | cut -d ' ' -f 26 | sort -uh | tail -1

217989780 - 94554416 = 123435364 (1K blocks) ≈ 124 GB

The disk usage seems to increase fairly steadily over the course of the analysis, though once the connectivity is calculated and it starts working on access scores, the rate of increase slows down. As far as I was able to observe, there weren't any big temporary spikes.

Note re PostgreSQL storage space parameters: There was one error message due to the temporary file size limit not being high enough for a file one of the index commands wanted to create:

ERROR: temporary file size exceeds temp_file_limit (10485760kB)
STATEMENT: CREATE UNIQUE INDEX IF NOT EXISTS idx_neighborhood_rchblrdshistrss_b ON generated.neighborhood_reachable_roads_high_stress (base_road, target_road);
psql:../connectivity/reachable_roads_high_stress_cleanup.sql:1: ERROR: temporary file size exceeds temp_file_limit (10485760kB)
ERROR: temporary file size exceeds temp_file_limit (10485760kB)
STATEMENT: CREATE INDEX IF NOT EXISTS idx_neighborhood_rchblrdshistrss_t ON generated.neighborhood_reachable_roads_high_stress (target_road);
psql:../connectivity/reachable_roads_high_stress_cleanup.sql:2: ERROR: temporary file size exceeds temp_file_limit (10485760kB)

It's possible the storage consumption would have been higher if that had succeeded--if it bailed out on creating an index, then the storage needed for the finished index would never have been claimed. I don't think that means the total storage number above is very far off, but I do think we should increase that temporary file size parameter, maybe by a lot, and probably buffer the storage allocation a little more than we otherwise would have.

Processor

Some of the tasks in the analysis are multi-threaded (using GNU parallel) and some are single-threaded. The monitoring command doesn't report CPU utilization directly, but from observation while the job was running, system load seems to line up pretty closely with CPU utilization in this case. I ran the following to produce a table of load over time by pulling out the load values, rounding to the nearest integer, then counting the occurrences of each value.

# Field 11 is the system load average; round to the nearest integer and count
# how many samples landed on each value.
cat monitoring_output | cut -d ' ' -f 11 | awk '{printf("%d\n",$1 + 0.5)}' | sort -h | uniq -c

The answer could be different for smaller jobs, if the different operations scale differently, but for this job it was using all 8 CPUs approximately 25% of the time and running single-threaded 75% of the time.

Put another way: the parallel 25% of the wall time scales inversely with the number of CPUs while the single-threaded 75% stays fixed. So compared to the 8 CPUs it had for this run, with 4 CPUs we would expect the whole job to take 25% longer, with 2 we'd expect it to take 75% longer, and with no parallelization at all we'd expect it to take 175% longer.
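
A quick arithmetic check of those numbers, assuming the 25% parallel / 75% single-threaded split observed above:

# Estimated runtime relative to the 8-CPU run: the serial fraction stays fixed,
# the parallel fraction scales with the CPU count.
for cpus in 8 4 2 1; do
    echo "$cpus CPUs: $(echo "scale=2; 0.75 + 0.25 * 8 / $cpus" | bc)x"
done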

Conclusion

I think we should be fine allocating 4 CPUs, 12 GiB of RAM, and 150 GB of storage to a job. That would be sufficient for the NYC job and ample for any other jobs. It's also within the bounds of what's available in a Fargate task, provided it's possible to allocate additional ephemeral storage space for Fargate tasks running in Batch, not just ones running in ECS.

So I think the next step is to try that--manually create a Fargate-based Batch compute environment and feed it an analysis job.
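
Roughly what that would look like with the CLI, as a sketch: the subnet, security group, role, and image are placeholders, and the ephemeralStorage block in the job definition is exactly the part I'm not sure Batch will accept.

# 1. Managed Fargate compute environment (Batch handles capacity; we only cap total vCPUs).
aws batch create-compute-environment \
    --compute-environment-name pfb-analysis-fargate \
    --type MANAGED \
    --compute-resources '{
        "type": "FARGATE",
        "maxvCpus": 16,
        "subnets": ["<subnet-id>"],
        "securityGroupIds": ["<security-group-id>"]
    }'

# 2. Job queue feeding that compute environment.
aws batch create-job-queue \
    --job-queue-name pfb-analysis-fargate-queue \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=pfb-analysis-fargate

# 3. Job definition with the resources from the conclusion above; whether the
#    ephemeralStorage block is honored here is the thing to verify.
aws batch register-job-definition \
    --job-definition-name pfb-analysis-fargate \
    --type container \
    --platform-capabilities FARGATE \
    --container-properties '{
        "image": "<analysis-image>",
        "executionRoleArn": "<execution-role-arn>",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "12288"}
        ],
        "ephemeralStorage": {"sizeInGiB": 150},
        "networkConfiguration": {"assignPublicIp": "ENABLED"}
    }'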