aws / sagemaker-spark-container

The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.
Apache License 2.0

Revise default Spark and Yarn configs based on instance info #8

Closed mmeidl closed 4 years ago

mmeidl commented 4 years ago

Add build-time script to fetch EC2 instance type info config file.
Bootstrapper checks instance type/count info to compute and set Spark and Yarn configs.
Basic strategy is 1 executor per instance using all available cores and memory, minus 2GB for driver memory.

Issue #, if available:

Description of changes:

This change lets the SageMaker (SM) Spark container set reasonable default Spark and Yarn configs that maximize cluster resource utilization. A build-time script queries EC2 for the latest instance metadata, which is retained as a config file in the container image. At container runtime, the Bootstrapper class reads the SageMaker ProcessingJob metadata to determine the instance count and type, then uses this to look up instance type info such as available memory and CPU cores. If any of this metadata is missing, the Bootstrapper falls back to estimating available resources with psutil. From all this info, the Bootstrapper calculates default resource configs for Spark and Yarn, borrowing heuristics from EMR, and writes them into the respective config files (spark-defaults.conf, yarn-site.xml).
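To illustrate the flow described above, here is a minimal sketch of the "1 executor per instance, minus 2GB for driver memory" strategy. It is not the code in this PR: the `INSTANCE_INFO` table, its schema, and all function names are hypothetical, the fallback uses the standard library instead of psutil to stay dependency-free, and the EMR-derived heuristics (for example, any allowance for executor memory overhead) are left out. Only the Spark and Yarn property names are real.

```python
import os
import xml.etree.ElementTree as ET

DRIVER_MEMORY_MB = 2 * 1024  # 2 GB held back for the driver, per the PR summary

# Hypothetical excerpt of the build-time instance info file.
# The real file's schema is not shown in this PR.
INSTANCE_INFO = {
    "ml.m5.xlarge": {"memory_mb": 16 * 1024, "cores": 4},
}


def local_resources():
    """Fallback estimate when metadata is missing (the PR uses psutil here)."""
    cores = os.cpu_count() or 1
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        memory_mb = page_size * os.sysconf("SC_PHYS_PAGES") // (1024 * 1024)
    except (AttributeError, ValueError, OSError):
        memory_mb = 4 * 1024  # arbitrary placeholder for platforms without sysconf
    return {"memory_mb": memory_mb, "cores": cores}


def compute_defaults(instance_type, instance_count, info=INSTANCE_INFO):
    """One executor per instance, using all cores and memory minus driver memory."""
    res = info.get(instance_type) or local_resources()
    executor_mb = res["memory_mb"] - DRIVER_MEMORY_MB
    if executor_mb <= 0:
        raise ValueError("instance has too little memory for this strategy")
    spark = {
        "spark.driver.memory": f"{DRIVER_MEMORY_MB}m",
        "spark.executor.memory": f"{executor_mb}m",
        "spark.executor.cores": str(res["cores"]),
        "spark.executor.instances": str(instance_count),
    }
    yarn = {
        "yarn.nodemanager.resource.memory-mb": str(res["memory_mb"]),
        "yarn.nodemanager.resource.cpu-vcores": str(res["cores"]),
        "yarn.scheduler.maximum-allocation-mb": str(res["memory_mb"]),
    }
    return spark, yarn


def render_spark_defaults(spark):
    """spark-defaults.conf holds one whitespace-separated 'key value' pair per line."""
    return "".join(f"{key} {value}\n" for key, value in sorted(spark.items()))


def render_yarn_site(yarn):
    """yarn-site.xml holds <property><name/><value/></property> entries."""
    root = ET.Element("configuration")
    for name, value in sorted(yarn.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    spark_conf, yarn_conf = compute_defaults("ml.m5.xlarge", instance_count=2)
    print(render_spark_defaults(spark_conf))
    print(render_yarn_site(yarn_conf))
```

Keeping the calculation separate from the rendering makes the heuristic easy to test without touching the config files. A real implementation would likely also merge these defaults with any user-supplied configuration rather than overwrite it.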

This change has been tested with all standard SM/Spark integration tests in this package, as well as a number of benchmark jobs using queries derived from TPC-DS. It was stable on the majority of instance types tested at the 10 GB and 100 GB scales.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mmeidl commented 4 years ago

CodeBuild run succeeded:

SparkContainerCodeBuildProject-Test:d06b8494-4216-40bb-ac56-27f7a10fdb99
Build status
Status: Succeeded
Initiator: Admin/mmeidl-Isengard
Build ARN: arn:aws:codebuild:us-west-2:552588484154:build/SparkContainerCodeBuildProjectTest:d06b8494-4216-40bb-ac56-27f7a10fdb99
Resolved source version: ab9db18ce5a7e0ea7bba8876e93b152da535d993
Start time: Sep 1, 2020 1:57 PM (UTC-7:00)
End time: Sep 1, 2020 2:25 PM (UTC-7:00)
Build number: 128

(ReadOnly: https://tiny.amazon.com/3q5y9mh4/IsenLink)