Booz Allen's lean manufacturing approach for holistically designing, developing and fielding AI solutions across the engineering lifecycle from data processing to model building, tuning, and training to secure operational deployment
As a Data Engineer, I want to use Spark 3.5 so I can leverage the latest enhancements and fixes. #55
We are currently on Spark 3.4.0. Security patches released in 3.4.3 need to be pulled in, but since the upgrade to 3.5 appears straightforward, we can jump directly to the latest version: 3.5.1.
Definition of Done
Replace spark.yarn.executor.failuresValidityInterval with spark.executor.failuresValidityInterval across all files
Replace spark.yarn.max.executor.failures with spark.executor.maxNumFailures across all files
Note spark operator's SparkApplication CRD does not have any specific YAML properties around these configs
Update release notes
Note the upgrade to Spark 3.5
Add migration(s) to table
Add patched CVEs
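Because the spark operator's SparkApplication CRD has no dedicated fields for these settings (per the note above), they would be passed through the generic `sparkConf` map. A hypothetical fragment, with the name and values illustrative only:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-pipeline        # illustrative name
spec:
  sparkConf:
    # New property names as of Spark 3.5 (YARN prefix dropped)
    spark.executor.failuresValidityInterval: "1h"
    spark.executor.maxNumFailures: "4"
```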
BDD Scenarios
Baton migration:
Scenario: My data pipelines are migrated to Spark 3.5.1
Given a file using the Spark config properties spark.yarn.executor.failuresValidityInterval and spark.yarn.max.executor.failures
When the aissemble 1.7 Spark migration is executed
Then the file references are updated to spark.executor.failuresValidityInterval and spark.executor.maxNumFailures respectively
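The scenario's rename rule can be sketched as a simple text substitution. This is only an illustration of the replacement logic, not the actual aiSSEMBLE Baton migration; the function names here are hypothetical.

```python
from pathlib import Path

# Deprecated Spark property names and their 3.5 replacements
# (from the scenario above).
RENAMES = {
    "spark.yarn.executor.failuresValidityInterval": "spark.executor.failuresValidityInterval",
    "spark.yarn.max.executor.failures": "spark.executor.maxNumFailures",
}

def migrate_text(text: str) -> str:
    """Return text with every deprecated Spark property name replaced."""
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text

def migrate_file(path: Path) -> bool:
    """Rewrite a file in place; return True if anything changed."""
    original = path.read_text()
    migrated = migrate_text(original)
    if migrated != original:
        path.write_text(migrated)
        return True
    return False
```

A plain substring replace suffices here because neither old name is a prefix of the other, and neither replacement reintroduces a deprecated name.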
Test Steps
TBD - Stand up two simple Spark/PySpark data pipelines that read some data into Spark, apply a small transform, and write the data back out.
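One of the smoke-test pipelines described above might look like the sketch below. The paths, column name, and config values are illustrative assumptions; the PySpark import is deferred into the function so the module loads even where pyspark is not installed.

```python
# Renamed (non-YARN-prefixed) properties as of Spark 3.5; values are examples.
SPARK_35_CONFS = {
    "spark.executor.failuresValidityInterval": "1h",  # was spark.yarn.executor.failuresValidityInterval
    "spark.executor.maxNumFailures": "4",             # was spark.yarn.max.executor.failures
}

def run_pipeline(input_path: str, output_path: str) -> None:
    """Read CSV in, apply a small transform, and write the data back out."""
    # Deferred import: keeps this module importable without pyspark present.
    from pyspark.sql import SparkSession, functions as F

    builder = SparkSession.builder.appName("spark-3.5-smoke-test")
    for key, value in SPARK_35_CONFS.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    df = spark.read.option("header", True).csv(input_path)
    # Small transform: add a doubled copy of a numeric column.
    df = df.withColumn("amount_doubled", F.col("amount").cast("double") * 2)
    df.write.mode("overwrite").parquet(output_path)
    spark.stop()
```

A PySpark counterpart and a Scala/Java Spark counterpart exercising the same renamed configs would cover both pipeline flavors the test step calls for.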