boozallen / aissemble

Booz Allen's lean manufacturing approach for holistically designing, developing, and fielding AI solutions across the engineering lifecycle, from data processing to model building, tuning, and training, to secure operational deployment

Feature: Java 17 Upgrade Support Series - Pipeline persistence #398

Closed by ewilkins-csi 1 month ago

ewilkins-csi commented 1 month ago

Description

In #356 we upgraded the bulk of Spark pipeline functionality to Java 17. This ticket tackles the follow-on work of updating the persistence logic of pipeline steps. The available persistence types are:

  • Hive
  • Delta Lake
  • Neo4j
  • Elasticsearch
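
For orientation, each persist type ultimately boils down to a Spark write like the ones sketched below. This is a minimal illustration, not the generated StepBase code: the table name, paths, index, and label are hypothetical, connection settings are omitted, and the Neo4j and Elasticsearch writes assume their respective Spark connectors are on the classpath.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public final class PersistSketches {

        // Hive: persist as a managed table (table name is hypothetical)
        static void saveToHive(Dataset<Row> data) {
            data.write().mode(SaveMode.Append).saveAsTable("hive_step_table");
        }

        // Delta Lake: format-based save to a storage path (path is hypothetical)
        static void saveToDeltaLake(Dataset<Row> data) {
            data.write().format("delta").mode(SaveMode.Append).save("/data/delta/delta_lake_step");
        }

        // Neo4j: via the Neo4j Spark connector (label is hypothetical; url/auth options omitted)
        static void saveToNeo4j(Dataset<Row> data) {
            data.write().format("org.neo4j.spark.DataSource")
                    .mode(SaveMode.Append)
                    .option("labels", ":Record")
                    .save();
        }

        // Elasticsearch: via the ES-Hadoop connector (index name is hypothetical; es.nodes omitted)
        static void saveToElasticsearch(Dataset<Row> data) {
            data.write().format("es").mode(SaveMode.Append).save("records");
        }
    }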

DOD

Acceptance criteria required to realize the requested feature

Test Strategy/Script

Main Test

  1. Create a new project on 1.10.0-SNAPSHOT.
    mvn archetype:generate '-DarchetypeGroupId=com.boozallen.aissemble' \
                           '-DarchetypeArtifactId=foundation-archetype' \
                           '-DarchetypeVersion=1.10.0-SNAPSHOT' \
                           '-DgroupId=org.test' \
                           '-Dpackage=org.test' \
                           '-DprojectGitUrl=test.org/test.git' \
                           '-DprojectName=Final Test 398' \
                           '-DartifactId=final-398' \
    && cd final-398
  2. Set your Java version to 17 if it is not already
  3. Add a Spark Pipeline model with a step for each persist type & mode.
  4. Fully generate the project by running mvn clean install and following the manual actions
  5. Unzip the pipeline step code in the root of the project to add it to the pipeline. This logic passes a sample dataset between each storage format, appending new data each time (sketched after the log excerpt below).
  6. Build the project without the cache and follow the last manual action.
    mvn clean install -Dmaven.build.cache.skipCache
  7. Deploy the project.
    tilt up; tilt down
  8. Once all the resources have started (may take a while), execute the pipeline. (It will be persist-pipeline in Tilt if you've used the provided model.)
  9. Verify that the pipeline completes successfully.
  10. Verify that a dataset was successfully written to each persistence type. _Tip: The logs are quite busy so filtering by "StepBase:" via the Tilt UI will help you find these lines much faster._
    ...
    24/10/11 15:35:14 INFO HiveStepBase: Saved HiveStep to Hive
    ...
    24/10/11 15:35:34 INFO DeltaLakeStepBase: Saved DeltaLakeStep to Delta Lake
    ...
    24/10/11 15:36:46 INFO Neo4jStepBase: Saved Neo4jStep to Neo4j
    ...
    24/10/11 15:36:48 INFO ElasticsearchStepBase: Saved ElasticsearchStep to Elasticsearch
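
The hand-off logic referenced in step 5 amounts to something like the sketch below for each step: read what the previous step persisted, append a new record, and write to this step's format. Class, table, and path names are hypothetical; the real implementations are the unzipped step classes.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class DeltaLakeStep /* extends the generated DeltaLakeStepBase */ {

        protected Dataset<Row> executeStepImpl(SparkSession spark) {
            // Read whatever the previous (Hive) step persisted
            Dataset<Row> fromHive = spark.table("hive_step_table");

            // Append a fresh sample record so each hand-off grows the dataset
            Dataset<Row> sample = spark.sql("SELECT 'delta' AS source, current_timestamp() AS ts");
            Dataset<Row> combined = fromHive.unionByName(sample, true);

            // Persist in this step's storage format (Delta Lake here)
            combined.write().format("delta").mode(SaveMode.Append).save("/data/delta/delta_lake_step");
            return combined;
        }
    }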

Upgrade Test

  1. Create a new project on 1.9.2.
    mvn archetype:generate '-DarchetypeGroupId=com.boozallen.aissemble' \
                           '-DarchetypeArtifactId=foundation-archetype' \
                           '-DarchetypeVersion=1.9.2' \
                           '-DgroupId=org.test' \
                           '-Dpackage=org.test' \
                           '-DprojectGitUrl=test.org/test.git' \
                           '-DprojectName=Upgrade Test 398' \
                           '-DartifactId=upgrade-398' \
    && cd upgrade-398
  2. Set your Java version to 11 if it is not already
  3. Add the delta-core_2.12 dependency to the root pom.xml
    ...
      <dependencies>
    +    <dependency>
    +      <groupId>io.delta</groupId>
    +      <artifactId>delta-core_2.12</artifactId>
    +      <version>2.4.0</version>
    +    </dependency>
        <!-- START: workaround to get maven build cache invalidation on new SNAPSHOTS of commonly updated plugins -->
        <dependency>
    ...
  4. Add Spark and PySpark pipelines with a step that persists to Delta Lake
  5. Fully generate the project by running mvn clean install and following the manual actions
  6. Verify the following values files have the delta-core and delta-storage dependencies listed in sparkApp.spec.deps.packages:
    • upgrade-398-pipelines/spark-pipeline/src/main/resources/apps/spark-pipeline-base-values.yaml
    • upgrade-398-pipelines/pyspark-pipeline/src/pysparkpipeline/resources/apps/pyspark-pipeline-base-values.yaml
  7. Update the build-parent version in the root pom.xml to 1.10.0-SNAPSHOT
  8. Update the smallrye-reactive-messaging-kafka dependency within upgrade-398-pipelines/spark-pipeline/pom.xml to look like the following (workaround for the #263 bug):
        <dependency>
            <groupId>io.smallrye.reactive</groupId>
            <artifactId>smallrye-reactive-messaging-kafka</artifactId>
            <version>${version.smallrye.reactive.messaging}</version>
        </dependency>
  9. Set your Java version to 17 if it is not already
  10. Run mvn org.technologybrewery.baton:baton-maven-plugin:baton-migrate (running install directly would fail with compile errors because of the missing migrations for #263).
  11. Verify the following values files have the delta jars updated to 3.2.1 and delta-core has been renamed to delta-spark:
    • upgrade-398-pipelines/spark-pipeline/src/main/resources/apps/spark-pipeline-base-values.yaml
    • upgrade-398-pipelines/pyspark-pipeline/src/pysparkpipeline/resources/apps/pyspark-pipeline-base-values.yaml
  12. Verify the dependency added in step 3 has changed from delta-core to delta-spark and the version is set to ${version.delta}. (An optional smoke check is sketched below.)
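
For the smoke check mentioned in step 12, a throwaway class along these lines (hypothetical name and local path) will only run successfully if the delta-spark jars resolved by the migration are actually on the upgraded project's classpath:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public final class DeltaSmokeTest {
        public static void main(String[] args) {
            // Delta needs its SQL extension and catalog on the session; in the
            // deployed pipeline the generated charts/values wire this up instead.
            SparkSession spark = SparkSession.builder()
                    .appName("delta-spark-smoke-test")
                    .master("local[*]")
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                    .getOrCreate();

            // Round-trip a trivial dataset through Delta; this fails fast if the
            // delta-spark coordinates did not resolve.
            Dataset<Row> data = spark.range(10).toDF();
            data.write().format("delta").mode(SaveMode.Overwrite).save("/tmp/delta-smoke-398");
            spark.read().format("delta").load("/tmp/delta-smoke-398").show();
            spark.stop();
        }
    }
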
ewilkins-csi commented 1 month ago

DoD with @carter-cundiff @csun-cpointe

ewilkins-csi commented 1 month ago

OTS with @csun-cpointe

csun-cpointe commented 1 month ago

Final tests passed!!

Main Test verification: two screenshots attached (2024-10-11, 2:17 PM and 2:18 PM)

Upgrade Test verification: two screenshots attached (2024-10-11, 2:33 PM and 2:36 PM)