Feature: Java 17 Upgrade Support Series - Pipeline persistence

ewilkins-csi commented 1 month ago

Description

In #356 we upgraded the bulk of Spark pipeline functionality to Java 17. This tackles the follow-on work of updating the persistence logic of pipeline steps. The available persistence types are:

hive
delta-lake
rdbms
- Note: this used to be the postgres option but was updated to be rdbms. However, because many of the Java configs still refer to postgres it's unclear if this was tested with anything other than postgres.
- Update: Java's postgres and rdbms options are separate but entangled so neither works
elasticsearch
neo4j

DOD

Acceptance criteria required to realize the requested feature

[x] A pipeline can write to and read from the following data sources while running on the new Java 17 Spark Worker:
- [x] hive
- [x] delta-lake
- ~rdbms~
  - ~MySql (fallback to Postgres if RDBMS hasn't been properly generalized)~
  - ~Clean up config names (if time allows)~ ---- stretch goal ----
- [x] elasticsearch
  - Did not test reading from ES as I couldn't quite figure out the syntax and it was a low-priority item
- [x] neo4j
  - Note: There is a pre-existing issue with Neo4j that prevents reading data from a table. The data can be saved, just not read.
- [x] Write migrations for updates
- [x] Migrate SparkApplication delta-lake dependencies
- [x] Migrate POM delta-lake dependencies

Test Strategy/Script

Main Test

Create a new project on 1.10.0-SNAPSHOT.

mvn archetype:generate '-DarchetypeGroupId=com.boozallen.aissemble' \
                       '-DarchetypeArtifactId=foundation-archetype' \
                       '-DarchetypeVersion=1.10.0-SNAPSHOT' \
                       '-DgroupId=org.test' \
                       '-Dpackage=org.test' \
                       '-DprojectGitUrl=test.org/test.git' \
                       '-DprojectName=Final Test 398' \
                       '-DartifactId=final-398' \
&& cd final-398

Set your Java version to 17 if it is not currently
Add a Spark Pipeline model with a step for each persist type & mode.
Fully generate the project by running mvn clean install and following manual actions
Unzip the pipeline step code in the root of the project to add it to the pipeline. This logic passes a sample dataset between each storage format appending new data each time.
Build the project without the cache and follow the last manual action.
```
mvn clean install -Dmaven.build.cache.skipCache
```
Deploy the project.
```
tilt up; tilt down
```
Once all the resources have started (may take awhile) execute the pipeline. (It will be persist-pipeline in Tilt if you've used the provided model.)
Verify that the pipeline completes successfully.

Verify that a dataset was successfully written to each persistence type. _Tip: The logs are quite busy so filtering by "StepBase:" via the Tilt UI will help you find these lines much faster._

...
24/10/11 15:35:14 INFO HiveStepBase: Saved HiveStep to Hive
...
24/10/11 15:35:34 INFO DeltaLakeStepBase: Saved DeltaLakeStep to Delta Lake
...
24/10/11 15:36:46 INFO Neo4jStepBase: Saved Neo4jStep to Neo4j
...
24/10/11 15:36:48 INFO ElasticsearchStepBase: Saved ElasticsearchStep to Elasticsearch

Upgrade Test

Create a new project on 1.9.2.

mvn archetype:generate '-DarchetypeGroupId=com.boozallen.aissemble' \
                       '-DarchetypeArtifactId=foundation-archetype' \
                       '-DarchetypeVersion=1.9.2' \
                       '-DgroupId=org.test' \
                       '-Dpackage=org.test' \
                       '-DprojectGitUrl=test.org/test.git' \
                       '-DprojectName=Upgrade Test 398' \
                       '-DartifactId=upgrade-398' \
&& cd upgrade-398

Set your Java version to 11 if it is not currently

Add the delta-core_2.12 dependency to the root pom.xml

...
  <dependencies>
+    <dependency>
+      <groupId>io.delta</groupId>
+      <artifactId>delta-core_2.12</artifactId>
+      <version>2.4.0</version>
+    </dependency>
    <!-- START: workaround to get maven build cache invalidation on new SNAPSHOTS of commonly updated plugins -->
    <dependency>
...

Add Spark and Pyspark pipelines with a step that persists to delta-lake
Fully generate the project by running mvn clean install and following manual actions
Verify the following values files have the delta-core and delta-storage dependencies listed in sparkApp.spec.deps.packages:
- upgrade-398-pipelines/spark-pipeline/src/main/resources/apps/spark-pipeline-base-values.yaml
- _upgrade-398-pipelines/pyspark-pipeline/src/pysparkpipeline/resources/apps/pyspark-pipeline-base-values.yaml
Update the build-parent version in the root pom.xml to 1.10.0-SNAPSHOT

Update the smallrye-reactive-messaging-kafka dependency to look like the following within upgrade-398-pipelines/spark-pipeline/pom.xml (Workaround from #263 bug):

    <dependency>
        <groupId>io.smallrye.reactive</groupId>
        <artifactId>smallrye-reactive-messaging-kafka</artifactId>
        <version>${version.smallrye.reactive.messaging}</version>
    </dependency>

Set your Java version to 17 if it is not currently
Run mvn org.technologybrewery.baton:baton-maven-plugin:baton-migrate (install would fail with compile errors because of missing migrations for #263.)
Verify the following values files have the delta jars updated to 3.2.1 and delta-core has been renamed to delta-spark:
- upgrade-398-pipelines/spark-pipeline/src/main/resources/apps/spark-pipeline-base-values.yaml
- _upgrade-398-pipelines/pyspark-pipeline/src/pysparkpipeline/resources/apps/pyspark-pipeline-base-values.yaml
Verify the dependency added in step 3 has changed from delta-core to delta-spark and the version is set to ${version.delta}