Booz Allen's lean manufacturing approach for holistically designing, developing and fielding AI solutions across the engineering lifecycle from data processing to model building, tuning, and training to secure operational deployment
Feature: Java 17 Upgrade Support Series - Pipeline persistence #398
Description

In #356 we upgraded the bulk of Spark pipeline functionality to Java 17. This ticket tackles the follow-on work of updating the persistence logic of pipeline steps. The available persistence types are:
hive
delta-lake
rdbms
Note: this used to be the postgres option but was renamed to rdbms. However, because many of the Java configs still refer to postgres, it is unclear whether this was tested with anything other than Postgres.
Update: Java's postgres and rdbms options are separate but entangled, so neither works.
elasticsearch
neo4j
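For orientation only, each persistence type roughly corresponds to a Spark data source format string passed to `DataFrameWriter.format(...)`. The mapping below is a hypothetical sketch, not this project's code; the format strings are the ones commonly documented for the Delta, Elasticsearch, and Neo4j Spark connectors:

```java
import java.util.Map;

// Hypothetical illustration: a rough mapping from each persistence type in
// this issue to the Spark data source format string its connector commonly
// registers. Not the project's actual configuration.
public class PersistenceFormats {

    private static final Map<String, String> FORMATS = Map.of(
            // Hive writes usually go through saveAsTable rather than format()
            "hive", "hive",
            "delta-lake", "delta",
            "rdbms", "jdbc",
            "elasticsearch", "org.elasticsearch.spark.sql",
            "neo4j", "org.neo4j.spark.DataSource");

    /** Returns the Spark format string for a given persistence type. */
    public static String formatFor(String persistenceType) {
        String format = FORMATS.get(persistenceType);
        if (format == null) {
            throw new IllegalArgumentException("Unknown persistence type: " + persistenceType);
        }
        return format;
    }

    public static void main(String[] args) {
        System.out.println(formatFor("delta-lake")); // prints "delta"
    }
}
```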
DOD
Acceptance criteria required to realize the requested feature
[x] A pipeline can write to and read from the following data sources while running on the new Java 17 Spark Worker:
[x] hive
[x] delta-lake
~rdbms~
~MySql (fallback to Postgres if RDBMS hasn't been properly generalized)~
~Clean up config names (if time allows)~
---- stretch goal ----
[x] elasticsearch
Did not test reading from ES as I couldn't quite figure out the syntax and it was a low-priority item
[x] neo4j
Note: There is a pre-existing issue with Neo4j that prevents reading data from a table. The data can be saved, just not read.
Test Strategy/Script

Main Test

1. Fully generate the project by running `mvn clean install` and following the manual actions.
2. Unzip the pipeline step code in the root of the project to add it to the pipeline. This logic passes a sample dataset between each storage format, appending new data each time.
3. Build the project without the cache and follow the last manual action:
   `mvn clean install -Dmaven.build.cache.skipCache`
4. Deploy the project:
   `tilt up; tilt down`
5. Once all the resources have started (this may take a while), execute the pipeline. (It will be `persist-pipeline` in Tilt if you've used the provided model.)
6. Verify that the pipeline completes successfully.
7. Verify that a dataset was successfully written to each persistence type. _Tip: The logs are quite busy, so filtering by "StepBase:" via the Tilt UI will help you find these lines much faster._
```
...
24/10/11 15:35:14 INFO HiveStepBase: Saved HiveStep to Hive
...
24/10/11 15:35:34 INFO DeltaLakeStepBase: Saved DeltaLakeStep to Delta Lake
...
24/10/11 15:36:46 INFO Neo4jStepBase: Saved Neo4jStep to Neo4j
...
24/10/11 15:36:48 INFO ElasticsearchStepBase: Saved ElasticsearchStep to Elasticsearch
```
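The same "StepBase:" filter works from a terminal if you capture the logs to a file. A minimal sketch (`pipeline.log` and its contents are fabricated stand-ins for the real Tilt output):

```shell
# Fabricated sample of noisy pipeline output, standing in for the real logs.
cat > pipeline.log <<'EOF'
24/10/11 15:35:10 INFO SparkContext: Running job
24/10/11 15:35:14 INFO HiveStepBase: Saved HiveStep to Hive
24/10/11 15:35:20 INFO TaskSetManager: Finished task
24/10/11 15:35:34 INFO DeltaLakeStepBase: Saved DeltaLakeStep to Delta Lake
EOF

# Filtering on "StepBase:" strips the noise down to the persistence lines.
grep "StepBase:" pipeline.log
```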
Upgrade Test

- Update the `build-parent` version in the root `pom.xml` to `1.10.0-SNAPSHOT`.
- Update the `smallrye-reactive-messaging-kafka` dependency within `upgrade-398-pipelines/spark-pipeline/pom.xml` (workaround for the #263 bug).
- Run `mvn org.technologybrewery.baton:baton-maven-plugin:baton-migrate` first; a plain `install` would fail with compile errors because of missing migrations for #263.
- Fully generate the project by running `mvn clean install` and following the manual actions.
- Verify that the `delta-lake` package in `sparkApp.spec.deps.packages` references `${version.delta}`.
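The `smallrye-reactive-messaging-kafka` step above originally said the dependency should "look like the following", but the snippet itself did not survive extraction and is not reproduced here. For orientation only, a declaration for that artifact in a Maven `pom.xml` has this general shape; the version property name is a placeholder, and the actual #263 workaround may pin a different version or add exclusions:

```xml
<dependency>
    <groupId>io.smallrye.reactive</groupId>
    <artifactId>smallrye-reactive-messaging-kafka</artifactId>
    <!-- placeholder version property; the issue's actual workaround
         may pin a specific version or add exclusions -->
    <version>${smallrye-reactive-messaging.version}</version>
</dependency>
```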