boozallen / aissemble

Booz Allen's lean manufacturing approach for holistically designing, developing and fielding AI solutions across the engineering lifecycle from data processing to model building, tuning, and training to secure operational deployment

Feature: As a devops engineer, I want an aissemble-managed helm chart for the Hive metastore service that uses a newer version of Hive, so I have access to the latest security fixes. #127

Closed peter-mcclonski closed 2 months ago

peter-mcclonski commented 2 months ago

Description

In order to improve usability and maintainability, we will be migrating to a v2 chart for the Hive metastore service, keeping a usage pattern similar to the one introduced in #103. This ticket also encompasses #116, updating the underlying Hive metastore version.

Definition of Done

Test Strategy/Script

  1. Generate a new project using the following command:
    mvn archetype:generate -B -DarchetypeGroupId=com.boozallen.aissemble \
                           -DarchetypeArtifactId=foundation-archetype \
                           -DarchetypeVersion=1.8.0-SNAPSHOT \
                           -DartifactId=test-project \
                           -DgroupId=org.test \
                           -DprojectName='Test' \
                           -DprojectGitUrl=test.org/test-project \
      && cd test-project
  2. Add the following pipeline to test-project-pipeline-models/src/main/resources/pipelines/
    {
      "name": "PysparkPersist",
      "package": "com.boozallen",
      "type": {
        "name": "data-flow",
        "implementation": "data-delivery-pyspark"
      },
      "steps": [
        {
          "name": "PersistData",
          "type": "synchronous",
          "persist": {
            "type": "hive"
          }
        }
      ]
    }
  3. Add the following record to test-project-pipeline-models/src/main/resources/records/
    {
      "name": "CustomRecord",
      "package": "com.boozallen.aiops.mda.pattern.record",
      "description": "Example custom record for Pyspark Data Delivery Patterns",
      "fields": [
        {
          "name": "customField",
          "type": {
            "name": "customType",
            "package": "com.boozallen.aiops.mda.pattern.dictionary"
          }
        }
      ]
    }
  4. Add the following dictionary to test-project-pipeline-models/src/main/resources/dictionaries/
    {
      "name": "PysparkDataDeliveryDictionary",
      "package": "com.boozallen.aiops.mda.pattern.dictionary",
      "dictionaryTypes": [
        {
          "name": "customType",
          "simpleType": "string"
        }
      ]
    }
  5. Execute mvn clean install -Dmaven.build.cache.skipCache=true repeatedly, resolving any manual actions presented, until none remain.
  6. Within test-project-deploy/pom.xml, replace aissemble-spark-infrastructure-deploy with aissemble-spark-infrastructure-deploy-v2
  7. Delete the directory test-project-deploy/src/main/resources/apps/spark-infrastructure (steps 6 and 7 can be scripted; see the first sketch after this list)
  8. Delete all references to hive-metastore-service from your Tiltfile
  9. Within test-project-pipelines/test-project-data-access/src/main/resources/application.properties, set quarkus.datasource.jdbc.url to jdbc:hive2://spark-infrastructure-sts-service:10001/default;transportMode=http;httpPath=cliservice (see the sketch after this list)
  10. Within test-project-pipelines/pyspark-persist/src/pyspark_persist/step/persist_data.py, define the implementation for execute_step_impl as follows:
    def execute_step_impl(self) -> None:
        from ..record.custom_record import CustomRecord
        from ..schema.custom_record_schema import CustomRecordSchema

        # Build two sample records to persist
        custom_record = CustomRecord.from_dict({"customField": "foo"})
        record2 = CustomRecord.from_dict({"customField": "bar"})
        df = self.spark.createDataFrame(
            [
                custom_record,
                record2
            ],
            CustomRecordSchema().struct_type
        )
        # Write the DataFrame out through the step's configured "hive" persist type
        self.save_dataset(df, "my_new_table")
  11. Replace the contents of test-project-pipelines/pyspark-persist/src/pyspark_persist/resources/apps/pyspark-persist-dev-values.yaml with the following:
    sparkApp:
      spec:
        image: "test-project-spark-worker-docker:latest"
        sparkConf:
          spark.eventLog.enabled: "false"
          spark.sql.catalogImplementation: "hive"
          spark.eventLog.dir: "s3a://spark-infrastructure/spark-events"
          spark.hadoop.fs.s3a.endpoint: "http://s3-local:4566"
          spark.hadoop.fs.s3a.access.key: "123"
          spark.hadoop.fs.s3a.secret.key: "456"
          spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          spark.hive.server2.thrift.port: "10000"
          spark.hive.server2.thrift.http.port: "10001"
          spark.hive.server2.transport.mode: "http"
          spark.hive.metastore.warehouse.dir: "s3a://spark-infrastructure/warehouse"
          spark.hadoop.fs.s3a.path.style.access: "true"
          spark.hive.server2.thrift.http.path: "cliservice"
          spark.hive.metastore.schema.verification: "false"
          spark.hive.metastore.uris: "thrift://hive-metastore-service:9083/default"
        driver:
          cores: 1
          memory: "2048m"
        executor:
          cores: 1
          memory: "2048m"
  12. Execute mvn clean install -Dmaven.build.cache.skipCache=true once.
  13. Use kubectl apply -f to apply the following yaml (or pipe it via a heredoc; see the sketch after this list):
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: spark-config
    data: {}
  14. To avoid an unrelated bug, open your Tiltfile and remove the entry for pipeline-invocation-service.
  15. Execute tilt up
  16. Once all resources are ready, trigger the pyspark-persist pipeline
  17. Use kubectl get pods | grep data-access to get the name of the data access pod.
  18. Use kubectl exec -it <DATA_ACCESS_POD_NAME> -- bash to enter the data access pod (steps 17 and 18 can be combined; see the sketch after this list).
  19. Execute curl -X POST localhost:8080/graphql -H "Content-Type: application/json" -d '{ "query": "{ CustomRecord(table: \"my_new_table\") { customField } }" }' and ensure that two records are returned, e.g.: {"data":{"CustomRecord":[{"customField":null},{"customField":null}]}}
  20. Note on step 19: if you don't get any values back, execute kubectl get svc | grep sts in a separate terminal; it can take a minute or two to provision the service (see the polling sketch after this list).
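
For convenience, the edits in steps 6 and 7 can be scripted. A minimal sketch, assuming GNU sed (on macOS, use sed -i ''); note the substitution is not idempotent, so run it only once:

    # Step 6: swap the v1 spark-infrastructure deploy chart for the v2 chart
    # (rerunning this would produce "-v2-v2"; it is a one-shot convenience)
    sed -i 's/aissemble-spark-infrastructure-deploy/aissemble-spark-infrastructure-deploy-v2/g' \
        test-project-deploy/pom.xml

    # Step 7: remove the v1 chart resources
    rm -rf test-project-deploy/src/main/resources/apps/spark-infrastructure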
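
The datasource change in step 9 amounts to a single line in application.properties. A sketch assuming the quarkus.datasource.jdbc.url key already exists in the file:

    # Point the data-access service at the Spark Thrift Server over HTTP transport
    sed -i 's|^quarkus.datasource.jdbc.url=.*|quarkus.datasource.jdbc.url=jdbc:hive2://spark-infrastructure-sts-service:10001/default;transportMode=http;httpPath=cliservice|' \
        test-project-pipelines/test-project-data-access/src/main/resources/application.properties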
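
The ConfigMap in step 13 can be applied without saving a file by piping a heredoc to kubectl:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: spark-config
    data: {}
    EOF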
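
Steps 17 and 18 can be combined into a single command, assuming exactly one pod matches data-access:

    # -o name yields "pod/<name>", which kubectl exec accepts directly
    kubectl exec -it "$(kubectl get pods -o name | grep data-access)" -- bash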
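
For the note in step 20, a small poll loop saves re-checking by hand; the service name is taken from the JDBC URL in step 9:

    # Wait for the Spark Thrift Server service to be provisioned
    until kubectl get svc spark-infrastructure-sts-service >/dev/null 2>&1; do
        echo "waiting for spark-infrastructure-sts-service..."
        sleep 10
    done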

References/Additional Context

Cho-William commented 2 months ago

OTS completed

csun-cpointe commented 2 months ago

final test passed!!