boozallen / aissemble

Booz Allen's lean manufacturing approach for holistically designing, developing and fielding AI solutions across the engineering lifecycle from data processing to model building, tuning, and training to secure operational deployment

Feature: As a devops engineer, I want an aissemble-managed helm chart for the Hive metastore service that uses a newer version of Hive, so I have access to the latest security fixes. #127

Closed peter-mcclonski closed 2 months ago

peter-mcclonski commented 2 months ago

Description

In order to improve usability and maintainability, we will be migrating to a v2 chart for the Hive metastore service, keeping a usage pattern similar to the one introduced in #103. This ticket also encompasses #116, updating the underlying Hive metastore version.

Definition of Done

Test Strategy/Script

  1. Generate a new project using the following command:
    mvn archetype:generate -B -DarchetypeGroupId=com.boozallen.aissemble \
                           -DarchetypeArtifactId=foundation-archetype \
                           -DarchetypeVersion=1.8.0-SNAPSHOT \
                           -DartifactId=test-project \
                           -DgroupId=org.test \
                           -DprojectName='Test' \
                           -DprojectGitUrl=test.org/test-project \
      && cd test-project
  2. Add the following pipeline to test-project-pipeline-models/src/main/resources/pipelines/
    {
      "name": "PysparkPersist",
      "package": "com.boozallen",
      "type": {
        "name": "data-flow",
        "implementation": "data-delivery-pyspark"
      },
      "steps": [
        {
          "name": "PersistData",
          "type": "synchronous",
          "persist": {
            "type": "hive"
          }
        }
      ]
    }
  3. Add the following record to test-project-pipeline-models/src/main/resources/records/
    {
      "name": "CustomRecord",
      "package": "com.boozallen.aiops.mda.pattern.record",
      "description": "Example custom record for Pyspark Data Delivery Patterns",
      "fields": [
        {
          "name": "customField",
          "type": {
            "name": "customType",
            "package": "com.boozallen.aiops.mda.pattern.dictionary"
          }
        }
      ]
    }
  4. Add the following dictionary to test-project-pipeline-models/src/main/resources/dictionaries/
    {
      "name": "PysparkDataDeliveryDictionary",
      "package": "com.boozallen.aiops.mda.pattern.dictionary",
      "dictionaryTypes": [
        {
          "name": "customType",
          "simpleType": "string"
        }
      ]
    }
  5. Execute mvn clean install -Dmaven.build.cache.skipCache=true repeatedly, resolving any manual actions presented, until none remain.
  6. Within test-project-deploy/pom.xml, replace aissemble-spark-infrastructure-deploy with aissemble-spark-infrastructure-deploy-v2
  7. Delete the directory test-project-deploy/src/main/resources/apps/spark-infrastructure (steps 6 and 7 can be scripted; see the first sketch after this list)
  8. Delete all references to hive-metastore-service from your Tiltfile
  9. Within test-project-pipelines/test-project-data-access/src/main/resources/application.properties, set quarkus.datasource.jdbc.url to jdbc:hive2://spark-infrastructure-sts-service:10001/default;transportMode=http;httpPath=cliservice (see the sketch after this list)
  10. Within test-project-pipelines/pyspark-persist/src/pyspark_persist/step/persist_data.py, define the implementation for execute_step_impl as follows:
    def execute_step_impl(self) -> None:
        from ..record.custom_record import CustomRecord
        from ..schema.custom_record_schema import CustomRecordSchema

        # Build two sample records to persist
        custom_record = CustomRecord.from_dict({"customField": "foo"})
        record2 = CustomRecord.from_dict({"customField": "bar"})
        df = self.spark.createDataFrame(
            [
                custom_record,
                record2
            ],
            CustomRecordSchema().struct_type
        )
        # Write the DataFrame out through the step's configured "hive" persist type
        self.save_dataset(df, "my_new_table")
  11. Replace the contents of test-project-pipelines/pyspark-persist/src/pyspark_persist/resources/apps/pyspark-persist-dev-values.yaml with the following:
    sparkApp:
      spec:
        image: "test-project-spark-worker-docker:latest"
        sparkConf:
          spark.eventLog.enabled: "false"
          spark.sql.catalogImplementation: "hive"
          spark.eventLog.dir: "s3a://spark-infrastructure/spark-events"
          spark.hadoop.fs.s3a.endpoint: "http://s3-local:4566"
          spark.hadoop.fs.s3a.access.key: "123"
          spark.hadoop.fs.s3a.secret.key: "456"
          spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
          spark.hive.server2.thrift.port: "10000"
          spark.hive.server2.thrift.http.port: "10001"
          spark.hive.server2.transport.mode: "http"
          spark.hive.metastore.warehouse.dir: "s3a://spark-infrastructure/warehouse"
          spark.hadoop.fs.s3a.path.style.access: "true"
          spark.hive.server2.thrift.http.path: "cliservice"
          spark.hive.metastore.schema.verification: "false"
          spark.hive.metastore.uris: "thrift://hive-metastore-service:9083/default"
        driver:
          cores: 1
          memory: "2048m"
        executor:
          cores: 1
          memory: "2048m"
  12. Execute mvn clean install -Dmaven.build.cache.skipCache=true once.
  13. Use kubectl apply -f to apply the following yaml (or pipe it via a heredoc; see the sketch after this list):
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: spark-config
    data: {}
  14. To avoid an unrelated bug, open your Tiltfile and remove the entry for pipeline-invocation-service.
  15. Execute tilt up
  16. Once all resources are ready, trigger the pyspark-persist pipeline
  17. Use kubectl get pods | grep data-access to get the name of the data access pod.
  18. Use kubectl exec -it <DATA_ACCESS_POD_NAME> -- bash to enter the data access pod (steps 17 and 18 can be combined; see the sketch after this list).
  19. Execute curl -X POST localhost:8080/graphql -H "Content-Type: application/json" -d '{ "query": "{ CustomRecord(table: \"my_new_table\") { customField } }" }' and ensure that two records are returned, e.g.: {"data":{"CustomRecord":[{"customField":null},{"customField":null}]}}
  20. Note on step 19: if you don't get any values back, execute kubectl get svc | grep sts in a separate terminal; it can take a minute or two to provision the service (see the polling sketch after this list).
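
For convenience, the edits in steps 6 and 7 can be scripted. A minimal sketch, assuming GNU sed (on macOS, use sed -i ''); note the substitution is not idempotent, so run it only once:

    # Step 6: swap the v1 spark-infrastructure deploy chart for the v2 chart
    # (rerunning this would produce "-v2-v2"; it is a one-shot convenience)
    sed -i 's/aissemble-spark-infrastructure-deploy/aissemble-spark-infrastructure-deploy-v2/g' \
        test-project-deploy/pom.xml

    # Step 7: remove the v1 chart resources
    rm -rf test-project-deploy/src/main/resources/apps/spark-infrastructure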
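
The datasource change in step 9 amounts to a single line in application.properties. A sketch assuming the quarkus.datasource.jdbc.url key already exists in the file:

    # Point the data-access service at the Spark Thrift Server over HTTP transport
    sed -i 's|^quarkus.datasource.jdbc.url=.*|quarkus.datasource.jdbc.url=jdbc:hive2://spark-infrastructure-sts-service:10001/default;transportMode=http;httpPath=cliservice|' \
        test-project-pipelines/test-project-data-access/src/main/resources/application.properties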
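
The ConfigMap in step 13 can be applied without saving a file by piping a heredoc to kubectl:

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: spark-config
    data: {}
    EOF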
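
Steps 17 and 18 can be combined into a single command, assuming exactly one pod matches data-access:

    # -o name yields "pod/<name>", which kubectl exec accepts directly
    kubectl exec -it "$(kubectl get pods -o name | grep data-access)" -- bash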
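
For the note in step 20, a small poll loop saves re-checking by hand; the service name is taken from the JDBC URL in step 9:

    # Wait for the Spark Thrift Server service to be provisioned
    until kubectl get svc spark-infrastructure-sts-service >/dev/null 2>&1; do
        echo "waiting for spark-infrastructure-sts-service..."
        sleep 10
    done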

References/Additional Context

Cho-William commented 2 months ago

OTS completed

csun-cpointe commented 2 months ago

final test passed!!