Booz Allen's lean manufacturing approach for holistically designing, developing and fielding AI solutions across the engineering lifecycle from data processing to model building, tuning, and training to secure operational deployment
Feature: Update data access tooling to better support distributed querying of big data #475
Data access currently uses a GraphQL Quarkus app to access data outside of your Spark pipeline. GraphQL is not optimized for querying large datasets stored in data lakes. For better performance when accessing data lake data, GraphQL should be replaced with a tool designed specifically for querying large data lakes (e.g., Trino).
DOD
[x] Implement Trino as deploy profile option for data access
[x] Create baseline Helm chart using the official Trino Helm chart as the parent
[x] Should include defaults for configuring the chart to use the hive connector
[x] Include helm chart unit tests
[x] When enabled, generate a deploy resource dependent on the aiSSEMBLE Trino Helm Chart
[x] Update antora docs to detail new data access option
[x] Current data access becomes a drop down with GraphQL and Trino pages
[x] Remove GraphQL antora docs and deprecate the fermenter profiles
Test Strategy/Script
OTS Only:
Within the aiSSEMBLE repo, run the following and verify it builds successfully:
Add the attached SparkPipeline.json to the test-475-pipeline-models/src/main/resources/pipelines/ directory
Add the attached PersonDictionary.json to the test-475-pipeline-models/src/main/resources/dictionaries/ directory
Add the attached Person.json to the test-475-pipeline-models/src/main/resources/records/ directory
Run mvn clean install until all the manual actions are complete
Add the following execution to the test-475-deploy/pom.xml:
<execution>
<id>trino</id>
<phase>generate-sources</phase>
<goals>
<goal>generate-sources</goal>
</goals>
<configuration>
<basePackage>com.test</basePackage>
<profile>data-access-trino-deploy-v2</profile>
<!-- The property variables below are passed to the Generation Context and utilized
to customize the deployment artifacts. -->
<propertyVariables>
<appName>trino</appName>
</propertyVariables>
</configuration>
</execution>
Add the following to the test-475-pipelines/spark-pipeline/src/main/java/com/test/TestSyncStep.java:
- `tilt down`
- Remove the following from `test-475-pipeline-models/src/main/resources/records/Person.json` on lines 5-7:
"dataAccess": {
"enabled": "false"
},
- Build the project once with `mvn clean install -Dmaven.build.cache.skipCache` and complete the manual actions
- Build the project once with `mvn clean install` and verify you see the following warnings about data-access deprecation:
/your/path/test-475/test-475-docker/test-475-data-access-docker/pom.xml:
The profile 'aissemble-data-access-docker' is deprecated, please replace all references to it.
/your/path/devRepos/test-475/test-475-deploy/pom.xml:
The profile 'data-access-deploy-v2' is deprecated, please replace all references to it.
Test Strategy/Script
OTS Only:
Create a downstream project:
Add the attached SparkPipeline.json to the test-475-pipeline-models/src/main/resources/pipelines/ directory
Add the attached PersonDictionary.json to the test-475-pipeline-models/src/main/resources/dictionaries/ directory
Add the attached Person.json to the test-475-pipeline-models/src/main/resources/records/ directory
Run mvn clean install until all the manual actions are complete
Add the following execution to the test-475-deploy/pom.xml:
Add the following to the test-475-pipelines/spark-pipeline/src/main/java/com/test/TestSyncStep.java:
...
// TODO: Add your business logic here for this step!
logger.error("Implement executeStepImpl(..) or remove this pipeline step!");
logger.info("Saving Person to table People");
Person person = new Person();
person.setName("John Smith");
person.setAge(50);
PersonSchema personSchema = new PersonSchema();
List<Row> rows = Stream.of(person).map(PersonSchema::asRow).toList();
Dataset<Row> dataset = sparkSession.createDataFrame(rows, personSchema.getStructType());
saveDataset(dataset, "People");
logger.info("Completed saving to table People");
}
Run mvn clean install -Dmaven.build.cache.skipCache to get any remaining manual actions
OTS Only: The project will fail to build because the new Helm chart has not been published yet
Update the test-475-deploy/src/main/resources/apps/trino/Chart.yaml, replacing the first repository line below with the second:
repository: oci://ghcr.io/boozallen
repository: file://../../../../../../../aissemble/extensions/extensions-helm/aissemble-trino-chart
Continue the build with mvn clean install -Dmaven.build.cache.skipCache -rf :test-475-deploy
Complete the manual actions and run tilt up
Once all the resources are ready on the Tilt UI, start the spark-pipeline resource
Verify you see the following log output in the pipeline:
Connect to Trino using the cli:
./trino --server http://localhost:8084
Run the following command to query the data:
Verify you get the following output:
Query 20241122_143943_00000_c3nss, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
2.65 [1 rows, 14B] [0 rows/s, 5B/s]
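The CLI check above could also be done programmatically. The following is a minimal sketch, not part of the issue's test script, assuming the Trino JDBC driver (`io.trino:trino-jdbc`) is on the classpath and that the data is reachable through a catalog named `hive` and a schema named `default`. The catalog, schema, user name `test`, and the class and method names are all hypothetical. The SQL is taken from the command line because the query used in the issue is not shown here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class TrinoQuerySketch {

    /** Builds a Trino JDBC URL of the form jdbc:trino://host:port/catalog/schema. */
    public static String jdbcUrl(String host, int port, String catalog, String schema) {
        return "jdbc:trino://" + host + ":" + port + "/" + catalog + "/" + schema;
    }

    /** Runs the given SQL and prints each result row as tab-separated values. */
    public static void printQuery(String url, String user, String sql) throws SQLException {
        Properties props = new Properties();
        // The Trino JDBC driver requires a user name, even without authentication.
        props.setProperty("user", user);
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    if (i > 1) {
                        row.append('\t');
                    }
                    row.append(rs.getString(i));
                }
                System.out.println(row);
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // Port 8084 matches the CLI command above; catalog and schema are assumptions.
        String url = jdbcUrl("localhost", 8084, "hive", "default");
        // args[0] is the SQL statement to run.
        printQuery(url, "test", args[0]);
    }
}
```

Running `main` needs the Tilt deployment to be up; `jdbcUrl` can be exercised on its own.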
/your/path/test-475/test-475-pipelines/test-475-data-access/pom.xml:
Data Access using GraphQL is deprecated, please see the latest documentation for details on using Trino for Data Access: https://boozallen.github.io/aissemble/aissemble/current/data-access-details.html
/your/path/test-475/test-475-docker/test-475-data-access-docker/pom.xml:
The profile 'aissemble-data-access-docker' is deprecated, please replace all references to it.
/your/path/devRepos/test-475/test-475-deploy/pom.xml:
The profile 'data-access-deploy-v2' is deprecated, please replace all references to it.
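When checking the deprecation warnings above by hand becomes tedious, captured build output can be scanned for them. This is a rough sketch, assuming the warning text has exactly the wording shown above. The class and method names are hypothetical and are not part of aiSSEMBLE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeprecationWarningCheck {

    // Matches warnings such as:
    // The profile 'data-access-deploy-v2' is deprecated, please replace all references to it.
    private static final Pattern DEPRECATED_PROFILE =
            Pattern.compile("The profile '([^']+)' is deprecated");

    /** Returns the profile names reported as deprecated, in order of appearance. */
    public static List<String> deprecatedProfiles(String buildOutput) {
        List<String> profiles = new ArrayList<>();
        Matcher matcher = DEPRECATED_PROFILE.matcher(buildOutput);
        while (matcher.find()) {
            profiles.add(matcher.group(1));
        }
        return profiles;
    }

    public static void main(String[] args) {
        // Sample text copied from the warnings listed in this issue.
        String sample = String.join("\n",
                "The profile 'aissemble-data-access-docker' is deprecated, please replace all references to it.",
                "The profile 'data-access-deploy-v2' is deprecated, please replace all references to it.");
        System.out.println(deprecatedProfiles(sample));
    }
}
```

The same method can be pointed at a saved `mvn clean install` log to confirm that both deprecated profiles are reported.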