Closed — joshuazexter closed this 4 months ago
Can we add a unit test that shows the usage of this analyzer along with other analyzers? See ColumnProfilerRunner and this readme.
Great PR description! Can you also add the output of the println statements?
This pull request introduces the CustomAggregator, a tool for dynamic data aggregation based on user-specified conditions within Apache Spark DataFrames. It can perform customized metric calculations and aggregations, making it applicable wherever conditional data aggregation is required.
Core Features:
How It Can Be Used: To use the CustomAggregator, developers will need to:
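The typical flow can be sketched end to end. This is a minimal sketch, not part of the PR itself: the lambda name (`rowShareLambda`), the column names, and the metric/instance labels are illustrative, and it assumes the `CustomAggregator` / `AggregatedMetricState` API used in the examples below.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.count

// 1. Define a lambda that maps a DataFrame to an AggregatedMetricState
//    (hypothetical example: row counts per category; "category" is an assumed column)
val rowShareLambda: DataFrame => AggregatedMetricState = df => {
  val counts = df.groupBy("category")
    .agg(count("*").alias("n"))
    .collect()
    .map(row => row.getString(0) -> row.getLong(1).toInt)
    .toMap
  AggregatedMetricState(counts, counts.values.sum)
}

// 2. Construct the analyzer with the lambda, a metric name, and an instance tag
val analyzer = CustomAggregator(rowShareLambda, "RowShare", "AllCategories")

// 3. Compute the state from a DataFrame, then the metric from the state
val state = analyzer.computeStateFrom(data)
val metric = analyzer.computeMetricFrom(state)
```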
Usage Examples: Included in the pull request are unit tests that demonstrate potential use cases:
Content Engagement Metrics:
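The snippet below references `contentEngagementLambda`, whose definition is not shown in this description. As a rough sketch of what such a lambda might look like, mirroring the `resourceUtilizationLambda` pattern in the second example, one could write (column names and the engagement formula are assumptions, not taken from the PR):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Hypothetical definition: per-content-type share of total engagement
val contentEngagementLambda: DataFrame => AggregatedMetricState = df => {
  val engagement = df.groupBy("content_type")
    .agg((sum("views") + sum("likes") + sum("shares")).cast("int").alias("engagement"))
    .collect()
    .map(row => row.getString(0) -> row.getInt(1))
    .toMap
  AggregatedMetricState(engagement, engagement.values.sum)
}
```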
```scala
val analyzer = CustomAggregator(contentEngagementLambda, "ContentEngagement", "AllPlatforms")

val data = session.read.format("csv").option("header", "true").load("path_to_data_file")
val state = analyzer.computeStateFrom(data)
val metric = analyzer.computeMetricFrom(state)

println("Content Engagement Metrics: " + metric.value.get)
// Content Engagement Metrics: Map(Video -> 0.81, Article -> 0.18)
```
Resource Utilization Metrics:

```scala
val resourceUtilizationLambda: DataFrame => AggregatedMetricState = df => {
  val totalResources = df.groupBy("service_type")
    .agg(
      ((sum("cpu_hours") + sum("memory_gbs") + sum("storage_gbs")).cast("int") / df.count())
        .alias("percentageResources")
    )
    .collect()
    .map(row => row.getString(0) -> row.getDouble(1))
    .toMap
  val totalSum = totalResources.values.sum
  // Build the state from the computed per-service map (not from the lambda itself)
  AggregatedMetricState(totalResources, totalSum.toInt)
}
```
```scala
val analyzer = CustomAggregator(resourceUtilizationLambda, "ResourceUtilization", "CloudServices")

val data = session.read.format("csv").option("header", "true").load("path_to_usage_data_file")
val state = analyzer.computeStateFrom(data)
val metric = analyzer.computeMetricFrom(state)

println("Resource Utilization Metrics: " + metric.value.get)
// Resource Utilization Metrics: Map(Compute -> 0.51, Database -> 0.27, Storage -> 0.21)
```