databricks / reference-apps

Spark reference applications

Improvement to the Log Analyzer SQL example #77

Open evarga opened 8 years ago

evarga commented 8 years ago

The following code can be improved by better leveraging SQL:

// Calculate statistics based on the content size.
Tuple4<Long, Long, Long, Long> contentSizeStats =
    sqlContext.sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM logs")
        .map(row -> new Tuple4<>(row.getLong(0), row.getLong(1), row.getLong(2), row.getLong(3)))
        .first();
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
    contentSizeStats._1() / contentSizeStats._2(),
    contentSizeStats._3(),
    contentSizeStats._4()));

Namely, SQL already supports calculating an average via the AVG function. Besides being shorter, this also avoids the integer truncation caused by dividing the two Longs (`contentSizeStats._1() / contentSizeStats._2()`), since AVG returns a Double. The improved code snippet may look as follows:

// Calculate statistics based on the content size.
Tuple3<Double, Long, Long> contentSizeStats =
    sqlContext.sql("SELECT AVG(contentSize), MIN(contentSize), MAX(contentSize) FROM logs")
        .map(row -> new Tuple3<>(row.getDouble(0), row.getLong(1), row.getLong(2)))
        .first();
System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
    contentSizeStats._1(),
    contentSizeStats._2(),
    contentSizeStats._3()));
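To see why pushing the average into SQL matters beyond brevity: the original snippet divides two Longs, which truncates the fractional part, whereas AVG yields a Double. A minimal plain-Java sketch (using a hypothetical list of content sizes in place of the `logs` table) illustrates the difference:

```java
import java.util.List;

public class AvgDemo {
    public static void main(String[] args) {
        // Hypothetical content sizes, standing in for the contentSize column.
        List<Long> sizes = List.of(10L, 15L);

        long sum = sizes.stream().mapToLong(Long::longValue).sum();
        long count = sizes.size();

        // Manual average via long division truncates the fraction...
        long truncated = sum / count;

        // ...while SQL's AVG behaves like a double-valued average:
        double avg = sizes.stream().mapToLong(Long::longValue).average().orElse(0);

        System.out.println(truncated); // prints 12
        System.out.println(avg);       // prints 12.5
    }
}
```

With the sample sizes above, the manual division reports 12 while the true average is 12.5, which is exactly what the AVG-based query would return.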