ShopRunner / baleen

Kotlin DSL for validating data (JSON, XML, CSV, Avro)
BSD 3-Clause "New" or "Revised" License

Support tests against entire datasets #217

Open kdallmeyer-sr opened 3 years ago

kdallmeyer-sr commented 3 years ago

A common data validation task is to compare summary statistics. For example, checking whether the average price of a product is near a defined mean, or falls within a defined standard-deviation range. Likewise, someone might want to compare averages across two attributes.

Typically this is done in SQL after the data is at rest. It would be great if Baleen supported out-of-the-box statistical aggregations such as sum, mean, variance, standard deviation, min, max, and count. Likewise, Baleen should support custom aggregations that can run across large datasets.

"Product".describeAs {
    "sku".type(StringType())

    "price".type(FloatType(min = 0.0f))

    "salePrice".type(FloatType(min = 0.0f)).describe {
        // Warn when the sale discount is more than $5
        it.test { dataTrace, data ->
            if (data["price"].toFloat() - data["salePrice"].toFloat() > 5f) {
                sequenceOf(ValidationWarning(dataTrace, "sale price is more than 5 dollars below price"))
            } else {
                sequenceOf()
            }
        }
    }

    // Error when the mean price is higher than $50.
    dataSetTest { dataTrace, dataSet ->
        if (dataSet["price"].mapNotNull { it.toFloatOrNull() }.average() > 50.0) {
            sequenceOf(ValidationError(dataTrace, "average price is higher than 50 dollars", null))
        } else {
            sequenceOf()
        }
    }

    // Test that the difference between price and sale price is on average less than $5.
    // This differs from the attribute test: a large discount is expected sometimes, just not on average.
    dataSetTest { dataTrace, dataSet ->
        if (dataSet.map { dataRow -> dataRow["price"].toFloat() - dataRow["salePrice"].toFloat() }.average() > 5.0) {
            sequenceOf(ValidationError(dataTrace, "sale discount is on average higher than 5 dollars", null))
        } else {
            sequenceOf()
        }
    }
}

Additionally, I think it would be interesting to provide a utility function that calculates summary statistics across the entire dataset. Something like:

dataSetTest { dataTrace, dataSet ->
    val summary = dataSet.summary()
    summary.forEach { attrSummary ->
        println("""
         # Numeric Summaries when attribute is a numeric value
        Average: ${attrSummary.average}
        Std Dev: ${attrSummary.stdDev}
        Count: ${attrSummary.count}
        Sum: ${attrSummary.sum}
        Min: ${attrSummary.min}
        Max: ${attrSummary.max}

        # String Summaries when attribute is a string
        MinLength: ${attrSummary.minLength}
        MaxLength: ${attrSummary.maxLength}
        AvgLength: ${attrSummary.avgLength}
        CountDistinct: ${attrSummary.countDistinct}

        # Occurrence Summaries when attribute is an occurrence
        MinOccurrences: ${attrSummary.minOccurrences}
        MaxOccurrences: ${attrSummary.maxOccurrences}
        AvgOccurrences: ${attrSummary.avgOccurrences}
        """.trimIndent())
    }
}
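For the numeric case, such a summary can be computed in a single streaming pass, so the whole dataset never has to fit in memory. A minimal sketch in plain Kotlin (independent of any Baleen API; the `NumericSummary` name and fields are hypothetical, chosen to mirror the sketch above), using Welford's online algorithm for the variance:

```kotlin
import kotlin.math.sqrt

// Hypothetical numeric summary; field names mirror the sketch above.
data class NumericSummary(
    val count: Long,
    val sum: Double,
    val average: Double,
    val stdDev: Double,
    val min: Double,
    val max: Double
)

// Single-pass aggregation: Welford's online algorithm keeps a running mean
// and sum of squared deviations (m2), so variance needs no second pass.
fun summarize(values: Sequence<Double>): NumericSummary {
    var count = 0L
    var sum = 0.0
    var mean = 0.0
    var m2 = 0.0
    var min = Double.POSITIVE_INFINITY
    var max = Double.NEGATIVE_INFINITY
    for (v in values) {
        count++
        sum += v
        val delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
        if (v < min) min = v
        if (v > max) max = v
    }
    val variance = if (count > 1) m2 / count else 0.0
    return NumericSummary(count, sum, mean, sqrt(variance), min, max)
}

fun main() {
    val s = summarize(sequenceOf(10.0, 20.0, 30.0))
    println(s.average) // 20.0
    println(s.min)     // 10.0
    println(s.max)     // 30.0
}
```

The same shape would work for the string summaries by feeding in lengths instead of values.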
kdallmeyer-sr commented 3 years ago

An idea to solve this:

Collect these tests in their own sequence or list, returned to the topmost data description. Then, if any are present, the validator builds an internal map from data trace to summary statistic as it reads each data element. For simplicity, the summary map would be one big flat map. At the end, the dataset tests are applied against the summary map.
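A rough, self-contained sketch of that accumulation step in plain Kotlin (all names hypothetical; this is not the actual Baleen internals):

```kotlin
// Hypothetical running aggregate held per data-trace key.
class RunningSummary {
    var count = 0L
    var sum = 0.0
    fun add(value: Double) { count++; sum += value }
    val average get() = if (count == 0L) 0.0 else sum / count
}

fun main() {
    // One giant flat map: data-trace key (e.g. "Product.price") -> running summary.
    val summaries = mutableMapOf<String, RunningSummary>()

    // Simulated stream of rows; in Baleen these would arrive via the validator.
    val rows = listOf(
        mapOf("price" to 40.0, "salePrice" to 38.0),
        mapOf("price" to 60.0, "salePrice" to 52.0)
    )
    for (row in rows) {
        for ((attr, value) in row) {
            summaries.getOrPut("Product.$attr") { RunningSummary() }.add(value)
        }
    }

    // At the end of the dataset, the dataSetTests run against the summary map.
    val avgPrice = summaries.getValue("Product.price").average
    println(avgPrice) // 50.0
}
```

Because the map is keyed by data trace rather than by row, per-row identifiers (like the CSV row number) must not be part of the key, which motivates the validator change below.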

This would mean the CSV validator has to be modified so that the row number is no longer part of the data trace. Instead it should be a tag, as in the JSON and XML validators.