ShopRunner / baleen

Kotlin DSL for validating data (JSON, XML, CSV, Avro)
BSD 3-Clause "New" or "Revised" License

Support tests against entire datasets #217

Open kdallmeyer-sr opened 3 years ago

kdallmeyer-sr commented 3 years ago

A common data validation task is to compare summary statistics. For example, checking whether the average price of a product is near a defined mean, or falls within a defined standard-deviation range. Likewise, someone might want to compare averages across two attributes.

Typically this is done in SQL after the data is at rest. It would be great if Baleen supported out-of-the-box statistical aggregations such as sum, mean, variance, standard deviation, min, max, and count. Likewise, Baleen should support custom aggregations that can run across large datasets.

"Product".describeAs {
    "sku".type(StringType())

    "price".type(FloatType(min = 0.0f))

    "salePrice".type(FloatType(min = 0.0f)).describe {
        // Warn when the sale discount is more than $5
        it.test { dataTrace, data ->
            if (data["price"].toFloat() - data["salePrice"].toFloat() > 5f) {
                sequenceOf(ValidationWarning(dataTrace, "sale price is more than 5 dollars below price"))
            } else {
                sequenceOf()
            }
        }
    }

    // Error when the mean price is higher than $50.
    dataSetTest { dataTrace, dataSet ->
        if (dataSet["price"].mapNotNull { it.toFloatOrNull() }.average() > 50.0) {
            sequenceOf(ValidationError(dataTrace, "average price is higher than 50 dollars", null))
        } else {
            sequenceOf()
        }
    }

    // Test that the difference between price and sale price is on average less than $5.
    // This differs from the attribute test: a large discount is expected sometimes, just not on average.
    dataSetTest { dataTrace, dataSet ->
        if (dataSet.map { dataRow -> dataRow["price"].toFloat() - dataRow["salePrice"].toFloat() }.average() > 5.0) {
            sequenceOf(ValidationError(dataTrace, "sale discount is on average higher than 5 dollars", null))
        } else {
            sequenceOf()
        }
    }
}

Additionally, I think it would be interesting to provide a utility function that calculates summary statistics across the entire dataset. Something like:

dataSetTest { dataTrace, dataSet ->
    val summary = dataSet.summary()
    summary.forEach { attrSummary ->
        println("""
         # Numeric Summaries when attribute is a numeric value
        Average: ${attrSummary.average}
        Std Dev: ${attrSummary.stdDev}
        Count: ${attrSummary.count}
        Sum: ${attrSummary.sum}
        Min: ${attrSummary.min}
        Max: ${attrSummary.max}

        # String Summaries when attribute is a string
        MinLength: ${attrSummary.minLength}
        MaxLength: ${attrSummary.maxLength}
        AvgLength: ${attrSummary.avgLength}
        CountDistinct: ${attrSummary.countDistinct}

        # Occurrence Summaries when attribute is an occurrence
        MinOccurrences: ${attrSummary.minOccurrences}
        MaxOccurrences: ${attrSummary.maxOccurrences}
        AvgOccurrences: ${attrSummary.avgOccurrences}
        """.trimIndent())
    }
}
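For the numeric case, such a summary can be computed in a single streaming pass, so the whole dataset never has to fit in memory. A minimal sketch in plain Kotlin (independent of any Baleen API; the `NumericSummary` name and fields are hypothetical, chosen to mirror the sketch above), using Welford's online algorithm for the variance:

```kotlin
import kotlin.math.sqrt

// Hypothetical numeric summary; field names mirror the sketch above.
data class NumericSummary(
    val count: Long,
    val sum: Double,
    val average: Double,
    val stdDev: Double,
    val min: Double,
    val max: Double
)

// Single-pass aggregation: Welford's online algorithm keeps a running mean
// and sum of squared deviations (m2), so variance needs no second pass.
fun summarize(values: Sequence<Double>): NumericSummary {
    var count = 0L
    var sum = 0.0
    var mean = 0.0
    var m2 = 0.0
    var min = Double.POSITIVE_INFINITY
    var max = Double.NEGATIVE_INFINITY
    for (v in values) {
        count++
        sum += v
        val delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
        if (v < min) min = v
        if (v > max) max = v
    }
    val variance = if (count > 1) m2 / count else 0.0
    return NumericSummary(count, sum, mean, sqrt(variance), min, max)
}

fun main() {
    val s = summarize(sequenceOf(10.0, 20.0, 30.0))
    println(s.average) // 20.0
    println(s.min)     // 10.0
    println(s.max)     // 30.0
}
```

The same shape would work for the string summaries by feeding in lengths instead of values.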
kdallmeyer-sr commented 3 years ago

An idea to solve this:

Collect these tests in their own sequence or list, returned to the topmost data description. Then, if any are present, the validator builds an internal map from data trace to summary statistic as it reads each data element. For simplicity, the summary map would be one big flat map. At the end, the dataset tests are applied against the summary map.
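A rough, self-contained sketch of that accumulation step in plain Kotlin (all names hypothetical; this is not the actual Baleen internals):

```kotlin
// Hypothetical running aggregate held per data-trace key.
class RunningSummary {
    var count = 0L
    var sum = 0.0
    fun add(value: Double) { count++; sum += value }
    val average get() = if (count == 0L) 0.0 else sum / count
}

fun main() {
    // One giant flat map: data-trace key (e.g. "Product.price") -> running summary.
    val summaries = mutableMapOf<String, RunningSummary>()

    // Simulated stream of rows; in Baleen these would arrive via the validator.
    val rows = listOf(
        mapOf("price" to 40.0, "salePrice" to 38.0),
        mapOf("price" to 60.0, "salePrice" to 52.0)
    )
    for (row in rows) {
        for ((attr, value) in row) {
            summaries.getOrPut("Product.$attr") { RunningSummary() }.add(value)
        }
    }

    // At the end of the dataset, the dataSetTests run against the summary map.
    val avgPrice = summaries.getValue("Product.price").average
    println(avgPrice) // 50.0
}
```

Because the map is keyed by data trace rather than by row, per-row identifiers (like the CSV row number) must not be part of the key, which motivates the validator change below.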

This would mean the CSV validator has to be modified so that the row number is no longer part of the data trace. Instead it should be a tag, as in the JSON and XML validators.