awslabs / python-deequ

Python API for Deequ
Apache License 2.0

hasNumberOfDistinctValues and hasHistogramValues require arguments that should be optional? #81

Open thvasilo opened 2 years ago

thvasilo commented 2 years ago

Describe the bug
The docs and Scala code for hasNumberOfDistinctValues and hasHistogramValues indicate that the binningUdf and maxBins parameters should be optional, but the Python function definitions make them required.

To Reproduce
Steps to reproduce the behavior:

  1. Try to define a check with check.hasNumberOfDistinctValues('column_name', lambda x: x == 6)
  2. See error: TypeError: hasNumberOfDistinctValues() missing 2 required positional arguments: 'binningUdf' and 'maxBins'
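
The error in step 2 is ordinary Python behavior for parameters declared without defaults, so it can be reproduced without Spark. The sketch below uses a hypothetical stand-in class, StubCheck, that models only the two parameters named in the error message; it is not the pydeequ implementation.

```python
# Hypothetical stand-in for pydeequ's Check. Only the argument shape matters:
# binningUdf and maxBins are declared without default values.
class StubCheck:
    def hasNumberOfDistinctValues(self, column, assertion, binningUdf, maxBins):
        return self


check = StubCheck()
try:
    # Same call as in step 1: only the column and the assertion are passed.
    check.hasNumberOfDistinctValues("column_name", lambda x: x == 6)
    message = ""
except TypeError as err:
    message = str(err)

print(message)
```

Giving both parameters a default value in the method signature would make the call in step 1 valid.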

Expected behavior
I'd like to be able to call hasNumberOfDistinctValues and hasHistogramValues without specifying a binning function and maxBins.

brunoRenzo6 commented 2 years ago

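While these parameters are still defined as required, you can simply pass None for binningUdf. maxBins still has to be given a value (2 in this example).

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

# binningUdf is passed as None; maxBins is given explicitly (2 here).
check = Check(spark, CheckLevel.Warning, "test hasHistogramValues")
result = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check
        .hasHistogramValues("c_1", lambda x: x.apply("66").absolute() > 4500000, None, 2)
        .hasHistogramValues("c_2", lambda x: x.apply("22").ratio() > 0.5, None, 2)
    )
    .run()
)
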
thvasilo commented 2 years ago

Thanks @brunoRenzo6. Adding this to the methods' docs could help.
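
Until the signatures change, the workaround from this thread could be wrapped in a small helper that supplies the two arguments for you. This is only a sketch: the helper name, its parameter names, and DEFAULT_MAX_BINS are hypothetical and not part of pydeequ, and the value 1000 is an assumed placeholder that should be checked against the Deequ documentation before use.

```python
# Assumption: placeholder bin limit for this sketch; verify against the Deequ docs.
DEFAULT_MAX_BINS = 1000


def has_number_of_distinct_values(check, column, assertion,
                                  binning_udf=None, max_bins=DEFAULT_MAX_BINS):
    """Call check.hasNumberOfDistinctValues, filling in binningUdf and maxBins.

    `check` is expected to be a pydeequ Check (or any object exposing the same
    method); the positional order matches the workaround shown in this thread.
    """
    return check.hasNumberOfDistinctValues(column, assertion, binning_udf, max_bins)
```

With a real Check, the call from the bug report would then become has_number_of_distinct_values(check, 'column_name', lambda x: x == 6).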