awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Enable Custom Analyzer in python by implementing pydeequ.analyzers._AnalyzerObject #186

Open WiktorMadejski opened 6 months ago

WiktorMadejski commented 6 months ago

Is your feature request related to a problem? Please describe. I would like pydeequ to enable, and show an example of, extending pydeequ.analyzers._AnalyzerObject to define a custom analyzer in Python.

Describe the solution you'd like I would like to be able to implement:

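from pydeequ.analyzers import _AnalyzerObject


class MyCustomAnalyzer(_AnalyzerObject):
    """Compute a custom metric over a column."""

    def __init__(self, column, my_property: str = None):
        """
        :param str column: column to analyze.
        :param str my_property: custom property
        """
        self.column = column
        self.my_property = my_property

    # Proposed hook, not available in pydeequ today. AnalyzerInput and
    # AnalyzerOutput are placeholder types that pydeequ would need to define.
    # Written as a plain method because a property getter cannot take an argument.
    def _analyzer_jvm(self, foo: "AnalyzerInput") -> "AnalyzerOutput":
        # Custom transformation from a well-defined AnalyzerInput to an AnalyzerOutput.
        bar: "AnalyzerOutput" = ...
        return bar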

and then run it in a VerificationSuite, e.g.:

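from pydeequ.anomaly_detection import OnlineNormalStrategy
from pydeequ.repository import ResultKey
from pydeequ.verification import VerificationSuite

# spark, df and repository are assumed to be set up already.
results = (
    VerificationSuite(spark)
    .onData(df)
    .useRepository(repository)
    .saveOrAppendResult(ResultKey(spark, ResultKey.current_milli_time(), {'tag': 'my-tag'}))
    .addAnomalyCheck(
        OnlineNormalStrategy(
            lowerDeviationFactor=0.01,
            upperDeviationFactor=0.01,
            ignoreStartPercentage=0.1,
            ignoreAnomalies=False,
        ),
        MyCustomAnalyzer("column_name", my_property="yeey!"),
    )
    .run()
)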
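To make the request more concrete, here is a minimal plain-Python sketch of the kind of extension point being asked for. It does not use pydeequ or Spark. `CustomAnalyzer`, `compute_metric` and `SumPerDistinct` are hypothetical names invented for illustration, and rows are modelled as plain dicts rather than a DataFrame.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Mapping, Optional


class CustomAnalyzer(ABC):
    """Hypothetical base class for Python-side analyzers; NOT part of pydeequ."""

    def __init__(self, column: str):
        self.column = column

    @abstractmethod
    def compute_metric(self, rows: Iterable[Mapping[str, object]]) -> Optional[float]:
        """Reduce the rows to a single metric value, or None if it is undefined."""


class SumPerDistinct(CustomAnalyzer):
    """Example custom metric: sum of a column divided by its distinct count."""

    def compute_metric(self, rows):
        # Skip rows where the column is missing or null.
        values = [row[self.column] for row in rows if row.get(self.column) is not None]
        distinct = len(set(values))
        if distinct == 0:
            return None  # the metric is undefined for an empty column
        return sum(values) / distinct


rows = [{"amount": 10}, {"amount": 10}, {"amount": 20}, {"amount": None}]
metric = SumPerDistinct("amount").compute_metric(rows)
print(metric)
```

An anomaly check would then only need the single number returned by `compute_metric`, which is the same thing the built-in analyzers provide.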

Describe alternatives you've considered When calculating anomalies, every time I have a custom metric (for example, Sum() / CountDistinct()), I build a temporary table that has one row, e.g.:

| value_unique_name                  |
|------------------------------------|
| <value of Sum() / CountDistinct()> |

and then run the anomaly check over pydeequ.analyzers.Sum (or Mean, i.e. any analyzer that acts as the identity on a single value). It's best if those custom metrics have a pydeequ metrics repository separate from the source table's.
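The workaround relies on Sum (or Mean) over a one-row table returning that row's value unchanged. Below is a minimal plain-Python sketch of the idea, with no Spark or pydeequ involved. `one_row_table`, `column_sum` and `column_mean` are hypothetical helpers standing in for the temporary table and for pydeequ.analyzers.Sum / Mean, and the list `custom_metric_history` stands in for a dedicated metrics repository.

```python
from statistics import mean


def one_row_table(value, column="value_unique_name"):
    """Stand-in for the temporary one-row table holding the custom metric."""
    return [{column: value}]


def column_sum(table, column):
    """Stand-in for a Sum analyzer over one column."""
    return sum(row[column] for row in table)


def column_mean(table, column):
    """Stand-in for a Mean analyzer over one column."""
    return mean(row[column] for row in table)


# Compute the custom metric up front, here Sum() / CountDistinct().
values = [10, 10, 20]
custom_metric = sum(values) / len(set(values))

# Wrap it in a one-row table; Sum and Mean both hand the value back unchanged.
table = one_row_table(custom_metric)
via_sum = column_sum(table, "value_unique_name")
via_mean = column_mean(table, "value_unique_name")

# Keep custom metrics in their own history, separate from the source table's.
custom_metric_history = []
custom_metric_history.append(via_sum)
```

Because both aggregations are the identity on a single row, the anomaly strategy ends up seeing the custom metric itself rather than a statistic of the source table.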

Additional context If anybody has hacked this in a better way than the one described under "Describe alternatives you've considered", let us know in the comments!