awslabs / python-deequ

Python API for Deequ
Apache License 2.0
Distance analyzer for detecting feature drift with PyDeequ #164

Open thuber opened 8 months ago

thuber commented 8 months ago

Is your feature request related to a problem? Please describe.

It looks like the Deequ distance analyzer is not available in PyDeequ 1.1.1, so this type of analyzer cannot be run via PyDeequ.

Describe the solution you'd like

I'd like to run the distance analysis on my data so that I can detect feature drift via the two methods currently available in Deequ (L-infinity and chi-squared), in the same way as with the other analyzers.

Are there any plans to add this analyzer in the future?

Describe alternatives you've considered

I've read the documentation, but couldn't find anything related to the distance analyzer.

As a hacky workaround, I experimented with invoking the numericalDistance() method directly through Py4J, but found that instantiating a Scala object of type QuantileNonSample[Double] is not straightforward from Python.

chenliu0831 commented 8 months ago

Marking this as a feature request. In the short term, we are focusing on Spark upgrades and bug fixes. Next year we may start looking at closer parity.

chenliu0831 commented 8 months ago

Also, it shouldn't be too hard to add a new analyzer. See the existing pattern here: https://github.com/awslabs/python-deequ/blob/e74e974e739f24cccb827a6378cd50e97697a0e8/pydeequ/analyzers.py#L191

peixeirodata commented 8 months ago

Maybe I can work on this as a way to start contributing to the library. @thuber (or maybe @chenliu0831), can you give me an example of using the Deequ distance analyzer?

thuber commented 8 months ago

Hi @peixeirodata, from what I can tell, usage of the distance analyzer isn't particularly well documented. The most useful reference I found is actually the unit tests for the class.

Some Pointers to Start with

If you want to look into it, here is what I learned while investigating.

As it turns out, Distance.scala doesn't follow the same pattern as the other analyzers: it just defines a regular Scala object that doesn't inherit any traits.

The method I would want to invoke for detecting feature drift of non-categorical features is numericalDistance(). This function has the following signature:

def numericalDistance(
    sample1: QuantileNonSample[Double],
    sample2: QuantileNonSample[Double],
    correctForLowNumberOfSamples: Boolean = false,
    alpha: Option[Double] = None
): Double
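As a rough intuition for what this method returns: the L-infinity distance between two numeric samples is the largest absolute gap between their empirical CDFs. The sketch below computes that in plain Python on raw values. It is only an illustration under that assumption. `linf_distance` is a hypothetical helper, not part of Deequ or PyDeequ, and Deequ works on quantile sketches with the optional correction parameters shown above, so its results may differ.

```python
from bisect import bisect_right


def linf_distance(reference, current):
    """Largest absolute gap between the empirical CDFs of two samples.

    Hypothetical helper for illustration; not Deequ's implementation.
    """
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")

    gap = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in set(ref) | set(cur):
        f_ref = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        f_cur = bisect_right(cur, x) / len(cur)  # share of current values <= x
        gap = max(gap, abs(f_ref - f_cur))
    return gap


# Half of the probability mass has moved above the reference range.
print(linf_distance([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value of 0.0 means the two empirical distributions coincide, and 1.0 means their value ranges do not overlap at all.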

So in order to invoke this method, we would need to create two instances of QuantileNonSample[Double] from Python, one for the reference data set and one for the current data set. The signature of the class constructor is:

class QuantileNonSample[T](
    var sketchSize: Int,
    var shrinkingFactor: Double = 0.64)
    (implicit ordering: Ordering[T], ct: ClassTag[T])
  extends Serializable

This is where I stopped, because the way I read it, we would have to instantiate two instances of this Scala class from Python using the Py4J library. The challenges I see are:

  1. How to instantiate a Scala class that uses generics using Py4J
  2. How to pass implicit parameters
  3. How to obtain and pass the ordering and class tag of the type parameter

These questions may have obvious answers, but I am by no means an expert on Py4J and can't easily figure them out, especially since Py4J doesn't seem to have been designed to support more advanced Scala concepts like class tags.
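One possible way to approach the three questions above, as an untested sketch. On the JVM the type parameter is erased, and an implicit parameter list generally compiles to extra trailing constructor arguments, so from Py4J the constructor would take four plain arguments. A Scala `object` is usually reachable through the static `MODULE$` field of the class whose name ends in `$`. The helper names `scala_object` and `new_double_sketch` are hypothetical, the package path `com.amazon.deequ.analyzers` and the Scala 2.12 class name `scala.math.Ordering$Double$` are assumptions, and none of this has been run against a Deequ jar.

```python
def scala_object(jvm, binary_name):
    """Return the singleton behind a Scala `object` (hypothetical helper).

    `binary_name` is the JVM name without the trailing `$`, for example
    "scala.reflect.ClassTag". getattr is used because `$` cannot appear
    in a Python attribute expression.
    """
    node = jvm
    for part in (binary_name + "$").split("."):
        node = getattr(node, part)
    return getattr(node, "MODULE$")


def new_double_sketch(jvm, sketch_size, shrinking_factor=0.64):
    """Try to build a QuantileNonSample[Double] through a Py4J JVM view.

    Assumptions: the class lives in com.amazon.deequ.analyzers, and the
    implicit Ordering and ClassTag are the last two constructor arguments.
    """
    # Question 3: the Ordering and ClassTag for Double are ordinary objects.
    ordering = scala_object(jvm, "scala.math.Ordering$Double")  # Scala 2.12 name
    class_tag = scala_object(jvm, "scala.reflect.ClassTag").Double()

    # Questions 1 and 2: generics are erased, implicits are passed explicitly.
    sketch_cls = jvm.com.amazon.deequ.analyzers.QuantileNonSample
    return sketch_cls(sketch_size, shrinking_factor, ordering, class_tag)
```

With PySpark, `jvm` would typically be `spark._jvm` from a session that has the Deequ jar on its classpath. Scala default arguments are not applied when calling from Py4J, which is why `shrinking_factor` is always passed explicitly here.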

peixeirodata commented 8 months ago

Hi @thuber, thank you for your observations! I have started implementing numericalDistance and categoricalDistance here (feel free to tell me your thoughts about it).

Since the Distance class in Scala is quite different from the other analyzers, I'd be glad if someone could provide an example of calling these methods, so that I can follow the same approach when testing the Python version.

Edit: I see your point about looking at the unit tests. I now understand that Distance is not used with the addAnalyzer method. I'll test my implementation locally. Thanks!
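For checking a Python port against known values, a plain-Python reference may help. The sketch below computes the L-infinity distance between two categorical distributions given as category counts, that is, the largest absolute difference in category proportions. `categorical_linf_distance` is a hypothetical helper and not Deequ's implementation, so whether Deequ's categoricalDistance returns exactly these values should be verified against its unit tests.

```python
def categorical_linf_distance(reference_counts, current_counts):
    """Largest absolute difference in category proportions.

    Both arguments map category name -> count. Hypothetical helper for
    illustration; not Deequ's implementation.
    """
    n_ref = sum(reference_counts.values())
    n_cur = sum(current_counts.values())
    if n_ref == 0 or n_cur == 0:
        raise ValueError("both samples must contain at least one observation")

    # A category missing from one side counts as proportion 0 there.
    categories = set(reference_counts) | set(current_counts)
    return max(
        abs(reference_counts.get(c, 0) / n_ref - current_counts.get(c, 0) / n_cur)
        for c in categories
    )


# Proportions flip from 75/25 to 25/75, so the largest gap is 0.5.
print(categorical_linf_distance({"a": 3, "b": 1}, {"a": 1, "b": 3}))  # 0.5
```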

chenliu0831 commented 7 months ago

@peixeirodata feel free to open a draft PR for discussion