Open thuber opened 8 months ago
Marking this as a feature request. In the short term, we are focusing on Spark upgrades and bug fixes. Next year we may start looking at closer feature parity with Deequ.
Also it shouldn't be too hard to add a new analyzer - see existing pattern here https://github.com/awslabs/python-deequ/blob/e74e974e739f24cccb827a6378cd50e97697a0e8/pydeequ/analyzers.py#L191
Maybe I can work on this as a way to start contributing to the library. @thuber (or maybe even @chenliu0831), can you give me an example that uses the Deequ distance analyzer?
Hi @peixeirodata, from what I can tell, usage of the distance analyzer isn't particularly well documented. The most useful reference I found is actually the unit tests for the class.
If you want to look into it, I can share some pointers based on what I learned while investigating.
As it turns out, the Distance.scala class doesn't follow the same pattern as the other analyzers, in that it just defines a regular Scala object that doesn't inherit any traits.
The method I would want to invoke for detecting feature drift of non-categorical features is numericalDistance(). This function has the following signature:
    def numericalDistance(
        sample1: QuantileNonSample[Double],
        sample2: QuantileNonSample[Double],
        correctForLowNumberOfSamples: Boolean = false,
        alpha: Option[Double] = None
    ): Double
So in order to invoke this method, we would need to create two instances of QuantileNonSample[Double] in Python, one for the reference data set and one for the current data set. The signature of the constructor of the class is:
    class QuantileNonSample[T](
        var sketchSize: Int,
        var shrinkingFactor: Double = 0.64)
        (implicit ordering: Ordering[T], ct: ClassTag[T]
    ) extends Serializable
Here is where I stopped looking into it, because the way I read it, we would have to somehow create two instances of this Scala class in Python using a library called Py4J. The challenges I see are:
These questions may have obvious answers, but I am by no means an expert on how to do the above using Py4J and am hence not able to easily figure out the answers, especially since Py4J doesn't seem to have been designed to support more advanced Scala concepts like class tags.
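For anyone who wants to sanity-check a future Python binding, here is a minimal pure-Python sketch of the quantity that numericalDistance() appears to compute: the L-infinity distance (largest absolute gap) between the cumulative distributions of the two samples. This is an assumption based on reading the Scala source, not Deequ's implementation: Deequ estimates the distributions from QuantileNonSample sketches, whereas this sketch uses the exact empirical CDFs and ignores the correctForLowNumberOfSamples and alpha parameters. The function name linf_distance is hypothetical.

```python
from bisect import bisect_right


def linf_distance(reference, current):
    """Largest absolute gap between the empirical CDFs of two numeric samples.

    Illustrative only: exact empirical CDFs, no sketching, no low-sample correction.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    ref = sorted(reference)
    cur = sorted(current)

    max_gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at an observed value.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        cdf_cur = bisect_right(cur, x) / len(cur)  # share of current values <= x
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap
```

A result of 0.0 means the two samples have identical empirical distributions, and 1.0 means they do not overlap at all. Values from the real Deequ method can differ slightly, because its sketches are approximate.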
Hi @thuber, thank you for your observations! I started working on an implementation of numericalDistance and categoricalDistance here (feel free to tell me your thoughts about it).
As this Distance class in Scala is quite different from the other analyzers, I'd be glad if someone could provide me an example of these methods just to try to use the same approach when testing the Python version.
Edit: I see your point about looking at the unit tests. I now understand that Distance is not used with the addAnalyzer method. I'll test my implementation locally. Thanks
@peixeirodata feel free to open a draft PR for discussion
Is your feature request related to a problem? Please describe. It looks like in PyDeequ 1.1.1 the Deequ distance analyzer is not available. It means that this type of analyzer cannot be run via PyDeequ.
Describe the solution you'd like I'd like to run the distance analysis on my data, so that I can detect feature drift via the 2 methods currently available in Deequ (L-infinity and chi-squared). I'd like to be able to do this in the same way as with the other analyzers.
Are there any plans to add this analyzer in the future?
Describe alternatives you've considered I've read the documentation, but couldn't find anything related to the distance analyzer.
As a hacky workaround, I experimented with invoking the numericalDistance() method directly with Py4J, but found that instantiating a Scala object of QuantileNonSample[Double] doesn't seem to be a straightforward thing to do in Python.
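As a stopgap that avoids Py4J entirely, drift on categorical features can be approximated from plain value counts (for example, counts collected from a groupBy). The sketch below is a hedged illustration of the two methods named above, not Deequ's code: the L-infinity variant takes the largest difference in relative frequency across categories, and the chi-squared variant is the textbook statistic without any of the low-count handling that Deequ may apply. All function names are hypothetical.

```python
def relative_frequencies(counts):
    """Convert a {category: count} mapping into {category: share of total}."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {key: value / total for key, value in counts.items()}


def categorical_linf(reference_counts, current_counts):
    """Largest absolute difference in relative frequency over all categories."""
    ref = relative_frequencies(reference_counts)
    cur = relative_frequencies(current_counts)
    return max(abs(ref.get(k, 0.0) - cur.get(k, 0.0)) for k in set(ref) | set(cur))


def chi_squared_statistic(reference_counts, current_counts):
    """Textbook chi-squared statistic of current counts against reference shares."""
    ref = relative_frequencies(reference_counts)
    unseen = set(current_counts) - {k for k, p in ref.items() if p > 0}
    if unseen:
        # The expected count would be zero, so the statistic is undefined.
        raise ValueError(f"categories missing from reference: {sorted(unseen)}")

    n_current = sum(current_counts.values())
    statistic = 0.0
    for key, share in ref.items():
        if share == 0:
            continue  # category never observed in either sample
        expected = share * n_current
        observed = current_counts.get(key, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic
```

For example, with reference counts {"a": 5, "b": 5} and current counts {"a": 8, "b": 2}, the L-infinity distance is about 0.3 and the chi-squared statistic is about 3.6. Turning the statistic into a p-value would need the chi-squared distribution with (number of categories - 1) degrees of freedom, which is left out here.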