awslabs / python-deequ

Python API for Deequ
Apache License 2.0

How do I use the binningUdf parameter #117

Open ashildkummen opened 1 year ago

ashildkummen commented 1 year ago

I am not able to use the binningUdf parameter of the Histogram analyzer. The addAnalyzer call fails with the following error:

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_9110/4276621302.py in <cell line: 3>()
      6                     .addAnalyzer(Mean("star_rating")) \
----> 8                     .addAnalyzer(Histogram("star_rating", binningUdf= lambda x: x)) \
      9                     .run()
     10 

~/anaconda3/envs/python3/lib/python3.10/site-packages/pydeequ/analyzers.py in addAnalyzer(self, analyzer)
    132         """
    133         analyzer._set_jvm(self._jvm)
--> 134         _analyzer_jvm = analyzer._analyzer_jvm
    135         self._AnalysisRunBuilder.addAnalyzer(_analyzer_jvm)
    136         return self

~/anaconda3/envs/python3/lib/python3.10/site-packages/pydeequ/analyzers.py in _analyzer_jvm(self)
    460         return self._deequAnalyzers.Histogram(
    461             self.column,
--> 462             self._jvm.scala.Option.apply(self.binningUdf),
    463             self.maxDetailBins,
    464             self._jvm.scala.Option.apply(self.where),

~/anaconda3/envs/python3/lib/python3.10/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1311 
   1312     def __call__(self, *args):
-> 1313         args_command, temp_args = self._build_args(*args)
   1314 
   1315         command = proto.CALL_COMMAND_NAME +\

~/anaconda3/envs/python3/lib/python3.10/site-packages/py4j/java_gateway.py in _build_args(self, *args)
   1281 
   1282         args_command = "".join(
-> 1283             [get_command_part(arg, self.pool) for arg in new_args])
   1284 
   1285         return args_command, temp_args

~/anaconda3/envs/python3/lib/python3.10/site-packages/py4j/java_gateway.py in <listcomp>(.0)
   1281 
   1282         args_command = "".join(
-> 1283             [get_command_part(arg, self.pool) for arg in new_args])
   1284 
   1285         return args_command, temp_args

~/anaconda3/envs/python3/lib/python3.10/site-packages/py4j/protocol.py in get_command_part(parameter, python_proxy_pool)
    296             command_part += ";" + interface
    297     else:
--> 298         command_part = REFERENCE_TYPE + parameter._get_object_id()
    299 
    300     command_part += "\n"

AttributeError: 'function' object has no attribute '_get_object_id'
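
Reading the last frame, py4j appears to fall back to treating the argument as a reference to a JVM object and calls _get_object_id() on it. A plain Python callable has no such method, which can be checked without Spark. This is a minimal sketch based only on the traceback above, not on pydeequ internals beyond what it shows:

```python
# The same kind of value that was passed as binningUdf in the failing call.
binning = lambda x: x

# py4j's fallback branch calls parameter._get_object_id(); a plain
# Python function does not define it, hence the AttributeError.
has_object_id = hasattr(binning, "_get_object_id")
print(has_object_id)  # False
```

So the failure seems to happen while the argument is being sent to the JVM, before any binning logic runs.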

I have tried using a simple lambda function that actually does no binning but returns its input as output:

.addAnalyzer(Histogram("star_rating",binningUdf=lambda x: x))

To Reproduce

Steps to reproduce the behavior:

  1. Go to the tutorial on analyzers
  2. Scroll down to command 3 and add .addAnalyzer(Histogram("star_rating", binningUdf=lambda x: x))

Expected behavior

I would expect it to work just as it does without binningUdf (i.e. just .addAnalyzer(Histogram("star_rating"))). Some more documentation on how to use the binningUdf parameter would be great.

vishaalkapoor commented 1 year ago

I'm seeing the same error. I tried using a UDF as well, e.g.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
binningUdf = udf(lambda z: int(z), returnType=IntegerType())

Same error. Maybe it has something to do with passing functions in general...

In any case, the workaround I'm going to use is to simply apply the UDF ahead of the Histogram method and apply the histogram to the dummy column.

df = df.withColumn("dummy", binningUdf(df['Column']))
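
A fuller sketch of that workaround, assuming a SparkSession named spark and a DataFrame with a star_rating column as in the tutorial. The binning rule, the function names, and the star_rating_bin column name are hypothetical and only for illustration. The Spark and pydeequ imports are kept inside the helper so the binning rule can be tried on its own:

```python
def bin_star_rating(rating):
    """Hypothetical binning rule: map a 1-5 rating to a coarse label."""
    if rating is None:
        return None
    rating = int(rating)
    if rating <= 2:
        return "low"
    if rating == 3:
        return "medium"
    return "high"


def histogram_of_binned_column(spark, df, column="star_rating",
                               binned="star_rating_bin"):
    """Bin `column` with a Spark UDF, then run Histogram on the result."""
    # Local imports: these need a working Spark / pydeequ installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Histogram

    # Apply the binning ahead of the analyzer instead of via binningUdf.
    binning_udf = udf(bin_star_rating, StringType())
    binned_df = df.withColumn(binned, binning_udf(df[column]))

    result = (
        AnalysisRunner(spark)
        .onData(binned_df)
        .addAnalyzer(Histogram(binned))  # no binningUdf argument
        .run()
    )
    return AnalyzerContext.successMetricsAsDataFrame(spark, result)
```

Calling histogram_of_binned_column(spark, df).show() should then list the histogram metrics for the binned column. This only sidesteps the binningUdf parameter and does not fix it.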

chenliu0831 commented 1 year ago

Looks like a bug - @ashildkummen does vishaalkapoor's workaround work for you?