awslabs / python-deequ

Python API for Deequ
Apache License 2.0

hasPattern I think is broken #152

Closed gerileka closed 9 months ago

gerileka commented 10 months ago

I am trying to follow this tutorial using the master version of the package.

https://github.com/awslabs/python-deequ/blob/aff4be66d09ceb7b1ff1b41c1a98fec509a35c03/tutorials/hasPattern_check.ipynb#L14

Running the following line raises this error:

        check.hasPattern(column='email',
                         pattern=r".*@baz.com",
                         assertion=lambda x: x == 1/3)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[37], line 9
      1 check = Check(spark, CheckLevel.Error, "Integrity checks")
      3 checkResult = VerificationSuite(spark) \
      4     .onData(df) \
      5     .addCheck(
      6         check.hasPattern(column='email',
      7                          pattern=r".*@baz.com",
      8                          assertion=lambda x: x == 1/3) \
----> 9         .hasPattern(column='a',
     10                          pattern=r"ba(r|z)",
     11                          assertion=lambda x: x == 0/3) \
     12         .hasPattern(column='email',
     13                      pattern=r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""",
     14                      assertion=lambda x: x == 1.0)) \
     15     .run()

AttributeError: 'NoneType' object has no attribute 'hasPattern'

The cause is that hasPattern currently has an empty body, so it returns None and the next chained call fails. Is this function still supported?

https://github.com/awslabs/python-deequ/blob/aff4be66d09ceb7b1ff1b41c1a98fec509a35c03/pydeequ/checks.py#L554
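The failure mode can be reproduced without Spark. A method whose body is only a docstring implicitly returns None, so the next call in a fluent chain raises the same kind of AttributeError. This is a minimal sketch: `BrokenCheck`, `FixedCheck`, and `chain_error_message` are hypothetical stand-ins written for illustration, not pydeequ code.

```python
class BrokenCheck:
    """Hypothetical stand-in for a Check whose method body is only a docstring."""

    def hasPattern(self, column, pattern, assertion=None):
        """Docstring only: Python implicitly returns None here."""


class FixedCheck:
    """Same stand-in, but the method records the constraint and returns self."""

    def __init__(self):
        self.constraints = []

    def hasPattern(self, column, pattern, assertion=None):
        self.constraints.append((column, pattern))
        return self  # returning self is what makes chaining possible


def chain_error_message(check):
    """Chain two hasPattern calls; return the AttributeError text, or None on success."""
    try:
        check.hasPattern("email", r".*@baz.com").hasPattern("a", r"ba(r|z)")
    except AttributeError as exc:
        return str(exc)
    return None


print(chain_error_message(BrokenCheck()))  # an AttributeError message mentioning NoneType
print(chain_error_message(FixedCheck()))   # None, because the chain completes
```

The second class shows why every chainable method has to end with `return self`.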

gerileka commented 10 months ago

I think the fix would look like the following, which existed in previous versions:

    def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
        """
        Checks for pattern compliance. Given a column name and a regular expression, defines a
        Check on the average compliance of the column's values to the regular expression.

        :param str column: Column in DataFrame to be checked
        :param str pattern: The regular expression that the column's values are checked against.
        :param lambda assertion: A function with an int or float parameter.
        :param str name: A name for the pattern constraint.
        :param str hint: A hint that states why a constraint could have failed.
        :return: hasPattern self: A Check object that runs the condition on the column.
        """
        assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion if assertion else lambda x: x == 1)
        name = self._jvm.scala.Option.apply(name)
        hint = self._jvm.scala.Option.apply(hint)
        pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
        self._Check = self._Check.hasPattern(column, pattern_regex, assertion_func, name, hint)
        return self

chenliu0831 commented 10 months ago

I just merged #66, which should address this. Once CI passes on master, feel free to test again.

gerileka commented 10 months ago

> I just merged #66, which should address this. Once CI passes on master, feel free to test again.

Hello, thanks for your quick response.

@chenliu0831 I now get this error when I use the new implementation:

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
Cell In[17], line 6
      1 check = Check(spark, CheckLevel.Warning, "Review Check")
      3 checkResult = (VerificationSuite(spark) 
      4     .onData(orders_reference_mock) 
      5     .addCheck(
----> 6         check 
      7         .hasPattern(column = "concept_id", pattern="[0-9a-fA-F]")
      8         .isUnique("id")
      9         .hasPattern(column = "id", pattern=r"[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}")
     10         .hasMin("gtv", lambda x: x == 30.0) 
     11         .hasMax("gtv", lambda x: x == 50.0) 
     12     )
     13     .run())
     15 checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)

File ~/.local/lib/python3.10/site-packages/pydeequ/checks.py:568, in Check.hasPattern(self, column, pattern, assertion, name, hint)
    554 def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    555     """
    556     Checks for pattern compliance. Given a column name and a regular expression, defines a
    557     Check on the average compliance of the column's values to the regular expression.
   (...)
    565     :return: hasPattern self: A Check object that runs the condition on the column.
    566     """
    567     assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) if assertion \
--> 568         else getattr(self._Check, "hasPattern$default$2")()
    569     name = self._jvm.scala.Option.apply(name)
    570     hint = self._jvm.scala.Option.apply(hint)

File /pyenv/versions/3.10.11/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File /pyenv/versions/3.10.11/lib/python3.10/site-packages/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File /pyenv/versions/3.10.11/lib/python3.10/site-packages/py4j/protocol.py:330, in get_return_value(answer, gateway_client, target_id, name)
    326         raise Py4JJavaError(
    327             "An error occurred while calling {0}{1}{2}.\n".
    328             format(target_id, ".", name), value)
    329     else:
--> 330         raise Py4JError(
    331             "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332             format(target_id, ".", name, value))
    333 else:
    334     raise Py4JError(
    335         "An error occurred while calling {0}{1}{2}".
    336         format(target_id, ".", name))

Py4JError: An error occurred while calling o122.hasPattern$default$2. Trace:
py4j.Py4JException: Method hasPattern$default$2([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Unknown Source)

FYI: AnalysisRunner works fine, though. Thank you.

gerileka commented 10 months ago

Oh, never mind. Apparently the assertion needs to be set explicitly:

    .hasPattern(column = "concept_id",  
          pattern=r"[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}",
          assertion=lambda x: x == 1/1)

chenliu0831 commented 10 months ago

@gerileka Nice! The error message is obscure in that case, almost a red herring. I will start planning the next release this weekend.
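Until a release with the fix is available, one workaround is to always pass an assertion explicitly. This is a minimal sketch, assuming a hypothetical helper `has_pattern_strict` (not part of pydeequ) that falls back to the `lambda x: x == 1` default from the older implementation quoted earlier in this thread. `FakeCheck` exists only so the snippet runs without Spark.

```python
def has_pattern_strict(check, column, pattern, assertion=None, name=None, hint=None):
    """Hypothetical helper: never forward assertion=None to check.hasPattern."""
    if assertion is None:
        # Default borrowed from the older implementation: require full compliance.
        assertion = lambda x: x == 1
    return check.hasPattern(column=column, pattern=pattern,
                            assertion=assertion, name=name, hint=hint)


class FakeCheck:
    """Stand-in for pydeequ's Check that only records the arguments it receives."""

    def __init__(self):
        self.calls = []

    def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
        self.calls.append({"column": column, "pattern": pattern, "assertion": assertion})
        return self  # keep the fluent interface


check = FakeCheck()
has_pattern_strict(check, "concept_id", r"[0-9a-fA-F]")
has_pattern_strict(check, "email", r".*@baz.com", assertion=lambda x: x == 1/3)

# Every recorded call carries a callable assertion, never None.
print([callable(call["assertion"]) for call in check.calls])
```

With a real `Check` object the helper returns whatever `hasPattern` returns, so further checks can still be chained onto the result.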

mouadhelfekih commented 9 months ago

@chenliu0831 The pull request has been merged. Do you think a new tag will be created soon to generate a new version on PyPI?

chenliu0831 commented 9 months ago

Yes, this seems like an important bug fix. Doing the release now.

https://github.com/awslabs/python-deequ/pull/155

chenliu0831 commented 9 months ago

Released to PyPI: https://pypi.org/project/pydeequ/1.1.1/. Closing.