gerileka closed this issue 9 months ago
I would say a solution would look like the following, which existed in previous versions:
```python
def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    """
    Checks for pattern compliance. Given a column name and a regular expression, defines a
    Check on the average compliance of the column's values to the regular expression.

    :param str column: Column in DataFrame to be checked
    :param str pattern: The regular expression the column's values are checked against.
    :param lambda assertion: A function with an int or float parameter.
    :param str name: A name for the pattern constraint.
    :param str hint: A hint that states why a constraint could have failed.
    :return: hasPattern self: A Check object that runs the condition on the column.
    """
    assertion_func = ScalaFunction1(
        self._spark_session.sparkContext._gateway,
        assertion if assertion else lambda x: x == 1,
    )
    name = self._jvm.scala.Option.apply(name)
    hint = self._jvm.scala.Option.apply(hint)
    pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
    self._Check = self._Check.hasPattern(column, pattern_regex, assertion_func, name, hint)
    return self
```
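For intuition, the "average compliance" metric this check asserts on can be sketched in plain Python. This is an illustrative analogue only, not Deequ's actual implementation (Deequ evaluates the regex on the JVM via Spark); it assumes substring matching semantics:

```python
import re

def pattern_compliance(values, pattern):
    """Rough analogue of the metric hasPattern asserts on: the fraction
    of values containing a match for the regex."""
    regex = re.compile(pattern)
    return sum(1 for v in values if regex.search(v)) / len(values)

# With the default assertion lambda x: x == 1, the constraint only
# passes when every value matches.
print(pattern_compliance(["ab12", "ffff"], r"[0-9a-fA-F]+"))  # 1.0
print(pattern_compliance(["zz--", "ff"], r"[0-9a-fA-F]+"))    # 0.5
```

This is why the default assertion is `lambda x: x == 1`: full compliance of the column yields a metric of exactly 1.0.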
I just merged #66, which should address this. Pending CI passing on master; feel free to test again.
Hello, thanks for your quick response.
I get this error now when I use the new implementation, @chenliu0831:
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
Cell In[17], line 6
1 check = Check(spark, CheckLevel.Warning, "Review Check")
3 checkResult = (VerificationSuite(spark)
4 .onData(orders_reference_mock)
5 .addCheck(
----> 6 check
7 .hasPattern(column = "concept_id", pattern="[0-9a-fA-F]")
8 .isUnique("id")
9 .hasPattern(column = "id", pattern=r"[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}")
10 .hasMin("gtv", lambda x: x == 30.0)
11 .hasMax("gtv", lambda x: x == 50.0)
12 )
13 .run())
15 checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
File ~/.local/lib/python3.10/site-packages/pydeequ/checks.py:568, in Check.hasPattern(self, column, pattern, assertion, name, hint)
554 def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
555 """
556 Checks for pattern compliance. Given a column name and a regular expression, defines a
557 Check on the average compliance of the column's values to the regular expression.
(...)
565 :return: hasPattern self: A Check object that runs the condition on the column.
566 """
567 assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) if assertion \
--> 568 else getattr(self._Check, "hasPattern$default$2")()
569 name = self._jvm.scala.Option.apply(name)
570 hint = self._jvm.scala.Option.apply(hint)
File /pyenv/versions/3.10.11/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316 self.command_header +\
1317 args_command +\
1318 proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()
File /pyenv/versions/3.10.11/lib/python3.10/site-packages/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
188 def deco(*a: Any, **kw: Any) -> Any:
189 try:
--> 190 return f(*a, **kw)
191 except Py4JJavaError as e:
192 converted = convert_exception(e.java_exception)
File /pyenv/versions/3.10.11/lib/python3.10/site-packages/py4j/protocol.py:330, in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
--> 330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
333 else:
334 raise Py4JError(
335 "An error occurred while calling {0}{1}{2}".
336 format(target_id, ".", name))
Py4JError: An error occurred while calling o122.hasPattern$default$2. Trace:
py4j.Py4JException: Method hasPattern$default$2([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Unknown Source)
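For context on the `hasPattern$default$2` error: the Scala compiler encodes the default argument of the Nth parameter of a method as a synthetic accessor named `method$default$N` (1-based), which is what pydeequ reaches for via `getattr`. If, as seems likely from Deequ's Scala signature, `assertion` is the third parameter (after `column` and `pattern`), the generated accessor would be `hasPattern$default$3`, which would explain why asking the gateway for `hasPattern$default$2` fails with "Method ... does not exist". A tiny helper illustrating the naming convention (the parameter position for `assertion` is my assumption about Deequ's signature, not confirmed in this thread):

```python
def scala_default_accessor(method: str, param_position: int) -> str:
    """Name of the synthetic method the Scala compiler emits for a
    default argument; param_position is 1-based in the signature."""
    return f"{method}$default${param_position}"

# What the failing pydeequ code asked the JVM for:
print(scala_default_accessor("hasPattern", 2))  # hasPattern$default$2
# If assertion is the 3rd parameter of the Scala Check.hasPattern,
# the accessor that actually exists would instead be:
print(scala_default_accessor("hasPattern", 3))  # hasPattern$default$3
```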
FYI: AnalysisRunner works well though, thank you.
Oh, never mind, apparently the assertion needs to be set:

```python
.hasPattern(
    column="concept_id",
    pattern=r"[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}",
    assertion=lambda x: x == 1/1,
)
```
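That matches the metric's semantics: hasPattern computes the fraction of rows matching the pattern, so full compliance is exactly 1.0 (and `1/1` in Python is just `1.0`). A quick plain-Python sanity check, using made-up UUID-shaped ids as illustrative sample data (not values from this thread):

```python
import re

# UUID-shaped pattern, as in the check above
UUID_PATTERN = (
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

ids = [
    "123e4567-e89b-12d3-a456-426614174000",  # hypothetical sample values
    "00000000-0000-0000-0000-000000000000",
]
compliance = sum(bool(re.fullmatch(UUID_PATTERN, i)) for i in ids) / len(ids)
assertion = lambda x: x == 1 / 1  # equivalent to x == 1.0
print(compliance, assertion(compliance))  # 1.0 True
```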
@gerileka nice! The error message seems obscure in that case, almost a red herring. I will start planning the next release this weekend.
@chenliu0831 The pull request has been merged. Do you think a new tag will be created soon to generate a new version on PyPI?
Yes, this seems like an important bug fix. Doing the release now.
Released to PyPI - https://pypi.org/project/pydeequ/1.1.1/. Closing.
I am trying to follow this tutorial using the master version of the package:
https://github.com/awslabs/python-deequ/blob/aff4be66d09ceb7b1ff1b41c1a98fec509a35c03/tutorials/hasPattern_check.ipynb#L14
Running the following line produces the following problem:

This comes from the fact that `hasPattern` is effectively an empty function. Is this function still supported?
https://github.com/awslabs/python-deequ/blob/aff4be66d09ceb7b1ff1b41c1a98fec509a35c03/pydeequ/checks.py#L554