awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Modify 'PatternMatch' function in analyzers.py and 'hasPattern' function in checks.py #66

Closed jasonlin0189impv closed 11 months ago

jasonlin0189impv commented 3 years ago

Description of changes: Hi, I've modified some code in the 'PatternMatch' function. The previous version had an error in pattern_regex, so it would not match anything. I also found that the code in 'hasPattern' was missing, so I added it back from ver0.1.7.

Here is some testing from after I modified 'PatternMatch': [screenshot: testing]

sungwy-backup commented 2 years ago

I am upvoting this PR. Seeing the same issue with PatternMatch:

# Assumes an active SparkSession named `spark` with the Deequ jar on its classpath
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, PatternMatch

dept = [
    ("Finance", 10),
    ("Marketing", 20),
    ("Sales", 30),
    ("IT", 40),
]
deptColumns = ["dept_name", "dept_id"]
deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
analysis_runner = AnalysisRunner(spark).onData(deptDF)

# Both patterns should match at least some rows of dept_name
analysis_runner = analysis_runner.addAnalyzer(PatternMatch("dept_name", ".*"))
analysis_runner = analysis_runner.addAnalyzer(PatternMatch("dept_name", "([A-Za-z]{3,7})"))
analysis_result = analysis_runner.run()
analysis_result_json = AnalyzerContext.successMetricsAsJson(
    spark, analysis_result
)
print(analysis_result_json)

# Cross-check: count the matching rows with Spark's own rlike
print(deptDF.filter(deptDF.dept_name.rlike(".*")).count())
print(deptDF.filter(deptDF.dept_name.rlike("([A-Za-z]{3,7})")).count())

[{'entity': 'Column', 'instance': 'dept_name', 'name': 'PatternMatch', 'value': 0.0}, {'entity': 'Column', 'instance': 'dept_name', 'name': 'PatternMatch', 'value': 0.0}]

4
3
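For comparison, here is a minimal plain-Python sketch of the values one would expect instead of 0.0. It assumes the PatternMatch metric is the fraction of rows in which the regex finds a match somewhere in the value, which is the same reading the rlike counts above rely on. The helper name `match_fraction` is hypothetical and is not part of pydeequ; Python's `re` stands in for the JVM regex engine, which behaves the same way for these two simple patterns.

```python
import re


def match_fraction(values, pattern):
    """Return the fraction of values in which `pattern` matches somewhere.

    Hypothetical helper for illustration only; not a pydeequ API.
    """
    if not values:
        return 0.0
    regex = re.compile(pattern)
    # Search semantics: a match anywhere in the string counts, as with rlike
    hits = sum(1 for v in values if v is not None and regex.search(v))
    return hits / len(values)


dept_names = ["Finance", "Marketing", "Sales", "IT"]

print(match_fraction(dept_names, r".*"))              # 1.0  (4 of 4 rows)
print(match_fraction(dept_names, r"([A-Za-z]{3,7})"))  # 0.75 (3 of 4 rows; "IT" is too short)
```

Under that assumption, the 4 and 3 from the rlike counts correspond to 1.0 and 0.75, so a value of 0.0 for both analyzers cannot be right.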

Dudar99 commented 2 years ago

I am not a contributor here, but I can also confirm that it works after this MR. Having this fix would be very helpful for me and many others, since at the moment the value is always 0.0, even for values that match.

chenliu0831 commented 1 year ago

@jasonlin0189impv would you mind adding a unit test? We could take this PR over if you don't have time. Thanks!

jasonlin0189impv commented 1 year ago

Hi @chenliu0831, I have added unit tests for checker.hasPattern and analyzer.PatternMatch. Let me know if there is any problem, thanks!

jasonlin0189impv commented 1 year ago

Hi @chenliu0831, is the test case OK? Or are there any modifications you would suggest?

chenliu0831 commented 11 months ago

@jasonlin0189impv I'm sorry, this fell out of my GitHub notifications. Thanks for fixing the issues and for the contribution.