awslabs / python-deequ

Python API for Deequ
Apache License 2.0

ConstraintSuggestionRunner generates incorrect code_for_constraint when ")" met in dataframe values #51

Open miltad opened 3 years ago

miltad commented 3 years ago

Describe the bug: The static method __s2p_filter in the ConstraintSuggestionRunner class produces an incorrect code_for_constraint in the generated constraint suggestions.

@staticmethod
def __s2p_filter(code: str):
    """
    Scala -> Python translator for the constraint suggestions code
    A method that returns the python translation of the scala constraint suggestion

    :param str code: a scala constraint suggestion

    :return code that is translated to look more like python code
    """
    if ' _ ' in code:
        code = code.replace(' _ ', ' lambda x: x ')

    if 'Some(' in code:
        # Usually at the end as 'where' or 'hint' strings as optional
        code = code.replace('Some(', '')[:-1]

    if 'Array(' in code:
        # TODO: what if multiple?
        # TODO: Probz redo with regex
        start = code.index('Array(') + len('Array(')
        for idx in range(start, len(code)):
            if code[idx] == ')':
                code = code[:idx] + ']' + code[idx + 1:]
                code = code.replace('Array(', '[')
                break

    if 'Seq(' in code:
        # TODO: what if multiple?
        # TODO: Probz redo with regex
        start = code.index('Seq(') + len('Seq(')
        for idx in range(start, len(code)):
            if code[idx] == ')':
                code = code[:idx] + ']' + code[idx + 1:]
                code = code.replace('Seq(', '[')
                break

    return code

As the code above shows, the method looks for the first occurrence of ")" after Array( or Seq( and replaces it with "]". When a column value itself contains ")", that first occurrence is inside the value, so the generated constraint suggestion is wrong.

For example, if a DataFrame column contains the value "some string input (with additional explanation here)", the method replaces the ")" inside that value with "]" and produces a code_for_constraint such as .isContainedIn("d", ["some string input (with additional explanation here]")), which is not valid Python.
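The failure can be reproduced without Spark. The following is a minimal sketch that copies only the Array/Seq branch of the method quoted above into a standalone helper; the helper names (first_paren_to_bracket, is_valid_python) and the exact Scala input string are assumptions made for illustration, not part of pydeequ.

```python
import ast


def first_paren_to_bracket(code: str, name: str) -> str:
    """Mimic the Array(/Seq( branch of __s2p_filter (hypothetical standalone copy)."""
    opener = name + "("
    if opener in code:
        start = code.index(opener) + len(opener)
        for idx in range(start, len(code)):
            if code[idx] == ")":
                # The first ")" is replaced, even if it sits inside a quoted value
                code = code[:idx] + "]" + code[idx + 1:]
                code = code.replace(opener, "[")
                break
    return code


def is_valid_python(expr: str) -> bool:
    """Return True if `expr` parses as Python source."""
    try:
        ast.parse(expr)
        return True
    except SyntaxError:
        return False


# Assumed shape of the Scala suggestion for the column described above
scala_code = (
    '.isContainedIn("d", Array("some string input", '
    '"some string input (with additional explanation here)"))'
)

broken = first_paren_to_bracket(scala_code, "Array")
print(broken)
# Prefix a dummy receiver so the method-call fragment can be parsed on its own
print(is_valid_python("check" + broken))
```

The "]" lands inside the quoted value, the list is never closed, and the result does not parse.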

To Reproduce: Here is the code from the tutorial, adjusted to show the incorrect behavior:

from pydeequ.suggestions import *
from pyspark.sql import SparkSession, Row

# `spark` is assumed to be an existing SparkSession set up for PyDeequ, as in the tutorial
df = (spark.sparkContext.parallelize([
    Row(d="some string input (with additional explanation here)"),
    Row(d="some string input (with additional explanation here)"),
    Row(d="some string input"),
    Row(d="some string input"),
    Row(d="some string input"),
    Row(d="some string input"),
    Row(d="some string input"),
    Row(d="some string input"),
    Row(d="some string input")
]).toDF())

suggestionResult = ConstraintSuggestionRunner(spark) \
    .onData(df) \
    .addConstraintRule(DEFAULT()) \
    .run()

for code in suggestionResult["constraint_suggestions"]:
    print(f"column_name: {code['column_name']}")
    print(f"code_for_constraint: {code['code_for_constraint']}", "\n")

Expected behavior: __s2p_filter should replace the closing parenthesis that actually ends the Scala Array(...) or Seq(...) expression with "]", rather than the first ")" it finds.

Expected value for the To Reproduce section above: .isContainedIn("d", ["some string input", "some string input (with additional explanation here)"])

Value currently produced by the To Reproduce section: .isContainedIn("d", ["some string input", "some string input (with additional explanation here]"))
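One possible direction for a fix is to find the matching closing parenthesis while skipping anything inside double-quoted string literals. The sketch below is only an illustration of that idea, not the maintainers' implementation: convert_collection is a hypothetical helper, and it assumes the Scala code quotes values with double quotes and escapes embedded quotes with a backslash. It would still be confused by a value that itself contains the text Array( or Seq(.

```python
def convert_collection(code: str, name: str) -> str:
    """Rewrite every Scala `name(...)` call in `code` as a Python list literal.

    Hypothetical helper: parentheses inside double-quoted strings are ignored.
    """
    opener = name + "("
    search_from = 0
    while True:
        start = code.find(opener, search_from)
        if start == -1:
            return code
        body_start = start + len(opener)
        depth = 1          # open parentheses that still need a match
        in_string = False  # True while inside a double-quoted literal
        escaped = False    # True right after a backslash inside a literal
        close_idx = -1
        for idx in range(body_start, len(code)):
            ch = code[idx]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    close_idx = idx
                    break
        if close_idx == -1:
            # Unbalanced input: leave the remaining text untouched
            return code
        code = (code[:start] + "[" + code[body_start:close_idx]
                + "]" + code[close_idx + 1:])
        search_from = start + 1


# Assumed shape of the Scala suggestion from the To Reproduce section
scala_code = (
    '.isContainedIn("d", Array("some string input", '
    '"some string input (with additional explanation here)"))'
)
fixed = convert_collection(scala_code, "Array")
print(fixed)
```

Because the loop keeps searching after each replacement, this version also covers the "what if multiple?" TODO in the original method, at the cost of a slightly longer scan than a single str.replace.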
