MrPowers / quinn

pyspark methods to enhance developer productivity 📣 👯 🎉
https://mrpowers.github.io/quinn/
Apache License 2.0

bug/documentation: How is anti_trim meant to work? #227

Closed: TomBurdge closed this issue 2 days ago

TomBurdge commented 1 week ago

The docstring for the anti_trim function describes its behaviour as follows: "Remove whitespace from the boundaries of col using the regexp_replace function"

So, I would expect the function to do just that, restated: remove leading and trailing whitespace from the string. This description sounds much more like "trim" than "anti-trim". So, if I had " I like fish ", I would expect "I like fish".

However, the test for anti-trim for the above string returns " Ilikefish ".

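    import chispa
    import quinn
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # `spark` is assumed to be an existing SparkSession (e.g. a test fixture).
    # Each row is (input string, expected output of anti_trim).
    df = quinn.create_df(
        spark,
        [
            ("  I like     fish  ", "  Ilikefish  "),
            ("    zombies", "    zombies"),
            ("  simpsons   cat lady   ", "  simpsonscatlady   "),
            (None, None),
        ],
        [
            ("words", StringType(), True),
            ("expected", StringType(), True),
        ],
    )
    actual_df = df.withColumn("words_anti_trimmed", quinn.anti_trim(F.col("words")))
    chispa.assert_column_equality(actual_df, "words_anti_trimmed", "expected")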

Perhaps this is how it's meant to work? Although I'm not sure about the use case for this transformation if so.

I have raised this as bug/docs because I am not sure which of the following applies:

  1. The code is working as it should. I am misunderstanding and the docstring could be clearer.
  2. The code is not working as intended, and the test is incorrect.

nijanthanvijayakumar commented 2 days ago

@TomBurdge - I reckon the doc-string for that function is incorrect, or at least not aligned with what the function does.

If you look at the description in README.md for that function, it matches the functionality implemented.
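To make the tested behaviour concrete, here is a minimal plain-Python sketch that reproduces the expected column from the test in the issue: whitespace between words is removed, while leading and trailing whitespace is kept. The helper name `anti_trim_sketch` and the pattern `\b\s+\b` are assumptions chosen for illustration because they reproduce those expected values; this is not presented as quinn's actual implementation.

```python
import re
from typing import Optional

# Assumed pattern: a run of whitespace with a word boundary on both sides,
# i.e. whitespace sitting directly between two word characters.
_INNER_WHITESPACE = re.compile(r"\b\s+\b")


def anti_trim_sketch(value: Optional[str]) -> Optional[str]:
    """Hypothetical helper: drop inner whitespace, keep the outer whitespace."""
    if value is None:
        # Mirrors the (None, None) row in the test data.
        return None
    return _INNER_WHITESPACE.sub("", value)


# Same rows as the test quoted in the issue: (input, expected).
cases = [
    ("  I like     fish  ", "  Ilikefish  "),
    ("    zombies", "    zombies"),
    ("  simpsons   cat lady   ", "  simpsonscatlady   "),
    (None, None),
]

for words, expected in cases:
    assert anti_trim_sketch(words) == expected
```

With this particular pattern, whitespace next to punctuation is left alone, because there is no word boundary between a space and a punctuation character. Whether the real function behaves the same way in that case is not something the test above covers.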

I will try raising a PR to correct the doc-string.

FYI - @SemyonSinchenko and @MrPowers

nijanthanvijayakumar commented 2 days ago

take

nijanthanvijayakumar commented 2 days ago

Created PR #231 to update the doc-string.