Naissant / dendri

Common Healthcare feature engineering algorithms implemented in PySpark.
MIT License

cols_to_array does not support passing Columns #5

Closed rileyschack closed 3 years ago

rileyschack commented 3 years ago

cols_to_array only accepts columns passed as a str, not as a Column. One use case is when a user needs to reference a specific key in a map column:

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (1, 2, {"a": 5}),
        (1, 2, {"a": 6}),
        (1, 2, {"b": 7})
    ],
    ["col1", "col2", "col3"]
)

# Fails: F.col("col3")["a"] is a Column, not a str
df.withColumn("col4", cols_to_array("col1", F.col("col3")["a"]))
rileyschack commented 3 years ago

An obvious workaround is to create the column first. That doesn't work as nicely, though, if the user wants to use a list comprehension to build an array of n columns (with some sort of transformation applied first).

WesRoach commented 3 years ago

I didn't anticipate users referencing a key from a Map column through a Column object. Valid use case.

PySpark comes with all sorts of built-in ways to parse user parameters, and we're probably better off falling back to their solutions. The downside, of course, is that many of the functions they utilize are not user-facing, i.e. _function_name.

The existing solution is written for pre-3.1.0 Spark and requires falling back to Spark SQL built-ins, as the PySpark DataFrame API did not yet contain a reference to the filter function for arrays. PySpark >= 3.1.0 now exposes filter directly in the DataFrame API.

I'm not sure there's any interest in maintaining support for older versions of PySpark at this time given our user base of 6.