MrPowers / bebe

Filling in the Spark function gaps across APIs
50 stars 5 forks

Create df.typedCol method #2

Closed MrPowers closed 3 years ago

MrPowers commented 3 years ago

DataFrames have schemas that contain the name and type for each column.

The built-in column constructors don't use the type information and build generic Column objects. df["some_string"] returns an untyped Column object.

We can add a typedCol method that'll return IntegerColumn, StringColumn, DateColumn, etc. objects based on the schema of the underlying DataFrame.

Suppose the some_string column in the DataFrame is a StringType column. df.typedCol("some_string") should return a StringColumn. Under the hood, it can infer the column type with df.dtypes.
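A minimal sketch of that runtime lookup, written without a Spark dependency so it stands alone. The `IntegerColumn`, `StringColumn`, and `DateColumn` case classes here are hypothetical stand-ins for the proposed wrappers, and the `dtypes` argument is assumed to have the same shape as Spark's `df.dtypes` (an `Array[(String, String)]` of column name and type name pairs). It is an illustration of the idea, not the library's implementation.

```scala
// Hypothetical stand-ins for the typed column wrappers proposed in this issue.
sealed trait TypedColumn { def name: String }
final case class IntegerColumn(name: String) extends TypedColumn
final case class StringColumn(name: String) extends TypedColumn
final case class DateColumn(name: String) extends TypedColumn

object TypedColSketch {
  // Assumed to mirror df.dtypes: (column name, data type name) pairs.
  type DTypes = Array[(String, String)]

  // Look up the column's type name and pick the matching wrapper.
  // Returns None when the column is missing or its type has no wrapper here.
  def typedCol(dtypes: DTypes, colName: String): Option[TypedColumn] =
    dtypes.collectFirst { case (`colName`, tpe) => tpe }.flatMap {
      case "IntegerType" => Some(IntegerColumn(colName))
      case "StringType"  => Some(StringColumn(colName))
      case "DateType"    => Some(DateColumn(colName))
      case _             => None
    }
}
```

Note that the lookup happens on values, so the caller only learns which wrapper it got when the code runs.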

@alfonsorr - can you try to add this if you have a sec? I'm guessing this'll just take you a few mins!

alfonsorr commented 3 years ago

If we are talking about pure DataFrames (no Datasets), this is impossible to determine at compile time. Until we read the schema of the data, we can't be certain of the type of a DataFrame column. The most we can do is validate the type, either returning an Either or throwing the error directly. Something like:

df.typedCol[IntegerColumn]("some_string")

That can result in two possible errors:

I don't know if it's what you were expecting.
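One way the validated variant could look, again as a Spark-free sketch. The thread does not list the two error cases, so this assumes they are a missing column and a type mismatch. `ColumnBuilder`, `ColumnNotFound`, and `ColumnTypeMismatch` are hypothetical names introduced only for illustration, and the schema is passed in as a plain array shaped like `df.dtypes`.

```scala
// Hypothetical typed column wrappers.
sealed trait TypedColumn { def name: String }
final case class IntegerColumn(name: String) extends TypedColumn
final case class StringColumn(name: String) extends TypedColumn

// Assumed error cases; the issue thread does not spell them out.
sealed trait ColumnError
final case class ColumnNotFound(name: String) extends ColumnError
final case class ColumnTypeMismatch(name: String, expected: String, found: String)
    extends ColumnError

// Type class linking a wrapper type to the Spark type name it expects.
trait ColumnBuilder[C <: TypedColumn] {
  def sparkType: String
  def build(name: String): C
}

object ColumnBuilder {
  implicit val intBuilder: ColumnBuilder[IntegerColumn] =
    new ColumnBuilder[IntegerColumn] {
      val sparkType = "IntegerType"
      def build(name: String): IntegerColumn = IntegerColumn(name)
    }
  implicit val stringBuilder: ColumnBuilder[StringColumn] =
    new ColumnBuilder[StringColumn] {
      val sparkType = "StringType"
      def build(name: String): StringColumn = StringColumn(name)
    }
}

object ValidatedColSketch {
  // The caller states the expected wrapper type; the check runs against the schema at runtime.
  def typedCol[C <: TypedColumn](dtypes: Array[(String, String)], colName: String)(
      implicit b: ColumnBuilder[C]): Either[ColumnError, C] =
    dtypes.collectFirst { case (`colName`, tpe) => tpe } match {
      case None => Left(ColumnNotFound(colName))
      case Some(tpe) if tpe != b.sparkType =>
        Left(ColumnTypeMismatch(colName, b.sparkType, tpe))
      case Some(_) => Right(b.build(colName))
    }
}
```

With a schema where `some_string` is a `StringType` column, `typedCol[IntegerColumn](dtypes, "some_string")` yields a `Left` describing the mismatch, while `typedCol[StringColumn](dtypes, "some_string")` yields a `Right`.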

MrPowers commented 3 years ago

@alfonsorr - I didn't realize that df.dtypes was a runtime thing, thanks for clarifying.

I guess we can start with a Dataset method like ds.typedColumn then :|

I still think we're heading in a direction that'll be more useful than what frameless offers. Frameless requires a new case class definition whenever a withColumn method is called.

I think we'll be able to add a withTypedColumn method that'll take the existing DataFrame case class, understand what type of column is being appended, and magically make a new case class under the hood so the user doesn't need to define a new case class every time they add a column. I hope this is possible 🙏

alfonsorr commented 3 years ago

I've pushed a draft for this issue in #3 with an idea of how it can be solved. Take a look and give me some feedback. The method is called get rather than typedCol, but the rest is what was expected.

MrPowers commented 3 years ago

This was added 😎