MrPowers / bebe

Filling in the Spark function gaps across APIs

add bebe_beginning_of_month #4

Closed MrPowers closed 3 years ago

MrPowers commented 3 years ago

Spark has a last_day function that returns the last day of the month.

This PR introduces a bebe_beginning_of_month function, so users have a performant way to compute the beginning of the month.

I plan on porting a lot of the ActiveSupport datetime helper methods. I want to give Spark users easy access to all the common datetime functions, so they don't have to reinvent the wheel when writing business logic.

yaooqinn commented 3 years ago

Is truncDate(dateVal, 'MONTH') the same as this expression?

MrPowers commented 3 years ago

@yaooqinn - I didn't know about truncDate (code reference for interested parties). Let me refactor the code to use truncDate - good catch!
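
For readers following along, here is a minimal sketch of what truncating to the month looks like through the public trunc function in org.apache.spark.sql.functions. The object name TruncMonthSketch, the helper beginningOfMonths, and the some_date column are made up for illustration and are not part of bebe; the sketch assumes a local SparkSession is available.

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trunc}

// Illustrative only: TruncMonthSketch and beginningOfMonths are hypothetical names.
object TruncMonthSketch {
  // Truncates each input date (yyyy-MM-dd) to the first day of its month
  // and returns the results as yyyy-MM-dd strings.
  def beginningOfMonths(spark: SparkSession, dates: Seq[String]): Seq[String] = {
    import spark.implicits._
    dates.map(s => Date.valueOf(s)).toDF("some_date")
      .select(trunc(col("some_date"), "month").as("beginning_of_month"))
      .collect()
      .map(_.getDate(0).toString)
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("trunc-month-sketch")
      .getOrCreate()
    try beginningOfMonths(spark, Seq("2020-01-15", "2020-02-29")).foreach(println)
    finally spark.stop()
  }
}
```

Because trunc is a built-in function, Catalyst can optimize it like any other native expression, which is presumably why reusing it is preferable to defining a new expression.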

MrPowers commented 3 years ago

@yaooqinn - refactored the code, per your comment. Can you take another look and let me know if it looks good now?

yaooqinn commented 3 years ago

For date and timestamp inputs, we should respect the data type in the results too. It seems that we don't need the definition of BeginningOfMonth, and I guess something like

def bebe_beginning_of_month(col: Column): Column = {
  col.expr.dataType match {
    // for timestamps truncating
    case TimestampType => date_trunc("month", col)
    // for dates, strings truncating and fail others
    case _ => trunc("month", col)
  }
}

is enough. The related docs should also be moved here, since this is the API for users.

MrPowers commented 3 years ago

@yaooqinn - I slightly modified your code to this:

col.expr.dataType match {
  // for timestamps truncating
  case TimestampType => date_trunc("month", col)
  // for dates, strings truncating and fail others
  case _ => trunc(col, "month")
}

It drives me crazy that date_trunc takes the format first and trunc takes the format second 🙃 , haha
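
To make the argument-order difference concrete, here is a small sketch that applies both functions to the same timestamp column and reports the result types. The object name TruncArgOrderSketch, the helper resultTypes, and the some_ts column are hypothetical; only date_trunc(format, column) and trunc(column, format) come from org.apache.spark.sql.functions.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_trunc, trunc}
import org.apache.spark.sql.types.DataType

// Illustrative only: TruncArgOrderSketch and resultTypes are hypothetical names.
object TruncArgOrderSketch {
  // Returns the result types of date_trunc and trunc on one timestamp column.
  def resultTypes(spark: SparkSession): (DataType, DataType) = {
    import spark.implicits._
    val df = Seq(Timestamp.valueOf("2020-01-15 10:30:00")).toDF("some_ts")
    val out = df.select(
      date_trunc("month", col("some_ts")).as("via_date_trunc"), // format first
      trunc(col("some_ts"), "month").as("via_trunc")            // format second
    )
    (out.schema("via_date_trunc").dataType, out.schema("via_trunc").dataType)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("trunc-arg-order-sketch")
      .getOrCreate()
    try println(resultTypes(spark))
    finally spark.stop()
  }
}
```

The two functions also differ in what they return: date_trunc keeps a timestamp, while trunc produces a date, which is the reason for matching on the input type in the first place.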

Your suggested match on col.expr.dataType causes the following error: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'some_date

Looks like this cool expr.dataType syntax only works for columns associated with DataFrames. Here's an example of when it works:

val df = Seq(1, 2, 3).toDF("some_num")
println(df("some_num").expr.dataType) // IntegerType

Columns that are not associated with DataFrames, like col("hi"), throw this exception when you call expr.dataType: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'hi

Let me know if I am missing anything. Thanks for teaching me the expr.dataType syntax. That'll come in handy when I have a DataFrame accessible.
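
When a DataFrame is accessible, one possible way to get the type-aware behavior is to read the column's type from the DataFrame schema instead of from the column expression. The sketch below is only an illustration under that assumption: the object SchemaDispatchSketch and the helper beginningOfMonth(df, colName) are hypothetical and are not the signature used in this PR.

```scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{date_trunc, trunc}
import org.apache.spark.sql.types.{DataType, TimestampType}

// Illustrative only: SchemaDispatchSketch and its methods are hypothetical names.
object SchemaDispatchSketch {
  // Looks up the column type in the DataFrame schema, which is always resolved,
  // and picks the truncation function that preserves that type.
  def beginningOfMonth(df: DataFrame, colName: String): Column =
    df.schema(colName).dataType match {
      case TimestampType => date_trunc("month", df(colName)) // stays a timestamp
      case _             => trunc(df(colName), "month")      // dates and strings
    }

  // Returns the result types for a date column and a timestamp column.
  def resultTypes(spark: SparkSession): (DataType, DataType) = {
    import spark.implicits._
    val df = Seq(
      (Date.valueOf("2020-01-15"), Timestamp.valueOf("2020-01-15 10:30:00"))
    ).toDF("some_date", "some_ts")
    val out = df.select(
      beginningOfMonth(df, "some_date").as("date_result"),
      beginningOfMonth(df, "some_ts").as("ts_result")
    )
    (out.schema("date_result").dataType, out.schema("ts_result").dataType)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema-dispatch-sketch")
      .getOrCreate()
    try println(resultTypes(spark))
    finally spark.stop()
  }
}
```

The trade-off is that the caller has to pass the DataFrame along, so this shape does not fit a function that takes only a Column.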

yaooqinn commented 3 years ago

Oh, I see. It is too early to call dataType here 🤐