Kotlin / kotlin-spark-api

This projects gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
Apache License 2.0
459 stars 35 forks source link

[feat] (Typed)Column functions fixes #54 #87

Closed Jolanrensen closed 3 years ago

Jolanrensen commented 3 years ago

While looking at the select() functions, I found a couple of ways to improve the API:

Firstly, similar to the Scala API, a Column can now be created from a dataset using dataset("yourColumn") now, as well as the usual dataset.col("yourColumn").

Secondly, I walked over all Column operator functions in the Scala API and looked at which could be ported over to Kotlin operator or infix funs (< and >= etc. won't work, due to Kotlin limitations). I tried to avoid `backtick` names as they are not that easy to work with as well as staying as close to the names already in the API (so I deprecated == and &&). I'll provide a quick Scala to Kotlin guide below:

Scala    -> Kotlin
a == b   ->  a == b
a !== b  ->  a !== b
a === b  ->  a eq b
a =!= b  ->  a neq b
-a       ->  -a
!a       ->  !a
a > b    ->  a gt b
a < b    ->  a lt b
a <= b   ->  a leq b
a >= b   ->  a geq b
a || b   ->  a or b
a && b   ->  a and b
a + b    ->  a + b
a - b    ->  a - b
a * b    ->  a * b
a / b    ->  a / b
a % b    ->  a % b
a.between(1, 10)  -> a inRangeOf 1..10
a.getItem(0)      ->  a[0]

Let me know what you think!

Finally, to prepare for the select() functions, I looked at the TypedColumn functions. One of these can now be created from a normal Column using a new as() function:

val column: TypedColumn<Any, Int> = col("yourColumn").`as`<Int>()

But also, to get a TypedColumn specific to a certain type of Dataset (which is needed for the select() function), I added a bit of Kotlin reflection into the mix:

val dataset: Dataset<YourClass> = ...
val column: TypedColumn<YourClass, Int> = dataset.col(YourClass::yourColumn)

I also added an invoke-variant of this, so when calling select() on a Dataset you can now do:

dataset.select(dataset(YourClass::columnA), dataset(YourClass::columnB))

and it will return the right kind of typed Dataset :).

Jolanrensen commented 3 years ago

@asm0dey How do you feel about these additions? :)

Jolanrensen commented 3 years ago

Actually, slight improvement. I made a separate col() function as well, not tied to a certain Dataset just like in the org.apache.spark.sql.functions. However, since using reflection we now know the type and TypedDataset inherits from Dataset, the function returns a TypedDataset.

In short:

val dataset: Dataset<YourClass> = ...
val new: Dataset<Tuple2<TypeOfA, TypeOfB>> = dataset.select( col(YourClass::a), col(YourClass::b) )

This works (The method will probably be selectTyped and return Pairs soon in a separate pull request: https://github.com/JetBrains/kotlin-spark-api/pull/88)

asm0dey commented 3 years ago

Could you please update README for this in PR too? I'm not sure it's correct to deprecate backticked functions: they cost us nothing and may be more readable/usable for somebody.

asm0dey commented 3 years ago

Aldo Kotlin docs should definitely contain their Scala counterparts documented

asm0dey commented 3 years ago

refs #54

Jolanrensen commented 3 years ago

@asm0dey shall I add the other functions with backticks as well? It feels wrong to just have those two.

asm0dey commented 3 years ago

@Jolanrensen I think yes, @Meosit clearly stated that they think they may be useful, I think so too.

Jolanrensen commented 3 years ago

@asm0dey == calls $eq$eq$eq, so it should be === right? (it's called equalTo() in Java, different to equals() which uses 2 =s.

asm0dey commented 3 years ago

Yep, triple-eq definitely should be ===

Jolanrensen commented 3 years ago

Yep, triple-eq definitely should be ===

@asm0dey Alright, I'll change that. Unfortunately `>` etc are not allowed since it's an illegal character :(. That can only be gt.

asm0dey commented 3 years ago

Can we use them without backticks, or Column already has compareTo method?

Jolanrensen commented 3 years ago

Can we use them without backticks, or Column already has compareTo method?

@asm0dey Nope, because compareTo needs to return an Int and in the code something < otherThing will always return a Boolean. It cannot return a Column like in Scala.

asm0dey commented 3 years ago

Then we definitely have no choice which is good and bad at the same time :)

Jolanrensen commented 3 years ago

@asm0dey `||` is highlighted as "Name contains characters which can cause problems on Windows: |"... I'm not sure if this only becomes a problem when calling the function or just having it there.

asm0dey commented 3 years ago

Won't compile on Windows hosts, so let's not risk

Jolanrensen commented 3 years ago

@asm0dey alright, is there something I missed?

asm0dey commented 3 years ago

Reviewing it right now + waiting for tests to complete

asm0dey commented 3 years ago

Thank you for your effort!