Kotlin / kotlin-spark-api

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x.
Apache License 2.0
459 stars 35 forks

Feat: (WIP) Stdlib functions #102

Open Jolanrensen opened 3 years ago

Jolanrensen commented 3 years ago

As discussed in issue https://github.com/JetBrains/kotlin-spark-api/issues/100, it would be nice to have more stdlib functions that work with Datasets too, since the stdlib is one of Kotlin's selling points.

I've started converting _Collections.kt from the stdlib to Dataset and have gotten about a third of the way through, up to filterIndexed.

It already contains a lot of helpful functions, like last(), firstOrNull {}, drop(), all {} etc., but there are many left to do. Many functions are faster, but prone to out-of-memory issues, when the Dataset is first converted to an Iterable. This holds for functions like first {}. I plan to have a code inspection plugin hint the user in these cases.
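To illustrate the trade-off, here is a minimal sketch of two ways a firstOrNull {}-style function could be written for a Dataset. The names firstOrNullDistributed and firstOrNullLocal are hypothetical and are not taken from this pull request; only the Spark calls filter(FilterFunction), takeAsList() and collectAsList() are existing API. The distributed variant assumes the predicate is serializable, which depends on how the Kotlin compiler emits lambdas.

```kotlin
import org.apache.spark.api.java.function.FilterFunction
import org.apache.spark.sql.Dataset

// Hypothetical helper names for illustration; not the functions from this PR.

/**
 * Stays distributed: the predicate runs on the executors and at most
 * one matching row is brought back to the driver.
 * Assumes `predicate` can be serialized and shipped to the executors.
 */
fun <T> Dataset<T>.firstOrNullDistributed(predicate: (T) -> Boolean): T? =
    filter(FilterFunction<T> { predicate(it) })
        .takeAsList(1)
        .firstOrNull()

/**
 * Collects the whole Dataset to the driver first, then uses the regular
 * stdlib function. Can be quicker for small data, but the driver may run
 * out of memory when the Dataset is large.
 */
fun <T> Dataset<T>.firstOrNullLocal(predicate: (T) -> Boolean): T? =
    collectAsList().firstOrNull(predicate)
```

Which variant is preferable likely depends on the size of the Dataset, which is the kind of case an inspection hint could point out.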

It's nowhere near done, but since I'm going away for a couple of weeks, I thought it might be good to share the functions I've already created so they can be tested and maybe incorporated into the API itself. Of course, feel free to continue my work in my absence. Many stdlib functions are still left, and the RDD could also use them ;).

asm0dey commented 3 years ago

@Jolanrensen would you be able to fix the conflicts?

Jolanrensen commented 3 years ago

@asm0dey It's nowhere near finished, though, and I'm having second thoughts about the scale of the standard library. Maybe it's a bit too much to add everything for Spark, and we need to take a look at what is helpful and what isn't.

asm0dey commented 3 years ago

IMHO it is, yes