Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Apache Hive (or extensible JDBC) support #807

Open christopherfrieler opened 4 months ago

christopherfrieler commented 4 months ago

I tried to connect to Apache Hive via JDBC and faced a few issues, which I would like to share here. Feel free to discuss my ideas or split them into multiple issues.

  1. When I tried it, I first ran into the IllegalArgumentExceptions from https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt telling me it's not possible. -> Hive support would be nice.
  2. I tried to add Hive support myself (in my own code), but although I could create a Hive instance of DbType, the when-statements in https://github.com/Kotlin/dataframe/blob/master/dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/db/util.kt that check the DbType are not extensible. -> The implementation should be more easily extensible, as new database technologies arise and supporting them in Kotlin Dataframe takes a while, or may never happen if the database is a rather exotic one. I found issues #637 and #549, which also ask for further databases to be supported. Maybe there could be something like a "DbTypeRegistry" where users can register custom DbTypes at runtime, instead of these static when-statements.
  3. I found a workaround (see below) to connect to Hive and read data from it. However, I had to deal with a lot of SQLFeatureNotSupportedExceptions from the Hive JDBC driver, because it does not implement some features, especially regarding metadata. (These may be optional; I'm not a JDBC expert, so I'm not sure whether drivers have to implement them or whether client code should be expected to cope with the missing SQL features.) -> Maybe there could be some kind of fallback implementation that might not be as performant, but would allow working with such incomplete JDBC driver implementations.
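To make the idea in point 2 more concrete, here is a rough sketch of what such a registry could look like. All names here (`DbTypeRegistry`, `SimpleDbType`, `HiveType`) are hypothetical and not part of the Kotlin Dataframe API; the real version would work with the library's `DbType` class instead of the simplified interface used for illustration:

```kotlin
// Simplified stand-in for the library's DbType, for illustration only.
interface SimpleDbType {
    // The identifier between "jdbc:" and the next ":" in connection URLs.
    val dbTypeInJdbcUrl: String
}

// Hypothetical registry that would replace the static when-statements in util.kt.
object DbTypeRegistry {
    private val types = mutableMapOf<String, SimpleDbType>()

    // Users register their own database types at runtime...
    fun register(type: SimpleDbType) {
        types[type.dbTypeInJdbcUrl] = type
    }

    // ...and lookup by URL prefix replaces the closed when-statement.
    fun fromUrl(url: String): SimpleDbType? =
        types.values.firstOrNull { url.startsWith("jdbc:${it.dbTypeInJdbcUrl}:") }
}

// Example: Hive JDBC URLs start with "jdbc:hive2:".
object HiveType : SimpleDbType {
    override val dbTypeInJdbcUrl = "hive2"
}

fun main() {
    DbTypeRegistry.register(HiveType)
    println(DbTypeRegistry.fromUrl("jdbc:hive2://localhost:10000/default"))
}
```

With such a registry, an unsupported database would become a one-time `register(...)` call in user code instead of a library change.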

Finally, my workaround to connect to Hive:

import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.*
import org.jetbrains.kotlinx.dataframe.io.db.DbType
import org.jetbrains.kotlinx.dataframe.schema.ColumnSchema
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.ResultSetMetaData
import kotlin.reflect.KType

private object Hive : DbType("hive") {
    override val driverClassName: String
        get() = "org.apache.hive.jdbc.HiveDriver"

    override fun buildTableMetadata(tables: ResultSet): TableMetadata {
        // Not needed for this workaround: readResultSet() never calls it.
        TODO("Not yet implemented")
    }

    // Returning null here (and below) seems to make the library fall back to
    // its generic JDBC type mapping instead of a Hive-specific one.
    override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? = null

    override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? = null

    override fun isSystemTable(tableMetadata: TableMetadata): Boolean = false
}
}

private class HiveResultSet(private val original: ResultSet) : ResultSet by original {
    private class HiveResultSetMetaData(val original: ResultSetMetaData) : ResultSetMetaData by original {
        // Overridden because the Hive driver does not support this call.
        override fun getTableName(column: Int): String = ""

        // Hive qualifies column names with the table name ("table.column"); strip the prefix.
        override fun getColumnName(column: Int): String {
            return original.getColumnName(column).substringAfter(".")
        }
    }

    override fun getMetaData(): ResultSetMetaData {
        return HiveResultSetMetaData(original.metaData)
    }
}

public fun DataFrame.Companion.readHiveSqlQuery(
    query: String,
    dbConfig: DatabaseConfiguration,
): DataFrame<*> {
    DriverManager.getConnection(dbConfig.url, dbConfig.user, dbConfig.password).use { connection ->
        connection.createStatement().use { statement ->
            statement.executeQuery(query).use { rs ->
                return readResultSet(HiveResultSet(rs), Hive)
            }
        }
    }
}

This way I managed to get my own Hive DbType into readResultSet(). It definitely has limitations, and I'm very unsure about my minimal implementations of the DbType methods. But at least it seems to work so far.
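The delegation trick in HiveResultSetMetaData above could also be generalized into the fallback idea from point 3: try the driver's call and substitute a default when the driver throws SQLFeatureNotSupportedException. A minimal sketch (the helper name `metadataOrDefault` is mine, not an existing Kotlin Dataframe API):

```kotlin
import java.sql.SQLFeatureNotSupportedException

// Hypothetical fallback helper: run a JDBC metadata call and return a
// default value when the driver throws SQLFeatureNotSupportedException,
// as incomplete drivers like Hive's do for optional features.
fun <T> metadataOrDefault(default: T, call: () -> T): T =
    try {
        call()
    } catch (e: SQLFeatureNotSupportedException) {
        default
    }
```

The getTableName override above would then shrink to something like `metadataOrDefault("") { original.getTableName(column) }`, and the library could apply the same pattern everywhere it reads metadata.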

zaleslaw commented 4 months ago

Thanks for sharing! There are a few user requests in the issues about the possibility to register custom SQL dialects, and I believe it will not be possible earlier than the 0.15 release (we are currently finishing 0.14).

But we have some bottlenecks in our plugin for schema generation, and this is the reason why we closed the hierarchy of DB classes. I hope to solve this or suggest a workaround for the problem and be open to new data sources.