Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0
822 stars 58 forks source link

Redshift not supported #549

Open aajtodd opened 9 months ago

aajtodd commented 9 months ago

Attempted to try out the new SQL support.

I used the Amazon provided JDBC driver and the connection oriented APIs. Redshift is a flavor of postgres and the driver should handle the differences/mappings so I figured it may just work.

Example Kotlin notebook

%use dataframe(0.12.0)
@file:DependsOn("../redshift-jdbc42-2.1.0.24.jar")

import com.amazon.redshift.jdbc.Driver
import com.amazon.redshift.jdbc.DataSource

data class DatabaseConfig(
    val endpoint: String,
    val database: String,
    val user: String,
    val password: String,
    val port: Int
) {
    val url: String = "jdbc:redshift://$endpoint:$port/$database"
}

val dbConfig = DatabaseConfig(...)

val ds = DataSource().apply {
    setURL(dbConfig.url)
    user = dbConfig.user
    password = dbConfig.password
}

val query = "..."

val df = ds.getConnection().use { conn ->
    DataFrame.readSqlQuery(conn, query)
 }

Results in:

Only H2, MariaDB, MySQL, SQLite and PostgreSQL are supported!
java.lang.IllegalArgumentException: Unsupported database type in the url: jdbc:redshift://<redacted>
Only H2, MariaDB, MySQL, SQLite and PostgreSQL are supported!
    at org.jetbrains.kotlinx.dataframe.io.db.UtilKt.extractDBTypeFromUrl(util.kt:20)
    at org.jetbrains.kotlinx.dataframe.io.ReadJdbcKt.readSqlQuery(readJdbc.kt:183)
    at org.jetbrains.kotlinx.dataframe.io.ReadJdbcKt.readSqlQuery(readJdbc.kt:168)
    at Line_22_jupyter.<init>(Line_22.jupyter.kts:12)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
    at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.evalWithConfigAndOtherScriptsResults(BasicJvmScriptEvaluator.kt:105)
    at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke$suspendImpl(BasicJvmScriptEvaluator.kt:47)
    at kotlin.script.experimental.jvm.BasicJvmScriptEvaluator.invoke(BasicJvmScriptEvaluator.kt)
    at kotlin.script.experimental.jvm.BasicJvmReplEvaluator.eval(BasicJvmReplEvaluator.kt:49)
    at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl$eval$resultWithDiagnostics$1.invokeSuspend(InternalEvaluatorImpl.kt:107)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:284)
    at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
    at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
    at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
    at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
    at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:107)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:78)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:76)
    at org.jetbrains.kotlinx.jupyter.ReplForJupyterImpl.withHost(repl.kt:681)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:76)
    at org.jetbrains.kotlinx.jupyter.repl.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:16)
    at org.jetbrains.kotlinx.jupyter.ReplForJupyterImpl$evalEx$1.invoke(repl.kt:496)
    at org.jetbrains.kotlinx.jupyter.ReplForJupyterImpl$evalEx$1.invoke(repl.kt:474)
    at org.jetbrains.kotlinx.jupyter.ReplForJupyterImpl.withEvalContext(repl.kt:437)
    at org.jetbrains.kotlinx.jupyter.ReplForJupyterImpl.evalEx(repl.kt:474)
    at org.jetbrains.kotlinx.jupyter.messaging.ProtocolKt$shellMessagesHandler$2$res$1.invoke(protocol.kt:320)
    at org.jetbrains.kotlinx.jupyter.messaging.ProtocolKt$shellMessagesHandler$2$res$1.invoke(protocol.kt:314)
    at org.jetbrains.kotlinx.jupyter.JupyterExecutorImpl$runExecution$execThread$1.invoke(execution.kt:38)
    at org.jetbrains.kotlinx.jupyter.JupyterExecutorImpl$runExecution$execThread$1.invoke(execution.kt:33)
    at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)

It's unclear to me how much work it would be to support Redshift or if using the custom provided driver if it should just work out of the box (using the connection oriented APIs of course). It seems like perhaps the connection oriented APIs though shouldn't be checking the database type or perhaps an escape hatch to disable it?

aajtodd commented 9 months ago

Building the project locally with a custom driver seems to work, haven't tested it much yet:


/**
 * Represents the Amazon Redshift database type.
 */
public object Redshift: DbType("redshift") {
    override val driverClassName: String
        get() = "com.amazon.redshift.Driver"

    override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? = null

    // not supported
    // see https://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-features.html
    override fun isSystemTable(tableMetadata: TableMetadata): Boolean = false

    override fun buildTableMetadata(tables: ResultSet): TableMetadata {
        return TableMetadata(
            tables.metaData.getTableName(1),
            tables.metaData.getSchemaName(1),
            tables.metaData.getCatalogName(1),
        )
    }

    override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? = null
}
public fun extractDBTypeFromUrl(url: String?): DbType {
    if (url != null) {
        return when {
            H2.dbTypeInJdbcUrl in url -> H2
            MariaDb.dbTypeInJdbcUrl in url -> MariaDb
            MySql.dbTypeInJdbcUrl in url -> MySql
            Sqlite.dbTypeInJdbcUrl in url -> Sqlite
            PostgreSql.dbTypeInJdbcUrl in url -> PostgreSql
            Redshift.dbTypeInJdbcUrl in url -> Redshift
            else -> throw IllegalArgumentException("Unsupported database type in the url: $url. " +
                "Only H2, MariaDB, MySQL, SQLite, PostgreSQL, and Amazon Redshift are supported!")
        }
    } else {
        throw SQLException("Database URL could not be null. The existing value is $url")
    }
}

I'm not well versed in JDBC or Redshift so there may be a better way to do some of this. There may also be more to it to actually fully support Redshift IDK. I looked a the postgres type for inspiration. All of the customization centers around money SQL type which isn't supported in Redshift anyway.

It may be nice to allow registering support for custom DB types using a custom driver or perhaps baking in support for Redshift if possible with instructions on how to add the custom driver and throw an exception if not found.

Jolanrensen commented 9 months ago

@zaleslaw is actively working on JDBC support right now and especially the type mapping is bound to undergo some changes in the future. Maybe he can help with a temporary solution in the meantime? Registering custom drivers would be awesome indeed!