Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Kotlin Notebook crash when creating DataFrame from List<Map> #710

Open cmelchior opened 1 month ago

cmelchior commented 1 month ago

This code:

%use dataframe
val df = (1..1000).map {
  mapOf(
    "id" to it,
    "value" to "value$it"
  )
}.toDataFrame()

Crashes Kotlin Notebooks with:


The problem is found in one of the loaded libraries: check library converters (fields callbacks)
Error compiling code:
@DataSchema
interface _DataFrameType { }

val ColumnsContainer<_DataFrameType>.entries: DataColumn<kotlin.collections.Set<kotlin.collections.Map.Entry<K, V>>> @JvmName("_DataFrameType_entries") get() = this["entries"] as DataColumn<kotlin.collections.Set<kotlin.collections.Map.Entry<K, V>>>
val DataRow<_DataFrameType>.entries: kotlin.collections.Set<kotlin.collections.Map.Entry<K, V>> @JvmName("_DataFrameType_entries") get() = this["entries"] as kotlin.collections.Set<kotlin.collections.Map.Entry<K, V>>
val ColumnsContainer<_DataFrameType>.keys: DataColumn<kotlin.collections.Set<K>> @JvmName("_DataFrameType_keys") get() = this["keys"] as DataColumn<kotlin.collections.Set<K>>
val DataRow<_DataFrameType>.keys: kotlin.collections.Set<K> @JvmName("_DataFrameType_keys") get() = this["keys"] as kotlin.collections.Set<K>
val ColumnsContainer<_DataFrameType>.size: DataColumn<Int> @JvmName("_DataFrameType_size") get() = this["size"] as DataColumn<Int>
val DataRow<_DataFrameType>.size: Int @JvmName("_DataFrameType_size") get() = this["size"] as Int
val ColumnsContainer<_DataFrameType>.values: DataColumn<kotlin.collections.Collection<V>> @JvmName("_DataFrameType_values") get() = this["values"] as DataColumn<kotlin.collections.Collection<V>>
val DataRow<_DataFrameType>.values: kotlin.collections.Collection<V> @JvmName("_DataFrameType_values") get() = this["values"] as kotlin.collections.Collection<V>
(df as org.jetbrains.kotlinx.dataframe.DataFrame<*>).cast<_DataFrameType>()

Errors:
Line_6.jupyter.kts (4:110 - 111) Unresolved reference: K
Line_6.jupyter.kts (4:113 - 114) Unresolved reference: V
Line_6.jupyter.kts (4:177 - 250) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Set<Map.Entry<[Error type: Unresolved type for K], [Error type: Unresolved type for V]>>>
Line_6.jupyter.kts (4:243 - 244) Unresolved reference: K
Line_6.jupyter.kts (4:246 - 247) Unresolved reference: V
Line_6.jupyter.kts (5:90 - 91) Unresolved reference: K
Line_6.jupyter.kts (5:93 - 94) Unresolved reference: V
Line_6.jupyter.kts (5:156 - 217) Unchecked cast: Any? to Set<Map.Entry<[Error type: Unresolved type for K], [Error type: Unresolved type for V]>>
Line_6.jupyter.kts (5:211 - 212) Unresolved reference: K
Line_6.jupyter.kts (5:214 - 215) Unresolved reference: V
Line_6.jupyter.kts (6:78 - 79) Unresolved reference: K
Line_6.jupyter.kts (6:135 - 175) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Set<[Error type: Unresolved type for K]>>
Line_6.jupyter.kts (6:172 - 173) Unresolved reference: K
Line_6.jupyter.kts (7:58 - 59) Unresolved reference: K
Line_6.jupyter.kts (7:140 - 141) Unresolved reference: K
Line_6.jupyter.kts (8:113 - 131) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Int>
Line_6.jupyter.kts (10:87 - 88) Unresolved reference: V
Line_6.jupyter.kts (10:148 - 195) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Collection<[Error type: Unresolved type for V]>>
Line_6.jupyter.kts (10:192 - 193) Unresolved reference: V
Line_6.jupyter.kts (11:67 - 68) Unresolved reference: V
Line_6.jupyter.kts (11:160 - 161) Unresolved reference: V

org.jetbrains.kotlinx.jupyter.exceptions.ReplLibraryException: The problem is found in one of the loaded libraries: check library converters (fields callbacks)
    at org.jetbrains.kotlinx.jupyter.exceptions.CompositeReplExceptionKt.throwLibraryException(CompositeReplException.kt:52)
    at org.jetbrains.kotlinx.jupyter.codegen.FieldsProcessorImpl.process(FieldsProcessorImpl.kt:68)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$1.invoke(CellExecutorImpl.kt:98)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$1.invoke(CellExecutorImpl.kt:97)
    at org.jetbrains.kotlinx.jupyter.config.LoggingKt.catchAll(Logging.kt:77)
    at org.jetbrains.kotlinx.jupyter.config.LoggingKt.catchAll$default(Logging.kt:71)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:97)
    at org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:12)
    at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evaluateUserCode(ReplForJupyterImpl.kt:581)
    at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.access$evaluateUserCode(ReplForJupyterImpl.kt:136)
    at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:439)
    at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl$evalEx$1.invoke(ReplForJupyterImpl.kt:436)
    at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withEvalContext(ReplForJupyterImpl.kt:417)
    at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.evalEx(ReplForJupyterImpl.kt:436)
    at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:140)
    at org.jetbrains.kotlinx.jupyter.messaging.IdeCompatibleMessageRequestProcessor$processExecuteRequest$1$response$1$1.invoke(IdeCompatibleMessageRequestProcessor.kt:139)
    at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$Task.execute(JupyterExecutorImpl.kt:42)
    at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$executorThread$1.invoke(JupyterExecutorImpl.kt:82)
    at org.jetbrains.kotlinx.jupyter.execution.JupyterExecutorImpl$executorThread$1.invoke(JupyterExecutorImpl.kt:80)
    at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)
Caused by: org.jetbrains.kotlinx.jupyter.exceptions.ReplCompilerException: Line_6.jupyter.kts (4:110 - 111) Unresolved reference: K
Line_6.jupyter.kts (4:113 - 114) Unresolved reference: V
Line_6.jupyter.kts (4:177 - 250) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Set<Map.Entry<[Error type: Unresolved type for K], [Error type: Unresolved type for V]>>>
Line_6.jupyter.kts (4:243 - 244) Unresolved reference: K
Line_6.jupyter.kts (4:246 - 247) Unresolved reference: V
Line_6.jupyter.kts (5:90 - 91) Unresolved reference: K
Line_6.jupyter.kts (5:93 - 94) Unresolved reference: V
Line_6.jupyter.kts (5:156 - 217) Unchecked cast: Any? to Set<Map.Entry<[Error type: Unresolved type for K], [Error type: Unresolved type for V]>>
Line_6.jupyter.kts (5:211 - 212) Unresolved reference: K
Line_6.jupyter.kts (5:214 - 215) Unresolved reference: V
Line_6.jupyter.kts (6:78 - 79) Unresolved reference: K
Line_6.jupyter.kts (6:135 - 175) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Set<[Error type: Unresolved type for K]>>
Line_6.jupyter.kts (6:172 - 173) Unresolved reference: K
Line_6.jupyter.kts (7:58 - 59) Unresolved reference: K
Line_6.jupyter.kts (7:140 - 141) Unresolved reference: K
Line_6.jupyter.kts (8:113 - 131) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Int>
Line_6.jupyter.kts (10:87 - 88) Unresolved reference: V
Line_6.jupyter.kts (10:148 - 195) Unchecked cast: AnyCol /* = DataColumn<*> */ to DataColumn<Collection<[Error type: Unresolved type for V]>>
Line_6.jupyter.kts (10:192 - 193) Unresolved reference: V
Line_6.jupyter.kts (11:67 - 68) Unresolved reference: V
Line_6.jupyter.kts (11:160 - 161) Unresolved reference: V
    at org.jetbrains.kotlinx.jupyter.repl.impl.JupyterCompilerImpl.compileSync(JupyterCompilerImpl.kt:201)
    at org.jetbrains.kotlinx.jupyter.repl.impl.InternalEvaluatorImpl.eval(InternalEvaluatorImpl.kt:120)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:79)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$execute$1$result$1.invoke(CellExecutorImpl.kt:77)
    at org.jetbrains.kotlinx.jupyter.repl.impl.ReplForJupyterImpl.withHost(ReplForJupyterImpl.kt:758)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl.execute(CellExecutorImpl.kt:77)
    at org.jetbrains.kotlinx.jupyter.repl.execution.CellExecutor$DefaultImpls.execute$default(CellExecutor.kt:12)
    at org.jetbrains.kotlinx.jupyter.repl.impl.CellExecutorImpl$ExecutionContext.execute(CellExecutorImpl.kt:239)
    at org.jetbrains.kotlinx.dataframe.jupyter.Integration.execute(Integration.kt:77)
    at org.jetbrains.kotlinx.dataframe.jupyter.Integration.execute(Integration.kt:90)
    at org.jetbrains.kotlinx.dataframe.jupyter.Integration.updateAnyFrameVariable(Integration.kt:125)
    at org.jetbrains.kotlinx.dataframe.jupyter.Integration.access$updateAnyFrameVariable(Integration.kt:67)
    at org.jetbrains.kotlinx.dataframe.jupyter.Integration$onLoaded$4.invoke(Integration.kt:289)
    at org.jetbrains.kotlinx.dataframe.jupyter.Integration$onLoaded$4.invoke(Integration.kt:284)
    at org.jetbrains.kotlinx.jupyter.api.libraries.FieldHandlerFactory.createUpdateExecution$lambda$0(FieldHandlerFactory.kt:49)
    at org.jetbrains.kotlinx.jupyter.codegen.FieldsProcessorImplKt.executeEx(FieldsProcessorImpl.kt:95)
    at org.jetbrains.kotlinx.jupyter.codegen.FieldsProcessorImplKt.access$executeEx(FieldsProcessorImpl.kt:1)
    at org.jetbrains.kotlinx.jupyter.codegen.FieldsProcessorImpl.process(FieldsProcessorImpl.kt:47)
    ... 18 more

zaleslaw commented 1 month ago

Working solution

val df = (1..1000).toDataFrame {
    "id" from { it }
    "value" from { "value$it" }
}
zaleslaw commented 1 month ago

We need to check whether it behaves the same way in Gradle projects.

Jolanrensen commented 1 month ago

It works fine in Gradle projects, so we'll need to check what weird type inference is going on in the Jupyter integration...

It creates a DataFrame like:

entries                  keys         size  values
[id=1, value=value1]     [id, value]  2     [1, value1]
[id=2, value=value2]     [id, value]  2     [2, value2]
[id=3, value=value3]     [id, value]  2     [3, value3]
[id=4, value=value4]     [id, value]  2     [4, value4]
[id=5, value=value5]     [id, value]  2     [5, value5]
[id=6, value=value6]     [id, value]  2     [6, value6]
[id=7, value=value7]     [id, value]  2     [7, value7]
[id=8, value=value8]     [id, value]  2     [8, value8]
[id=9, value=value9]     [id, value]  2     [9, value9]
[id=10, value=value10]   [id, value]  2     [10, value10]
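
The four columns above (entries, keys, size, values) appear to mirror the public properties of kotlin.collections.Map itself, which would also explain why the generated accessors in the error mention the unresolved type parameters K and V. A minimal sketch using only the standard library, assuming each map is being read as an object rather than as a row:

```kotlin
fun main() {
    // One "row" as built in the original snippet.
    val row: Map<String, Any> = mapOf("id" to 1, "value" to "value1")

    // These are the public properties of Map; they line up with the
    // column names in the DataFrame shown above.
    println(row.entries) // [id=1, value=value1]
    println(row.keys)    // [id, value]
    println(row.size)    // 2
    println(row.values)  // [1, value1]
}
```

The printed values match the first row of the table, which suggests the map's properties, not its contents, became the columns.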

However, was this the solution you were looking for?

I suspect you want to create a dataframe with a column id and a column value, right? Then indeed @zaleslaw's solution works great.

A DataFrame is usually constructed column by column rather than row by row, because that is how its data is stored in memory. That's why all DataFrame creation methods are built the way they are. If you have a List<Map<String, T>> and you want each map to act as a row, you could write something like this:

val df = (1..1000).map {
    mapOf("id" to it, "value" to "value$it")
}.toDataFrame {
    source.map { it["id"] }.toColumn() into "id"
    source.map { it["value"] }.toColumn() into "value"
}
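
The same row-to-column pivot can be sketched with the standard library alone, which may help when the set of keys is not known up front. This is only an illustration: pivotToColumns is a hypothetical helper name, not part of the DataFrame API, and it assumes that a key missing from a row should become null in that column.

```kotlin
// Pivot row-shaped maps into column-shaped lists using only the standard library.
// `pivotToColumns` is a hypothetical helper, not a DataFrame library function.
fun pivotToColumns(rows: List<Map<String, Any?>>): Map<String, List<Any?>> {
    // Collect column names in first-seen order so the column order is stable.
    val names = LinkedHashSet<String>()
    rows.forEach { names.addAll(it.keys) }
    // Build one list per column; a row that lacks a key contributes null.
    return names.associateWith { name -> rows.map { it[name] } }
}

fun main() {
    val rows = (1..3).map { mapOf("id" to it, "value" to "value$it") }
    val columns = pivotToColumns(rows)
    println(columns) // {id=[1, 2, 3], value=[value1, value2, value3]}
}
```

Each entry of the resulting map is one named column, which is the shape the column-based creation methods expect.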

If you really have to construct a DataFrame row by row, you could in theory, but it would essentially mean creating many small DataFrames and concatenating them, like:

val df = (1..1000).map {
    mapOf("id" to it, "value" to "value$it")
}.map {
    dataFrameOf(header = it.keys, values = it.values)
}.concat()
cmelchior commented 1 month ago

Ah yeah. Good explanation. I just copied some code from ChatGPT as part of testing something else when I saw the crash. So I didn't realize that it did things slightly wrong.