Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

writeArrowFeather not working with nested type ? #271

Open phodal opened 1 year ago

phodal commented 1 year ago

Hi, in my case, I want to create an Arrow file on the client side and then pass it to the server side. But when I run writeArrowFeather, it throws an IndexOutOfBoundsException:

Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 31393, length: 2320 (expected: range(0, 32768))
    at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
    at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:765)
    at org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1244)
    at org.apache.arrow.vector.BaseVariableWidthVector.set(BaseVariableWidthVector.java:1059)
    at org.apache.arrow.vector.VarCharVector.set(VarCharVector.java:255)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl$infillVector$1.invoke(ArrowWriterImpl.kt:111)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl$infillVector$1.invoke(ArrowWriterImpl.kt:111)
    at org.jetbrains.kotlinx.dataframe.api.ForEachKt.forEachIndexed(forEach.kt:34)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.infillVector(ArrowWriterImpl.kt:111)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.allocateVectorAndInfill(ArrowWriterImpl.kt:197)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.allocateVectorSchemaRoot(ArrowWriterImpl.kt:223)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriter$DefaultImpls.writeArrowFeather(ArrowWriter.kt:114)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.writeArrowFeather(ArrowWriterImpl.kt:61)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriter$DefaultImpls.writeArrowFeather(ArrowWriter.kt:125)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.writeArrowFeather(ArrowWriterImpl.kt:61)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriter$DefaultImpls.writeArrowFeather(ArrowWriter.kt:133)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.writeArrowFeather(ArrowWriterImpl.kt:61)
    at org.jetbrains.kotlinx.dataframe.io.ArrowWritingKt.writeArrowFeather(arrowWriting.kt:89)
    at com.phodal.chapi.arrow.MainKt.main(Main.kt:26)
    Suppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (33024)
Allocator(ROOT) 0/33024/264192/9223372036854775807 (res/actual/peak/limit)

        at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437)
        at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29)
        at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.close(ArrowWriterImpl.kt:247)
        at kotlin.jdk7.AutoCloseableKt.closeFinally(AutoCloseable.kt:64)
        at org.jetbrains.kotlinx.dataframe.io.ArrowWritingKt.writeArrowFeather(arrowWriting.kt:88)
        ... 1 more

FAILURE: Build failed with an exception.

Here is my demo code with the writer and some debug output:

val dataFrame = DataFrame.read("https://raw.githubusercontent.com/phodal-archive/apache-arrow-chapi-demo/master/data/0_codes.json")
dataFrame.schema().print()

val toArrowSchema = dataFrame.columns().toArrowSchema()
println(toArrowSchema.toJson())

dataFrame.writeArrowFeather(File("codes.arrow"))

When I debug, dataFrame.schema().print() prints the correct schema:

NodeName: String
Module: String
Type: String
Package: String?
FilePath: String
Fields: *
    TypeType: String
    TypeKey: String
    Modifiers: List<String>
    TypeValue: String?
    Annotations: *
        Name: String
        KeyValues: *
            Key: String
            Value: String

Implements: List<String>
Functions: *
    Name: String
    Package: String?
    ReturnType: String
    Parameters: *
        TypeValue: String
        TypeType: String
    FunctionCalls: *
        Package: String?
        NodeName: String?
        FunctionName: String
        Position:
            StartLine: Int
            StartLinePosition: Int
            StopLine: Int
            StopLinePosition: Int
        Parameters: *
            TypeValue: String
            TypeType: String
        Type: String?
    Position:
        StartLine: Int
        StartLinePosition: Int?
        StopLine: Int
        StopLinePosition: Int?
    LocalVariables: *
        TypeValue: String
        TypeType: String
    IsConstructor: Boolean?
    Annotations: *
        Name: String
        KeyValues: *
            Key: String
            Value: String

Imports: *
    Source: String
    AsName: String
Position:
    StartLine: Int?
    StopLine: Int?
    StartLinePosition: Int?
    StopLinePosition: Int?
Annotations: *
    Name: String

But in the output of dataFrame.columns().toArrowSchema(), the types are wrong, and every nested column is mapped to utf8:

{
  "fields" : [ {
    "name" : "NodeName",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Module",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Type",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Package",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "FilePath",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Fields",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Implements",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Functions",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Imports",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Position",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Annotations",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  } ]
}

Am I missing something?

koperagen commented 1 year ago

No, you're right - nested types are not yet supported. :( Interesting data you've got there.

phodal commented 1 year ago

Thanks for sharing. Is there any plan for it? Or should I just try to modify AnyCol.toArrowField to implement it?

koperagen commented 1 year ago

Honestly, I overlooked that our Arrow support is missing nested types, so this improvement isn't planned. Right now the team is occupied with improvements to the documentation and the notebook experience. I don't think anybody is going to work on Arrow in the coming weeks. You can submit a PR if you want, but apart from toArrowField, the actual writing in infillVector will also need to be modified: https://github.com/Kotlin/dataframe/blob/master/dataframe-arrow/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/ArrowWriterImpl.kt

phodal commented 1 year ago

Thank you, I will try to find a solution.

Kopilov commented 1 year ago

IndexOutOfBoundsException: index: 31393, length: 2320 (expected: range(0, 32768)) is an unexpected error, and I am working on it (I just got the same one in my project). It happens because the VariableWidthVector (where a String column is saved) does not know its actual size.
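The numbers in the exception message show the overflow directly: a write of 2320 bytes starting at offset 31393 would end at 33713, past the 32768-byte capacity. Below is a minimal sketch of that bounds check, assuming a simplified fixed-capacity buffer. FixedBuffer and capacityFor are hypothetical names for illustration, not Arrow's or dataframe's actual classes.

```kotlin
// Simplified stand-in for a fixed-capacity data buffer; NOT Arrow's ArrowBuf.
class FixedBuffer(val capacity: Int) {
    private val data = ByteArray(capacity)

    // Rejects writes that would run past the end, like the check in the stack trace.
    fun setBytes(index: Int, bytes: ByteArray) {
        if (index < 0 || index + bytes.size > capacity) {
            throw IndexOutOfBoundsException(
                "index: $index, length: ${bytes.size} (expected: range(0, $capacity))"
            )
        }
        bytes.copyInto(data, destinationOffset = index)
    }
}

// Hypothetical helper: doubles the capacity until `required` bytes fit.
fun capacityFor(required: Int, initial: Int = 32768): Int {
    var capacity = initial
    while (capacity < required) capacity *= 2
    return capacity
}

fun main() {
    val index = 31393    // write offset from the exception message
    val length = 2320    // size of the value being written
    val capacity = 32768 // buffer capacity from the exception message

    // The write would end at index + length, which is beyond the capacity.
    println("end of write: ${index + length}, capacity: $capacity")

    val failed = runCatching { FixedBuffer(capacity).setBytes(index, ByteArray(length)) }
    println(failed.exceptionOrNull()?.message)

    // Sizing the buffer for the real data first lets the same write succeed.
    FixedBuffer(capacityFor(index + length)).setBytes(index, ByteArray(length))
}
```

In Arrow's Java API, the setSafe variants on variable-width vectors reallocate before writing, while set (the call in the stack trace) does not. Whether the eventual fix takes that route is not stated in this thread.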

About nested types, @phodal, do you know of any other Java-based projects with Arrow support that could serve as an example? And what is your target Arrow schema (does it contain StructVector, ListVector, or any other)?

phodal commented 1 year ago

@Kopilov Sorry, I tried to do it, but it needs a lot of code. So I don't use dataframe with Arrow and just keep using JSON.

Kopilov commented 1 year ago

The exception is fixed in #350. Nested types are still not supported natively, but they should be saved correctly as strings.
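To illustrate what storing nested values as strings can look like, here is a rough sketch in plain Kotlin that turns nested lists and maps into JSON-like text so that every column holds a flat value. The render and flattenRow helpers and the sample row are made up for illustration, and the string format the library itself produces may differ.

```kotlin
// Hypothetical helper: renders a nested value as a JSON-like string.
fun render(value: Any?): String = when (value) {
    null -> "null"
    is String -> "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
    is Map<*, *> -> value.entries.joinToString(",", "{", "}") { (k, v) -> "\"$k\":${render(v)}" }
    is Iterable<*> -> value.joinToString(",", "[", "]") { render(it) }
    else -> value.toString()
}

// Keeps flat values as they are and replaces nested ones with their string form.
fun flattenRow(row: Map<String, Any?>): Map<String, Any?> =
    row.mapValues { (_, v) -> if (v is Map<*, *> || v is Iterable<*>) render(v) else v }

fun main() {
    // Made-up sample row; the column names follow the schema printed earlier in the thread.
    val row = mapOf(
        "NodeName" to "Main",
        "Implements" to listOf("Runnable"),
        "Position" to mapOf("StartLine" to 1, "StopLine" to 9),
    )
    println(flattenRow(row))
}
```

The trade-off is that the reader of the Arrow file gets utf8 columns and has to parse the nested structure back out on its own side.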