apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0
214 stars 46 forks source link

[Feat][Spark] Replace scala.Option[Long] in constructor of VertexWriter due to incompatibility with py4j #313

Closed SemyonSinchenko closed 8 months ago

SemyonSinchenko commented 8 months ago

Is your feature request related to a problem? Please describe. Currently the scala.Option[Long] is used in constructor of VertexWriter. It is good from the scala point of view but there is a problem when you try to call this constructor from python with py4j. The reason is that py4j make autoboxing and autounboxing of Java primitives. Let's imagine we have py4j.java_gateway.JVMView object in python, that is name jvm. Even if you try to pass into constructor something like jvm.scala.Some(jvm.java.lang.Long.valueOf(100)) it will run the instruction in parentheses first, get java.lang.Long as a result, make autounboxing it into python int and pass int into Some with autoboxing it to java.lang.Integer that produce Some[Int], not Some[Long]. And, because of using Option in constructor, py4j cannot understand what is expected type here and cannot make autoboxing of int into jvm.java.lang.Long. I tried different tricks, but each time I'm getting java.lang.Integer cannot be cast to java.lang.Long. There is a similar discussion in py4j issues, which is open from 2019 and it looks like there is no solution. So, currently, the only option to make a VertexWriter from python is pass jvm.scala.Option.empty() and no way to pass optional parameter numVertices.

Describe the solution you'd like Make tow constructor of VertexWriter. The first one like VertexWriter(prefix: String, vertexInfo: VertexInfo, vertexDf: DataFrame, numVertices: Long) and the second one like VertexWriter(prefix: String, vertexInfo: VertexInfo, vertexDf: DataFrame).

Describe alternatives you've considered Alternative solution is to add an additional method into VertexInfo companion object, like createFromPy4j(prefix: String, vertexInfo: VertexInfo, vertexDf: DataFrame, numVertices: Long): VertexInfo. It allows to decide on the python side, which one to call: if python user provide numVertices than call a fabric-method from companion object, otherwise call constructor with jvm.scala.Option.empty().

Additional context I can do it by myself. I'm sure, that no additional changes are required and all spark scala API will be the same.

P.S. I understand, that such a way is not a "scala way", but in my experience each spark library sooner or later face the question of creating PySpark bindings. It looks OK for me to sacrifice some scala code beauty to get better compatibility with PySpark...

acezen commented 8 months ago

I think we can discard the Option in VertexWriter constructor, and just pass a Long like

class VertexWriter(
    prefix: String,
    vertexInfo: VertexInfo,
    vertexDf: DataFrame,
    numVertices: Long = -1
) {
   private val vertexNum: Long = numVertices match {
       case x if x < 0 =>vertexDf.count()
       case _ => numVertices
}

The solution make sense since we do not want user to pass a negative value as vertex num. What do you think?

SemyonSinchenko commented 8 months ago

I think we can discard the Option in VertexWriter constructor, and just pass a Long like

class VertexWriter(
    prefix: String,
    vertexInfo: VertexInfo,
    vertexDf: DataFrame,
    numVertices: Long = -1
) {
   private val vertexNum: Long = numVertices match {
       case x if x < 0 =>vertexDf.count()
       case _ => numVertices
}

The solution make sense since we do not want user to pass a negative value as vertex num. What do you think?

I like it! I can implement it, may you assign this task to me? I will update the code and the docstring, and I will check the documentation if it is used there. Thank you!