apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0

feat: Support `JSON` payload file format for GraphAr #170

Closed. acezen closed this issue 2 months ago.

acezen commented 1 year ago

Is your feature request related to a problem? Please describe. GraphAr is a graph file format that supports a variety of payload file formats, including CSV, Parquet, and ORC. However, it does not currently support the JSON payload file format. This issue proposes adding JSON support to GraphAr.

JSON is a lightweight data-interchange format. It is easy for humans to read and write, and it is widely used in graph datasets.

Describe the solution you'd like For the different libraries, we can have different implementations.
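
For the Spark library, for instance, the built-in Spark JSON data source already writes one JSON object per line, so a JSON payload chunk could be produced with the standard DataFrame writer. A minimal sketch of the idea (this is not GraphAr's actual implementation; the schema, values, and output path below are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object JsonPayloadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JSON payload sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A toy vertex chunk: an internal id plus two properties.
    val persons = Seq((0L, "Alice", 30), (1L, "Bob", 25)).toDF("id", "name", "age")

    // Spark's JSON data source writes one JSON object per line, which is the
    // kind of lightweight, human-readable payload this issue asks for.
    persons.write.mode("overwrite").json("/tmp/json_payload_sketch/person/chunk0")

    spark.stop()
  }
}
```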

Additional context This issue is part of issue https://github.com/alibaba/GraphAr/issues/74 and is a good first issue for beginners to get familiar with GraphAr.

amygbAI commented 4 months ago

Hi, are there any prerequisites to contribute to this? I would love to help. Just FYI, I have never contributed to any open source projects yet, but I am fascinated by graphs in general, hence my interest.

lixueclaire commented 4 months ago

> Hi, are there any prerequisites to contribute to this? I would love to help. Just FYI, I have never contributed to any open source projects yet, but I am fascinated by graphs in general, hence my interest.

Hi @amygbAI, Thanks for your interest in GraphAr! We welcome new contributors with open arms. For a good start, please check out our Getting Started (C++ library) and our Community page for how to join and contribute. If you have any questions, feel free to ask. Looking forward to your contribution!

amygbAI commented 4 months ago

Hi, I have finished the changes and I am also done with building the project (both the Maven and C++ parts of the code). No issues there. Sadly, I am unable to find any examples in the documentation for testing these changes. I can go through the code and figure it out, but do you folks have any example files I can use to test this?

acezen commented 4 months ago

> Hi, I have finished the changes and I am also done with building the project (both the Maven and C++ parts of the code). No issues there. Sadly, I am unable to find any examples in the documentation for testing these changes. I can go through the code and figure it out, but do you folks have any example files I can use to test this?

Hi @amygbAI, you can refer to the Spark example to generate a JSON-format Movie graph, and use that data to test your code.

amygbAI commented 4 months ago

Here's what I did so far:

1. In the incubator-graphar/maven-projects/spark/graphar/testing folder, I created a build.sbt:

```
name := "testing"

version := "0.1"

scalaVersion := "2.13.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1"
```

2. In src/main/scala, I used your test file like so:

```
import org.apache.spark.sql.SparkSession
import org.apache.graphar.graph.GraphWriter

object MainObject {
  // This is your main object
  def main(args: Array[String]): Unit = {
    // connect to the Neo4j instance
    val spark = SparkSession.builder()
      .appName("Neo4j to GraphAr for Movie Graph")
      .config("neo4j.url", "bolt://localhost:7687")
      .config("neo4j.authentication.type", "basic")
      .config("neo4j.authentication.basic.username", "neo4j")
      .config("neo4j.authentication.basic.password", "slayer#666")
      .config("spark.master", "local")
      .getOrCreate()

    // initialize a graph writer
    val writer: GraphWriter = new GraphWriter()

    // put movie graph data into writer
    readAndPutDataIntoWriter(writer, spark)

    // write in GraphAr format
    val outputPath: String = args(0)
    val vertexChunkSize: Long = args(1).toLong
    val edgeChunkSize: Long = args(2).toLong
    val fileType: String = args(3)

    writer.write(outputPath, spark, "MovieGraph", vertexChunkSize, edgeChunkSize, fileType)
  }
}
```

When I ran `sbt run` from the testing folder, I got the following errors:

[error] /datadrive/GRAPH_AR/incubator-graphar/maven-projects/spark/graphar/testing/src/main/scala/test_170_json_read_write.scala:2:19: object graphar is not a member of package org.apache
[error] import org.apache.graphar.graph.GraphWriter

So I thought I might need to rebuild the Scala packages again and went to

incubator-graphar/maven-projects/spark/graphar and ran `mvn -X clean install`

and ran into the following errors

[ERROR] Failed to execute goal on project graphar-commons: Could not resolve dependencies for project org.apache.graphar:graphar-commons:jar:0.1.0-SNAPSHOT: Could not find artifact org.apache.graphar:graphar-datasources:jar:0.1.0-SNAPSHOT

So the bottom line is that unless I can include the correct jar file, I doubt I will be able to test anything (and the jar file isn't getting compiled, thanks to all the issues above).

My guess is that I am missing something fundamental here.

acezen commented 4 months ago

> So I thought I might need to rebuild the Scala packages again and went to

Hi @amygbAI, you need to run `mvn -X clean install` in the maven-projects/spark folder; that will compile and install the graphar-datasources and graphar-commons packages. Then you can run the dataset generator, for example the script run-neo4j2graphar.sh.
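
If the sbt test project from the earlier comment is kept around, the locally installed snapshot can then be picked up from the local Maven repository. A hedged sketch of the adjusted build.sbt (the artifact coordinates come from the Maven error above; the Scala and Spark versions are guesses based on this thread and may need adjusting):

```scala
name := "testing"

version := "0.1"

// The GraphAr Spark artifacts in this thread appear to be built for Scala 2.12,
// so the test project should use a matching Scala version rather than 2.13.
scalaVersion := "2.12.10"

// Pick up the snapshot that `mvn clean install` places into the local ~/.m2 repository.
resolvers += Resolver.mavenLocal

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.1",
  "org.apache.graphar" % "graphar-commons" % "0.1.0-SNAPSHOT"
)
```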

amygbAI commented 4 months ago

Thanks, and sorry, but I am still getting some errors (Scala version 2.12.10 and JDK 8):

Run starting. Expected test count is: 21
GraphInfoSuite:
- load graph info *** FAILED ***
  java.lang.NullPointerException:
  at org.apache.graphar.GraphInfoSuite.$anonfun$new$1(TestGraphInfo.scala:35)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
  at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
  at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
  at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1562)
  ...
- load vertex info *** FAILED ***
  java.lang.NullPointerException:
  at org.apache.graphar.GraphInfoSuite.$anonfun$new$2(TestGraphInfo.scala:61)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
  at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
  at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
  at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1562)
  ...
- load edge info *** FAILED ***
  java.lang.NullPointerException:
  at org.apache.graphar.GraphInfoSuite.$anonfun$new$8(TestGraphInfo.scala:140)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
  at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
  at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
  at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1562)
  ...
- == of Property/PropertyGroup/AdjList
TransformExampleSuite:
- transform file type *** FAILED ***
  java.lang.NullPointerException:
  at org.apache.graphar.TransformExampleSuite.$anonfun$new$1(TransformExample.scala:39)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
..............................
Run completed in 2 seconds, 977 milliseconds.
Total number of tests run: 21
Suites: completed 10, aborted 0
Tests: succeeded 1, failed 20, canceled 0, ignored 0, pending 0
*** 20 TESTS FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for spark 0.1.0-SNAPSHOT:
[INFO]
[INFO] spark .............................................. SUCCESS [  0.746 s]
[INFO] graphar-datasources ................................ SUCCESS [ 25.872 s]
[INFO] graphar-commons .................................... FAILURE [ 59.063 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:25 min
[INFO] Finished at: 2024-05-20T13:56:23Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (test) on project graphar-commons: There are test failures -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (test) on project graphar-commons: There are test failures

acezen commented 4 months ago

How about using `mvn clean install -DskipTests -P ${1:-'datasources-32'}` to compile the Spark library?

Or you can refer to the action to see how CI builds the project and runs the tests.

acezen commented 4 months ago

Hi @amygbAI, I have added a helper example that generates testing LDBC sample data from the original CSV into GraphAr; this may help you generate testing data in JSON. Feel free to ask if you have any problems generating the testing data with the example.
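
For anyone following along, a rough sketch of what such a CSV-to-JSON generator might look like, assuming the pipe-delimited LDBC sample CSVs used in GraphAr's test data and the PutVertexData / PutEdgeData methods shown in the GraphAr Spark examples (the file paths, labels, chunk sizes, and exact method signatures here are illustrative and may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.graphar.graph.GraphWriter

object Csv2GraphArJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LDBC sample CSV to GraphAr JSON")
      .master("local[*]")
      .getOrCreate()

    // LDBC sample data ships as pipe-delimited CSV files with a header row.
    val persons = spark.read
      .option("delimiter", "|")
      .option("header", "true")
      .csv("/path/to/ldbc_sample/person_0_0.csv")
    val knows = spark.read
      .option("delimiter", "|")
      .option("header", "true")
      .csv("/path/to/ldbc_sample/person_knows_person_0_0.csv")

    // Put the vertex and edge tables into the writer, then write everything
    // out with "json" as the payload file type proposed in this issue.
    val writer = new GraphWriter()
    writer.PutVertexData("person", persons)
    writer.PutEdgeData(("person", "knows", "person"), knows)
    writer.write("/tmp/ldbc_sample_json", spark, "ldbc_sample", 1024L, 1024L, "json")

    spark.stop()
  }
}
```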

amygbAI commented 4 months ago

Thanks so much for sticking with me on this one 👍. I was able to test and created the pull request. Though I must point out that Neo4j only works with JDK 17 and 21, so to export the example CSV and JSON I had to resort to some antics, switching the current JDK version and then separately testing out the changes. Maybe the creators/maintainers of the project already have this on their roadmap. If it's already handled, kindly update the document that gives the steps to build and test the spark folder.