anilpacaci / streaming-graph-partitioning

Experimental Setup for Performance Analysis of Streaming Algorithms
Apache License 2.0

Data Loading Mechanism #6

Open anilpacaci opened 7 years ago

anilpacaci commented 7 years ago

For now, we use the transactional data loading mechanism of JanusGraph (Titan). For large datasets this approach is impractical, so we should switch to the Spark- or Giraph-based BulkLoaderVertexProgram.
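
As a rough sketch, driving BulkLoaderVertexProgram through SparkGraphComputer could look like the snippet below. This is only an outline, not the repo's actual loader: the properties file paths are placeholders, and the read-side HadoopGraph configuration is assumed to exist.

```scala
import org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory

object BulkLoad {
  def main(args: Array[String]): Unit = {
    // HadoopGraph that reads the input data (properties file name is a placeholder)
    val readGraph = GraphFactory.open("conf/hadoop-load.properties")

    // Bulk loader that writes into the JanusGraph/Cassandra backend (placeholder path)
    val blvp = BulkLoaderVertexProgram.build()
      .writeGraph("conf/janusgraph-cassandra.properties")
      .create(readGraph)

    // Run the vertex program on Spark instead of loading transactionally
    readGraph.compute(classOf[SparkGraphComputer]).program(blvp).submit().get()
    readGraph.close()
  }
}
```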

anilpacaci commented 7 years ago

Here we will rely on HadoopGraph and use ScriptInputFormat. The key is to create an adjacency list representation of the graph data generated by LDBC SNB Datagen. Here are the steps (a Spark sketch of steps 1-3 follows the list):

  1. Load all edge data into Spark RDDs.
  2. Join this edge data by source vertex ID and combine the target vertex IDs into a list, so that there is a single adjacency list representation for the entire edge data.
  3. Write out this adjacency list to HDFS.
  4. Write a small script for ScriptInputFormat that reads a line from these adjacency lists and creates the vertex for that line.
  5. Configure HadoopGraph with ScriptInputFormat and CassandraInputFormat.
  6. Then simply running BulkLoaderVertexProgram will read these adjacency files and populate the Cassandra backend.
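
A minimal Spark sketch of steps 1-3, assuming plain pipe-separated edge files and placeholder HDFS paths (the real Datagen layout may differ):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AdjacencyListBuilder {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("adjacency-list"))

    // 1. Load edge data; each line is assumed to be "sourceId|targetId"
    val edges = sc.textFile("hdfs:///ldbc/edges/*.csv")
      .map(_.split('|'))
      .map(parts => (parts(0), parts(1)))

    // 2. Group target vertex IDs by source vertex ID: one adjacency list per source vertex
    val adjacency = edges.groupByKey()
      .map { case (src, targets) => src + " " + targets.mkString(" ") }

    // 3. Write the adjacency list representation back to HDFS
    adjacency.saveAsTextFile("hdfs:///ldbc/adjacency")

    sc.stop()
  }
}
```
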
anilpacaci commented 7 years ago

We need to agree on a common format for our adjacency list, i.e. which properties need to appear in it. Since each vertex type has a varying number of properties, it is better to omit vertex properties from the adjacency files. On the other hand, edges only have labels, and timestamps in some rare cases, so we can keep just the edge information in the adjacency list.

We will be following the Script I/O Format documentation, therefore we have little flexibility about the format we use. My recommendation is a format where each line represents a vertex and all of its outgoing edges, as follows:

```
sourcevertexid:label neighbour1id:vertexlabel:edgelabel:timestamp neighbour2id:vertexlabel:edgelabel:timestamp
```

But the problem is that right now all relations are in separate CSV files, meaning that all the knows edges of one person will be in one file, the likes edges will be in another file, and so on. So we will need to join these CSV files on the source vertex ID, so that we can aggregate all the edges of one source vertex in one place. You can find an example small dataset here.

What we need to do now is use Spark SQL. We will load all edge files as separate CSV tables (Spark SQL allows CSV files to be loaded directly, so you do not have to worry about writing a parser), then join over all these tables, generate the adjacency list, and export one huge single file. For now, we can do this with Spark running on a small machine against the sample dataset I provided; the same code can then simply be moved to the cluster to deal with larger datasets.
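
A minimal Spark SQL sketch of this join is below. The file names, column names (src, dst, creationDate) and vertex/edge labels are hypothetical placeholders, not the actual LDBC SNB headers, and it assumes the built-in CSV reader of newer Spark versions (on older versions an external CSV library plays the same role):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SnbAdjacencyList {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("snb-adjacency-list").getOrCreate()

    // Load one DataFrame per relation file and format each edge as
    // "neighbourid:vertexlabel:edgelabel:timestamp"
    def loadEdges(path: String, vertexLabel: String, edgeLabel: String) =
      spark.read.option("header", "true").option("delimiter", "|").csv(path)
        .select(
          col("src").as("sourceId"),
          concat_ws(":", col("dst"), lit(vertexLabel), lit(edgeLabel), col("creationDate")).as("neighbour"))

    val knows = loadEdges("person_knows_person.csv", "person", "knows")
    val likes = loadEdges("person_likes_post.csv", "post", "likes")

    // Combine all relations of the same source vertex into a single adjacency-list line
    val adjacency = knows.union(likes)
      .groupBy("sourceId")
      .agg(concat_ws(" ", collect_list(col("neighbour"))).as("neighbours"))
      .select(concat_ws(" ", concat_ws(":", col("sourceId"), lit("person")), col("neighbours")))

    // Export a single text file
    adjacency.coalesce(1).write.text("adjacency_list")

    spark.stop()
  }
}
```

Note that coalesce(1) forces a single output file, which is only reasonable for the small sample dataset; on the cluster we would drop it and let Spark write one part file per partition.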


anilpacaci commented 7 years ago

Guys @z56yu @weiliam, you might want to check out this Spark library. It makes it super easy to load an entire CSV file as a DataFrame (a Spark SQL table). Then it is just a matter of a join and a group-by on the source vertex ID.