Open anilpacaci opened 7 years ago
Here we will rely on HadoopGraph and we will use ScriptInputFormat. Key is to create an adjacency list representation of the graph data generated by LDBC SNB Datagen. Here are the steps;
We need to agree on common format for our adjacency list format, meaning that which properties needs to appear in the adjacency list format. Since each vertex type has varying number of properties, it is better to omit vertex properties from the adjacency files. On the other hand, edges will only have labels, and timestamps in some rare cases. So we can just keep edge information in adjacency list.
We will be using Script IO Format documentation. Therefore we have little flexibility about the format we are using. My recommendation is a format where each line represent a vertex and all of its outgoing edges in following format:
sourcevertexid:label neighbour1id:vertexlabel:edgelabel:timestamp neighbour2id:vertexlabel:edgelabel:timestamp
But the problem is right now we have all relations is separate CSV files, meaning that all the knows
edges of one person will be in one file, likes
edges will be in another file. So we will need to join these CSV files on source vertex id, so that we can aggregate all the edges of one source vertex at one place. You can find an example small dataset here.
Now what we need to do is use SparkSQL. We will load all edge files as separate CSV files (spark sql allows csv files to be loaded directly, you do not have to worry about writing a parser) Then join over all these tables, generate the adjacency list and export one huge single file. For now, we can just do it using Spark running on small machine on the sample dataset I provided, then same code can be simply moved to cluster to deal with larger datasets.
``
Guys @z56yu @weiliam , you might want to check this Spark library. It makes it super easier to load entire csv file as a dataframe (SparkSQL's table). Then it is just a matter of join - group by sourceid
For now, we use transactional data loading mechanisms of JanusGraph (Titan). For large datasets, this approach is impractical and we should use Spark or Giraph based BulkLoaderVertexProgram