[proposal] Python API enhancement for compatibility of Giraph

yecol commented 2 years ago

Here is a revision proposal for the Python API. It aims to:

- support Java Giraph apps, by introducing customizable input/output parsers and formats;
- hide the concept of context, by always generating a new graph after an application runs.

# class path

# upload the jars/gars to k8s
sess.add_lib("xxx.jar")

# Loading graph
# Leave add_vertices and add_edges for property graphs
# Use load_from? e.g.,
# formatter parser:
# giraph:com.p2pvif
# gar:xyb.gar:ClassName
g1 = sess.load_from(vertices="p2p.v", vformat="giraph:com.P2PVIF", edges="p2p.e", eformat="giraph:com.P2PEIF")

# vertices can be omitted when only an edge file is available.
g2 = sess.load_from(edges="p2p.e", eformat="giraph:com.P2PEIF")

# adj format stores edges along with the vertices.
g3 = sess.load_from(vertices="p2p.v", vformat="giraph:com.P2PADJ")

# enum class Fmt provides many built-in formatters TODO
g4 = sess.load_from(vertices="p2p.v", vformat=Fmt.ADJVidFloatVidInt)

# scanf formatter TODO
g5 = sess.load_from(vertices="p2p.v", vformat="%d %d", edges="p2p.e", eformat="{src:int} {dst:int} {name:string}")  

# computation
giraph_sssp = load_app(algo="giraph:com.alibaba.graphscope.example.bfs.BFS")

# a loaded graph can be used directly by the giraph app
# an implicit conversion triggers if g is a property graph.
# for apps, always return a new graph g
new_graph = giraph_sssp(g1, src=6)

partition_app_name = load_app("gar_partitioner:xxx.gar:xxx")
# hence partitioner is a special kind of app.
g6 = partition_app_name(g5, partitioner="2d")
g7 = partition_app_name(g6, partitioner="HashingPartitioner")

# move to_xxx methods to graph, rather than context. 
new_graph.to_numpy('v.data')
# hence the selector can be used to output the graph elements.
new_graph.to_numpy('v.id')
# no longer has r selector

# for apps that return void (i.e., do not modify the graph)
# or that generate tensors, keep the context and the r selector
ctx = pattern_match(g, pattern="xxx")
ctx.to_tensor("")

# unify all output under the existing save_to method
# existing method for serialization
new_graph.save_to("file://serialization_path")
# to output with a customized formatter
new_graph.save_to("file:///tmp/path", format="giraph:xxx")
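
The formatter strings above (`giraph:<class>`, `gar:<file>:<class>`, and the scanf-style templates) could be dispatched by a small parser. The following is only a sketch of one possible interpretation, not part of the proposal: `FormatSpec`, `parse_format_spec`, and `parse_line` are hypothetical names, the rule that any unprefixed string is a template is an assumption, and only the `{name:type}` template form is handled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormatSpec:
    """Hypothetical parsed form of a vformat/eformat string."""
    kind: str                         # "giraph", "gar", or "template"
    class_name: Optional[str] = None  # Java class implementing the format
    gar_file: Optional[str] = None    # archive that contains the class
    template: Optional[str] = None    # scanf-style template text


def parse_format_spec(spec: str) -> FormatSpec:
    """Split a formatter string into its parts (assumed grammar)."""
    if spec.startswith("giraph:"):
        class_name = spec[len("giraph:"):]
        if not class_name:
            raise ValueError(f"missing class name in {spec!r}")
        return FormatSpec("giraph", class_name=class_name)
    if spec.startswith("gar:"):
        parts = spec.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected gar:<file>:<class>, got {spec!r}")
        return FormatSpec("gar", class_name=parts[2], gar_file=parts[1])
    # Assumption: anything without a known prefix is a line template.
    return FormatSpec("template", template=spec)


_CASTS = {"int": int, "float": float, "string": str}
_FIELD = re.compile(r"\{(\w+):(\w+)\}")


def parse_line(template: str, line: str) -> dict:
    """Parse one text line against a "{name:type} ..." template."""
    pattern, fields, pos = [], [], 0
    for match in _FIELD.finditer(template):
        # Literal text between fields must match exactly.
        pattern.append(re.escape(template[pos:match.start()]))
        pattern.append(r"(\S+)")
        fields.append((match.group(1), _CASTS[match.group(2)]))
        pos = match.end()
    pattern.append(re.escape(template[pos:]))
    found = re.fullmatch("".join(pattern), line.strip())
    if found is None:
        raise ValueError(f"line {line!r} does not match {template!r}")
    return {name: cast(raw) for (name, cast), raw in zip(fields, found.groups())}


vertex_format = parse_format_spec("giraph:com.P2PVIF")
edge_record = parse_line("{src:int} {dst:int} {name:string}", "1 2 knows")
```

Keeping the prefix dispatch separate from the template parsing means a built-in enum such as `Fmt` could map onto the same `FormatSpec` values later, if the proposal goes that way.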

yecol commented 2 years ago

This summarizes your proposal and the discussion. Do you have anything to add? @zhanglei1949

zhanglei1949 commented 2 years ago

> This summarizes your proposal and the discussion. Do you have anything to add? @zhanglei1949

I think your summary is quite clear; it looks good to me.

siyuan0322 commented 2 years ago

I have some questions that may need to be clarified:

Should all apps, except those that return a tensor context, return a graph, and not only the apps with a vertex data context?
We could do this in almost pure Python: extend run_app with an additional select and add_column operation, then return the extended graph. (Should it also be projected to a simple graph? I think that is not necessary, since we have hidden the definition of the projected graph from the user and there is an implicit project operation before running an app.)

Apps may produce more than one result column, such as hits. How should we handle a case like the following?

g1 = hits(g)  # results have 2 columns (properties): auth, hub 
g2 = sssp(g1) # Cannot project because vertices have 2 properties.
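
To make the concern concrete, here is a toy model in pure Python. It is not GraphScope code: the dict-based graph, `run_app_as_graph`, `project_to_simple`, and `toy_hits` are all hypothetical stand-ins, and the projection rule (at most one vertex property) is an assumption drawn from the comment above.

```python
from typing import Callable, Dict

# Toy model: vertex id -> {property name: value}
Graph = Dict[int, Dict[str, float]]
# App result: column name -> {vertex id: value}
Columns = Dict[str, Dict[int, float]]


def run_app_as_graph(graph: Graph, app: Callable[[Graph], Columns]) -> Graph:
    """Run an app, then add each result column to a copy of the graph."""
    columns = app(graph)
    new_graph = {vid: dict(props) for vid, props in graph.items()}
    for name, values in columns.items():
        for vid, value in values.items():
            new_graph[vid][name] = value
    return new_graph


def project_to_simple(graph: Graph) -> Dict[int, float]:
    """Implicit projection: only defined for at most one vertex property."""
    names = sorted({name for props in graph.values() for name in props})
    if len(names) > 1:
        raise ValueError(f"cannot project: vertices have properties {names}")
    if not names:
        return {vid: 0.0 for vid in graph}
    return {vid: props.get(names[0], 0.0) for vid, props in graph.items()}


def toy_hits(graph: Graph) -> Columns:
    """Stand-in for hits: produces two result columns, auth and hub."""
    return {
        "auth": {vid: 1.0 for vid in graph},
        "hub": {vid: 1.0 for vid in graph},
    }


g = {1: {}, 2: {}}
g1 = run_app_as_graph(g, toy_hits)  # every vertex now has auth and hub

try:
    project_to_simple(g1)
    projection_error = None
except ValueError as err:
    projection_error = str(err)     # the chained app cannot pick a property
```

In this model the first app succeeds and returns an extended graph, but the implicit projection before the second app has no way to choose between `auth` and `hub`.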

Based on the type of context, we have different selector formats. Since apps will now return (property) graphs, should we keep only the selector format of the most complex one, the one associated with the labeled-property-context, such as v:label.prop and r:label.prop?
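
For reference, the selector forms that appear in this thread (`v.data`, `v.id`, `r`, `v:label.prop`, `r:label.prop`) can be told apart with a few lines of parsing. This is a sketch under assumptions: `Selector` and `parse_selector` are hypothetical names, only the forms quoted in this thread are handled, and labels are assumed not to contain dots.

```python
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    """Hypothetical parsed selector."""
    kind: str             # "v" (vertex) or "r" (result)
    label: Optional[str]  # set only for the labeled form
    prop: Optional[str]   # property name, or None for a bare "r"


def parse_selector(selector: str) -> Selector:
    """Parse either the simple form (v.data) or the labeled form (v:label.prop)."""
    head, colon, rest = selector.partition(":")
    if colon:
        # Labeled form: <kind>:<label>.<prop>
        kind = head
        label, dot, prop = rest.partition(".")
        if not (label and dot and prop):
            raise ValueError(f"expected <kind>:<label>.<prop>, got {selector!r}")
    else:
        # Simple form: <kind>.<prop>, or a bare <kind> such as "r"
        kind, dot, prop = selector.partition(".")
        label = None
        if dot and not prop:
            raise ValueError(f"missing property in {selector!r}")
    if kind not in ("v", "r"):
        raise ValueError(f"unknown selector kind in {selector!r}")
    return Selector(kind, label, prop or None)


simple = parse_selector("v.data")
labeled = parse_selector("v:label.prop")
```

If only the labeled form were kept, the simple branch would go away and every selector would carry a label, which is the trade-off the question above is asking about.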

alibaba / GraphScope

[proposal] Python API enhancement for compatibility of Giraph #1329