alibaba / GraphScope

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba
https://graphscope.io
Apache License 2.0

[proposal] Python API enhancement for compatibility of Giraph #1329

Open yecol opened 2 years ago

yecol commented 2 years ago

Here is a revised proposal for the Python API, to improve compatibility with Giraph:

# class path

# upload the jars/gars to k8s
sess.add_lib("xxx.jar")

# Loading graph
# Leave add_vertices and add_edges for property graphs
# Use load_from? e.g.,
# formatter parser:
# giraph:com.p2pvif
# gar:xyb.gar:ClassName
g1 = sess.load_from(vertices="p2p.v", vformat="giraph:com.P2PVIF", edges="p2p.e", eformat="giraph:com.P2PEIF")

# vertices can be omitted when only an edge file is available.
g2 = sess.load_from(edges="p2p.e", eformat="giraph:com.P2PEIF")

# adj format stores edges along with the vertices.
g3 = sess.load_from(vertices="p2p.v", vformat="giraph:com.P2PADJ")

# enum class Fmt provides many built-in formatters TODO
g4 = sess.load_from(vertices="p2p.v", vformat=Fmt.ADJVidFloatVidInt)

# scanf formatter TODO
g5 = sess.load_from(vertices="p2p.v", vformat="%d %d", edges="p2p.e", eformat="{src:int} {dst:int} {name:string}")  

# computation
giraph_sssp = load_app(algo="giraph:com.alibaba.graphscope.example.bfs.BFS")

# a loaded graph can be used directly by the giraph app
# an implicit conversion is triggered if g is a property graph.
# apps always return a new graph g
new_graph = giraph_sssp(g1, src=6)

partition_app_name = load_app("gar_partitioner:xxx.gar:xxx")
# hence a partitioner is a special kind of app.
g6 = partition_app_name(g5, partitioner="2d")
g7 = partition_app_name(g6, partitioner="HashingPartitioner")

# move to_xxx methods to graph, rather than context. 
new_graph.to_numpy('v.data')
# hence the selector can be used to output the graph elements.
new_graph.to_numpy('v.id')
# the r selector is no longer needed here

# apps that return void (do not modify the graph)
# or generate tensors keep the context and the r selector
ctx = pattern_match(g, pattern="xxx")
ctx.to_tensor("")

# unify the output method to existing save_to
# existing method for serialization
new_graph.save_to("file://serialization_path")
# to output with a customized formatter
new_graph.save_to("file:///tmp/path", format="giraph:xxx")
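
The formatter strings used above (`giraph:com.P2PVIF`, `gar:xyb.gar:ClassName`, and the template `{src:int} {dst:int} {name:string}`) could be dispatched on the client side roughly as sketched below. This is only an illustration: `FormatSpec`, `parse_format_spec`, and `compile_template` are hypothetical names and not part of GraphScope, treating any string without a known prefix as a template is an assumption, and the `%d %d` form is not handled.

```python
import re
from typing import Callable, Dict, NamedTuple, Optional


class FormatSpec(NamedTuple):
    """Hypothetical parsed form of a vformat/eformat string."""
    kind: str                      # "giraph", "gar", or "template"
    target: str                    # class name, or the raw template text
    archive: Optional[str] = None  # gar file name, only for kind == "gar"


def parse_format_spec(spec: str) -> FormatSpec:
    """Split a formatter string into its prefix and payload."""
    if spec.startswith("giraph:"):
        return FormatSpec("giraph", spec[len("giraph:"):])
    if spec.startswith("gar:"):
        parts = spec.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected gar:<file>:<class>, got {spec!r}")
        _, archive, class_name = parts
        return FormatSpec("gar", class_name, archive)
    # Assumption: anything else is a scanf-like template.
    return FormatSpec("template", spec)


# Regex fragment and converter for each field type used in the proposal.
_TYPES = {
    "int": (r"-?\d+", int),
    "float": (r"-?\d+(?:\.\d+)?", float),
    "string": (r"\S+", str),
}
_FIELD = re.compile(r"\{(\w+):(\w+)\}")


def compile_template(template: str) -> Callable[[str], Dict[str, object]]:
    """Turn '{src:int} {dst:int}' into a function that parses one line."""
    converters = {}
    pattern = ""
    pos = 0
    for m in _FIELD.finditer(template):
        # Literal text between fields must match exactly.
        pattern += re.escape(template[pos:m.start()])
        name, type_name = m.group(1), m.group(2)
        regex, conv = _TYPES[type_name]
        pattern += f"(?P<{name}>{regex})"
        converters[name] = conv
        pos = m.end()
    pattern += re.escape(template[pos:])
    compiled = re.compile(pattern)

    def parse(line: str) -> Dict[str, object]:
        match = compiled.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"line does not match template: {line!r}")
        return {k: converters[k](v) for k, v in match.groupdict().items()}

    return parse


edge_parser = compile_template("{src:int} {dst:int} {name:string}")
edge = edge_parser("1 2 knows")
```

Keeping the prefix dispatch separate from the template compiler means new prefixes can be added without touching the line parser.
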

yecol commented 2 years ago

This summarizes your proposal and the discussion. Do you have anything to add? @zhanglei1949

zhanglei1949 commented 2 years ago

> This summarizes your proposal and the discussion. Do you have anything to add? @zhanglei1949

I think your summary is quite clear. Looks good to me.

siyuan0322 commented 2 years ago

I have some questions that may need clarification:

  1. Should all apps, except those that return a tensor context, return a graph, and not only the apps with a vertex data context?
  2. We could do this in almost pure Python: extend run_app with an additional select and add_column operation, then return the extended graph. (Should it also be projected to a simple graph? I think this is not necessary, since we have hidden the definition of the projected graph from the user and there is an implicit project operation before running an app.)
  3. Apps may produce more than one result column, such as hits. How should we handle a case like this?
    g1 = hits(g)  # results have 2 columns (properties): auth, hub 
    g2 = sssp(g1) # Cannot project because vertices have 2 properties.
  4. We have different selector formats depending on the type of context. Since apps will return (property) graphs, should we keep only the selector format of the most complex one, the one associated with the labeled-property-context, like v:label.prop and r:label.prop?
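
To make questions 2 and 3 concrete, here is a toy model in plain Python of "run the app, add its result columns, return a new graph". Everything in it is hypothetical (`ToyGraph`, `run_app`, `toy_hits`, and the `project` signature are illustrative names, not the GraphScope API), and `toy_hits` returns placeholder scores rather than real HITS results. It only shows why the implicit projection becomes ambiguous once an app has added two columns.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

Column = Dict[int, float]  # vertex id -> value


@dataclass
class ToyGraph:
    """Hypothetical stand-in for a property graph: ids plus named columns."""
    vertices: Tuple[int, ...]
    columns: Dict[str, Column] = field(default_factory=dict)

    def add_columns(self, new_columns: Dict[str, Column]) -> "ToyGraph":
        # Return a new graph so the input graph is left unchanged.
        merged = dict(self.columns)
        merged.update(new_columns)
        return ToyGraph(self.vertices, merged)

    def project(self, prop: Optional[str] = None) -> "ToyGraph":
        # The implicit projection only works with at most one property.
        if prop is None:
            if len(self.columns) > 1:
                raise ValueError(
                    f"cannot project implicitly, candidates: {sorted(self.columns)}"
                )
            prop = next(iter(self.columns), None)
        kept = {prop: self.columns[prop]} if prop is not None else {}
        return ToyGraph(self.vertices, kept)


def run_app(graph: ToyGraph, app: Callable[[ToyGraph], Dict[str, Column]]) -> ToyGraph:
    """Run an app and attach every result column to a new graph."""
    return graph.add_columns(app(graph))


def toy_hits(graph: ToyGraph) -> Dict[str, Column]:
    # Placeholder values; a real HITS app would compute these.
    return {
        "auth": {v: 1.0 for v in graph.vertices},
        "hub": {v: 0.5 for v in graph.vertices},
    }


g = ToyGraph((1, 2, 3))
g1 = run_app(g, toy_hits)  # g1 now has two columns: auth and hub

try:
    g1.project()           # ambiguous: which column should sssp see?
    projection_failed = False
except ValueError:
    projection_failed = True

g1_auth = g1.project("auth")  # an explicit choice resolves the ambiguity
```

One possible answer to question 3, under these assumptions, is to let the caller name the property to project on when the graph carries more than one.
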