[Feedback] Initial experience with GraphScope

Describe the issue

Hello, I had the chance to try this framework for a few hours today. It's very exciting for me and for some of the problems I am researching and starting to solve. I have been looking at different ways to handle graph data for a couple of weeks, and I have found Apache Giraph really difficult to set up and develop for, and PySpark really slow or unfriendly for doing simple filtering on a graph. I wanted to share some of my experience as a new user of GraphScope, in the hope that it can help this tool improve. Please don't take any of this negatively; that isn't my intention at all.

There are a few things I've written below that I think deserve their own issue threads. I didn't go into much detail because I am planning to submit separate issues. I just wanted to share some first impressions and my own experience of getting started with this tool, especially if it will help you improve GraphScope.

General feedback points so far:

The good

Current tools: I discovered this project from this paper. I was excited to see another group trying to solve this problem, and doing it better than existing frameworks. I have had the chance to try both PySpark and Giraph recently for graph processing, and the development experience has been really challenging for me. Right now my company is experimenting with approaches to a problem we want to start solving shortly.
Documentation: I love your docs, and it looks like a lot of effort was dedicated to making them easy for a new user. I was having a very difficult time getting Giraph running, and then an even harder time finding documentation and examples for getting a Giraph job running.
Local development: installation is incredibly easy. A single pip install for anyone wanting to try this on their local machine is great and much better than most other frameworks out there.
Kubernetes: I've worked with Kubernetes for a few years at various levels (deploying clusters with scripts, maintaining those clusters, developing services on it, automating DevOps processes for deployment, etc.) and think that it's a great deployment environment and much easier than Hadoop. To be honest, I haven't looked into that area of deployment for GraphScope yet. However, I did share a couple of thoughts below that may be interesting, based on my own prior experience.
JupyterLab: It was amazing to be able to try GraphScope in 5 mins or less by launching it automatically. I really love this aspect, thanks for making it easy.

Some additional thoughts

Adoption: The first time I read this paper (very quickly), I thought that Alibaba was already using GraphScope in production, and it looked perfect for the problem we want to solve. After re-reading it, I'm under the impression that the production use case involves Spark (PySpark?), Giraph, JanusGraph and TensorFlow. Now it makes sense why a single tool covering all four of those would be valuable, especially for simplifying the development experience of an engineer working with graph data. It still isn't very clear whether this project is already being used in production, and I started wondering whether replacing the existing infrastructure is in progress or planned. I have unfortunately seen many projects get abandoned due to changes in business interest, and I'm just wondering if this has good support from the business.
Documentation: I tried running some of the examples in the docs and had some compile issues, some due to C++ (I believe) and then some others. I experienced my session crashing in JupyterLab and I had a really hard time recovering it because of this line. I had to log out, terminate all of the kernels and do a few other things to get it running again. I submitted #1150 already and I will submit others as soon as I can return to them.
Datasets: I really liked that MAG is available as a starting dataset. I wasn't able to see some of the properties listed here. Is the provided dataset already pre-processed for demonstration purposes in the docs? From the docs I wasn't able to tell if there were other datasets available, such as a larger MAG dataset or a different dataset altogether. I saw that there are other popular graph datasets available; are they possibly something to consider adding in the future?
GNN example: I ran the notebook "10. Revisit classification on citation network on k8s". I noticed that the final accuracy for the trained GNN was 12%, though it's quite possible that I misinterpreted it. This isn't a big problem because I believe the original intention was to show how easy it is to use TensorFlow with graph data, and I really believe it is very easy with this tool. From the perspective of a new user, I think the tool would be much more compelling to adopt if the accuracy was higher; even just 65% would be better than a 'probability coin toss'.
Kubernetes: I was personally never a huge fan of Helm. It starts to get really complicated once any business logic is required for maintaining anything post-deployment. For example, if there are job failures you would need an additional custom control plane to handle them. Personally I think an operator (CRD) fills this use case exactly; I have created one before for automatically managing dynamic routing rules for development environments, and I might be able to provide an example if there's any interest in it. Is this something you've already considered?

Hi @michael-golfi

Thanks for your interest in GraphScope! And thank you for taking the time to give us such detailed and valuable feedback!

We are a team that has focused on graph computing for many years, and we have similar feelings to yours about Giraph, as well as some other graph systems. They can be hard for ordinary users to use. That motivated us to develop GraphScope. One of its goals is to make graph computing accessible to more end users, so being easy to set up and develop for is always among our first considerations. Glad to hear that the local deployment, cloud-native design and JupyterLab-powered playground earned your compliments!

For the second part, we also want to share our thoughts with you:

Adoption: GraphScope has been widely deployed in Alibaba. Its components are processing 20+ tasks daily in production. It serves as a key part of Alibaba's data infrastructure. However, as you said, there are certainly divergences from the open-source version. This is mainly because the infrastructure and application scenarios at Alibaba are very complex and different in many ways, such as its highly customized data centers, networking, storage, management and data-infrastructure systems. There are legacy internal applications as well. All of these result in divergences in deployment and functionality. That said, the key technologies and much of the core codebase are kept identical to the open-source version. We keep the open-source version as the trunk branch for active development. We aim to further converge the internal and open-source versions as much, and as quickly, as possible in the next few months. Stay tuned. To answer your question directly: yes, the project will definitely be maintained long-term. For the foreseeable future, we are fully committed to this open-source project and to graph computing in general. :)
Documentation: Thanks for your patience. I have to admit the documentation still needs a lot of improvement. For the particular issue, we will debug it and post updates on the issue thread. Thanks again for the bug report!
Datasets: The mag graph in GraphScope is a subset of the original MAG that was pre-processed for graph learning tasks, so it doesn't contain all the properties. It was actually downloaded from ogb. GraphScope ships some small datasets to make getting started easy. The list includes different kinds of graphs, e.g., homogeneous graphs, property graphs, etc. We plan to extend the list in the future, either ourselves or with help from the community (#1015).
GNN example: I cannot agree more with you! The demo would be more interesting if the accuracy were higher. Let me just add some explanation of the current status. The accuracy on the leaderboard for this task is in the range of 0.26-0.56, with more advanced and complex models. We chose a simple and classic model (GCN) for easy understanding; with 20 epochs it can achieve 26% in our test. In the tutorial the accuracy is too low because the default number of epochs is set to 5 in order to reduce training time. We will refine this tutorial by switching to a better model, or by changing the default number of epochs, to make the demo more reasonable and interesting.
Kubernetes: Yes, we considered supporting deployment via an operator long ago! In fact, as shown in #267, an operator is still a to-do item on our backlog. It would be greatly appreciated if you could share your example/experience with us!
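
On the GNN accuracy point above, it may help to compare the numbers against a uniform random-guess baseline rather than a two-way coin toss. The following is only a minimal sketch: the class count of 349 is an assumption (it is the number of venue classes in the ogbn-mag task as far as we know, and is not stated in this thread), and the helper names are hypothetical, not part of GraphScope.

```python
def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of guessing uniformly at random among num_classes labels."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes


def lift_over_chance(accuracy: float, num_classes: int) -> float:
    """How many times better than uniform random guessing an accuracy is."""
    return accuracy / chance_accuracy(num_classes)


# ASSUMPTION: 349 classes, as in the ogbn-mag venue prediction task.
NUM_CLASSES = 349

# Accuracies quoted in this thread: 12% (tutorial default, 5 epochs)
# and 26% (GCN with 20 epochs).
for label, acc in [("tutorial, 5 epochs", 0.12), ("GCN, 20 epochs", 0.26)]:
    lift = lift_over_chance(acc, NUM_CLASSES)
    print(f"{label}: accuracy={acc:.2f}, about {lift:.0f}x uniform chance")
```

Under that assumption, uniform guessing would score well below 1%, so 12% is low in absolute terms but is not comparable to a coin toss on a two-class problem. A majority-class baseline could be higher than the uniform one, so this is a lower bound on what a trivial predictor achieves.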

Thank you again, sincerely, for your valuable feedback! We look forward to your separate issues with any comments, suggestions, or bugs you encounter during the tryout. All of these contributions will help us improve GraphScope and take it a step further.
