alibaba / GraphScope

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba | 一站式图计算系统
https://graphscope.io
Apache License 2.0

[BUG] Performance degradation when worker scales in small datasets. #2834

Open haaappy opened 1 year ago

haaappy commented 1 year ago

Describe the bug: We tested page_rank on the datagen-7_5-fb.e dataset using graphscope:0.21.0 in k8s mode, but it failed in add_edges.

To Reproduce Steps to reproduce the behavior:

  1. create a session in k8s mode
  2. read the dataset
  3. create a graph using 'add_edges'
  4. 'add_edges' fails
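
A minimal sketch of step 1; the memory-related parameter names below are assumptions and may differ across GraphScope versions:

```python
import graphscope

# Step 1: create a session on an existing k8s cluster.
# k8s_engine_mem and vineyard_shared_mem are assumed parameter names for the
# memory settings; verify them against the graphscope 0.21.0 session API.
sess = graphscope.session(
    cluster_type="k8s",
    num_workers=2,
    k8s_engine_mem="16Gi",
    vineyard_shared_mem="16Gi",
)
```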

Expected behavior: the graph is created successfully.

Screenshots: We checked the coordinator pod logs in k8s. During 'add_edges', the logs show: create engine headless services ... kubernetes.client.exceptions.ApiException: (409) Reason: Conflict ... services 'gs-engine-onmitz-headless' already exists.

In the end, the coordinator pod stopped working.

Environment (please complete the following information):

Additional context: We tested the 'twitter.e' dataset in k8s mode and everything was OK. We also tested 'datagen-7_5-fb.e' in hosts mode and it was OK. Maybe the bigger dataset does not work in k8s mode? (graphscope 0.21.0)

welcome[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template! And a maintainer will get back to you shortly! Please feel free to contact us on DingTalk, WeChat account(graphscope) or Slack. We are happy to answer your questions responsively.

siyuan0322 commented 1 year ago

This is because the coordinator crashes and then tries to restart; it shouldn't try to pull up the resources again, which is what causes the 'already exists' error.

The failure of add_edges is most likely due to insufficient memory. What's your startup configuration for the session? And what's the resource spec of your k8s cluster?

Pomelochen commented 1 year ago

> This is because the coordinator crashes and then tries to restart; it shouldn't try to pull up the resources again, which is what causes the 'already exists' error.
>
> The failure of add_edges is most likely due to insufficient memory. What's your startup configuration for the session? And what's the resource spec of your k8s cluster?

The edge data is almost 1 GB, and we set all the session memory params to larger than 16 GB. The k8s cluster has 250 GB of memory on each node.

siyuan0322 commented 1 year ago

> The edge data is almost 1 GB, and we set all the session memory params to larger than 16 GB. The k8s cluster has 250 GB of memory on each node.

It's weird if the memory is enough. I would like to reproduce it. Could you provide the Python scripts that reproduce this error? I can find the dataset myself.

Pomelochen commented 1 year ago

> The edge data is almost 1 GB, and we set all the session memory params to larger than 16 GB. The k8s cluster has 250 GB of memory on each node.
>
> It's weird if the memory is enough. I would like to reproduce it. Could you provide the Python scripts that reproduce this error? I can find the dataset myself.

I just create the session and process the dataset as dataframes without attributes: v_df has a vid column, and e_df has from_id and to_id columns. Then: G = session.g(); graph = G.add_vertices(v_df).add_edges(e_df).
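
A minimal sketch of that script, assuming a space-separated edge list; the file path, separator, and column handling are illustrative assumptions:

```python
import pandas as pd
import graphscope

sess = graphscope.session(cluster_type="k8s", num_workers=2)

# Edge list as a plain dataframe with no attribute columns.
e_df = pd.read_csv("/path/to/datagen-7_5-fb.e", sep=" ",
                   names=["from_id", "to_id"])
# Vertex ids collected from both endpoints.
v_df = pd.DataFrame({"vid": pd.unique(e_df[["from_id", "to_id"]].values.ravel())})

G = sess.g()
graph = G.add_vertices(v_df).add_edges(e_df)  # this is the step that fails in k8s mode
```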

siyuan0322 commented 1 year ago

I got it, you load the graph from dataframes instead of reading from files. Reading from dataframes is meant as a convenient way to load small chunks; it doesn't perform well when loading large data.

Could you load the graph from files instead? I believe that would solve the problem. You could bind a volume so that a host path is mounted into the pods.

Reference: https://graphscope.io/docs/deployment/deploy_graphscope_on_self_managed_k8s#mount-volumes
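
A minimal sketch of the file-based path, assuming a hostPath volume as described in the linked doc; the exact k8s_volumes keys, labels, and file paths are assumptions to verify against your GraphScope version:

```python
import graphscope
from graphscope.framework.loader import Loader

# Mount a host directory into the pods so the engines can read the files directly.
sess = graphscope.session(
    cluster_type="k8s",
    num_workers=2,
    k8s_volumes={
        "data": {
            "type": "hostPath",
            "field": {"path": "/mnt/datasets", "type": "Directory"},
            "mounts": {"mountPath": "/datasets"},
        }
    },
)

G = sess.g()
graph = (
    G.add_vertices(Loader("/datasets/datagen-7_5-fb.v"), label="v")
     .add_edges(Loader("/datasets/datagen-7_5-fb.e"), label="e")
)
```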

Pomelochen commented 1 year ago

> I got it, you load the graph from dataframes instead of reading from files. Reading from dataframes is meant as a convenient way to load small chunks; it doesn't perform well when loading large data.
>
> Could you load the graph from files instead? I believe that would solve the problem. You could bind a volume so that a host path is mounted into the pods.
>
> Reference: https://graphscope.io/docs/deployment/deploy_graphscope_on_self_managed_k8s#mount-volumes

Thank you, I will try to load the graph from files. By the way, I tried to test the session param num_workers. I ran the pagerank algorithm on the dataset twitter.e with num_workers set to 1, 2, and 4. However, the more workers, the slower the program runs. It seems the algorithm can't run in parallel.
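
A minimal sketch of that num_workers experiment; the file path, label, and timing method are assumptions:

```python
import time
import graphscope
from graphscope import pagerank
from graphscope.framework.loader import Loader

for n in (1, 2, 4):
    sess = graphscope.session(cluster_type="k8s", num_workers=n)
    # Depending on the version, the property graph may need a .project(...) to a
    # simple graph before running the analytical app.
    g = sess.g().add_edges(Loader("/datasets/twitter.e"), label="e")
    start = time.time()
    ctx = pagerank(g)  # built-in analytical app; result context not inspected here
    print(f"num_workers={n}: {time.time() - start:.3f}s")
    sess.close()
```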

siyuan0322 commented 1 year ago

On hosts or k8s mode? In hosts mode, grape_engine will try to utilize std::hardware_concurrency() threads per worker, so with 4 workers on a single machine there is extra overhead from inter-process communication. It's therefore possible that in hosts mode, within a single machine, more workers make it slower.

Pomelochen commented 1 year ago

> On hosts or k8s mode? In hosts mode, grape_engine will try to utilize std::hardware_concurrency() threads per worker, so with 4 workers on a single machine there is extra overhead from inter-process communication. It's therefore possible that in hosts mode, within a single machine, more workers make it slower.

I tested the workers in both hosts and k8s mode and got the same result, so I'm puzzled. I tested k8s mode with the official docker image. If I want to run in parallel, do I need to set something else in addition to this param?

siyuan0322 commented 1 year ago

I'll try to reproduce that in k8s mode. 😂

Pomelochen commented 1 year ago

Ok, thank you🤝

siyuan0322 commented 1 year ago

Could you please give me your testing script if convenient?

siyuan0322 commented 1 year ago

Related to #2898

siyuan0322 commented 1 year ago

Confirmed the performance degradation on a dataset with 12,983,637 edges. A simple test gives:

| workers | local time | k8s time |
| --- | --- | --- |
| 1 | 0.271543 | 0.246835 |
| 2 | 0.556023 | 0.693073 |
| 4 | 0.992 | 0.946844 |