alibaba / GraphScope

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba
https://graphscope.io
Apache License 2.0

[BUG] Coordinator pod on Helm deployment fails when ReplicaSet or StatefulSet are present via Helm #3728

Open Vetchu opened 5 months ago

Vetchu commented 5 months ago

Describe the bug: When deploying GraphScope with Helm, the coordinator sometimes fails during startup. I have reproduced this multiple times.

Concrete message:

kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '736263cd-0bc2-4de6-970e-849608c02d8b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'e749963a-6c93-4734-ac08-fec4051f6564', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'fcaae050-bc69-4fdd-a56b-6fb27f4774b8', 'Date': 'Fri, 19 Apr 2024 16:13:22 GMT', 'Content-Length': '254'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"statefulsets.apps \"gs-engine-graphscope\" already exists","reason":"AlreadyExists","details":{"name":"gs-engine-graphscope","group":"apps","kind":"statefulsets"},"code":409}
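For reference, the response body above is a Kubernetes Status object, and its `reason` and `details` fields name the conflicting resource. A minimal Python sketch of reading it with the standard library only; the helper name `describe_conflict` is hypothetical and is not part of GraphScope or the Kubernetes client:

```python
import json

# Response body copied from the error message above.
BODY = r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"statefulsets.apps \"gs-engine-graphscope\" already exists","reason":"AlreadyExists","details":{"name":"gs-engine-graphscope","group":"apps","kind":"statefulsets"},"code":409}'


def describe_conflict(body: str) -> str:
    """Summarize a Kubernetes Status body (hypothetical helper)."""
    status = json.loads(body)
    details = status.get("details", {})
    return "{code} {reason}: {kind}.{group}/{name}".format(
        code=status.get("code"),
        reason=status.get("reason"),
        kind=details.get("kind"),
        group=details.get("group"),
        name=details.get("name"),
    )


print(describe_conflict(BODY))
```

So the failing call is a create request for a StatefulSet named gs-engine-graphscope that is already present in the namespace.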

It looks to me like the coordinator is trying to create, and so take control of, resources that are already managed by Helm.
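One way to check this suspicion is to inspect the metadata of the existing StatefulSet. The sketch below assumes the object carries the label and annotation that Helm 3 uses to mark ownership (`app.kubernetes.io/managed-by` and `meta.helm.sh/release-name`). The function name and the sample metadata are illustrative and were not taken from a real cluster:

```python
def helm_release_of(metadata: dict):
    """Return the Helm release name if the metadata marks the object
    as Helm-managed, otherwise None (hypothetical helper)."""
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}
    if labels.get("app.kubernetes.io/managed-by") != "Helm":
        return None
    return annotations.get("meta.helm.sh/release-name")


# Illustrative metadata, shaped like the "metadata" field returned by
# `kubectl get statefulset gs-engine-graphscope -o json`.
sample = {
    "name": "gs-engine-graphscope",
    "labels": {"app.kubernetes.io/managed-by": "Helm"},
    "annotations": {"meta.helm.sh/release-name": "graphscope"},
}

print(helm_release_of(sample))
print(helm_release_of({"name": "created-by-coordinator"}))
```

If the function returns a release name for the real object, the StatefulSet was installed by the chart; if it returns None, the object more likely came from an earlier coordinator run.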

To reproduce:

  1. Deploy via helm:
    helm upgrade --install graphscope graphscope/graphscope -f values.yaml --wait
  2. Watch the container of the coordinator-graphscope deployment restart repeatedly.

Expected behavior: The coordinator starts up consistently.
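A common way to make resource creation tolerate restarts is to treat HTTP 409 as "already there" rather than as a fatal error. The sketch below shows that pattern in isolation. It is not GraphScope's code and not a confirmed fix: `FakeApiException`, `create_or_reuse`, and `fake_create` are hypothetical stand-ins (the real `kubernetes.client.exceptions.ApiException` also exposes a `status` attribute), and whether reusing a Helm-managed StatefulSet is safe here remains an open question.

```python
class FakeApiException(Exception):
    """Stand-in for kubernetes.client.exceptions.ApiException (hypothetical)."""

    def __init__(self, status: int):
        super().__init__(f"({status})")
        self.status = status


def create_or_reuse(create, name: str) -> str:
    """Call create(name); treat 409 Conflict as 'already exists'
    and re-raise every other API error."""
    try:
        create(name)
        return "created"
    except FakeApiException as exc:
        if exc.status == 409:
            return "reused"
        raise


# Names that already exist in the pretend cluster.
existing = {"gs-engine-graphscope"}


def fake_create(name: str) -> None:
    # Mimics the API server: creating an existing name yields 409.
    if name in existing:
        raise FakeApiException(409)
    existing.add(name)


print(create_or_reuse(fake_create, "gs-engine-graphscope"))
print(create_or_reuse(fake_create, "gs-other"))
```

With this pattern a restarted process would continue past the existing StatefulSet instead of exiting, at the cost of not verifying that the existing object matches the desired spec.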

Environment (please complete the following information):

Additional context: Full logs below.

2024-04-19 16:13:22,047 [INFO][coordinator:89]: Start server with args 
coordinator:
  deployment_name: coordinator-graphscope
  endpoint: null
  monitor: false
  monitor_port: 9090
  node_selector: null
  operator_mode: false
  resource:
    limits: null
    requests:
      cpu: 0.5
      memory: 512Mi
  service_port: 59001
hosts_launcher:
  dataset_download_retries: 3
  etcd:
    endpoint: null
    listening_client_port: 2379
    listening_peer_port: 2380
    replicas: 1
  hosts:
  - localhost
kubernetes_launcher:
  config_file: null
  dataset:
    enable: true
    proxy: bnVsbA==
  delete_namespace: false
  deployment_mode: eager
  engine:
    enable_gae: false
    enable_gae_java: false
    enable_gie: false
    enable_gle: false
    enabled_engines: analytical,interactive
    gae_resource:
      limits: null
      requests:
        cpu: '1'
        memory: 1Gi
    gie_executor_resource:
      limits: null
      requests:
        cpu: '1'
        memory: 1Gi
    gie_frontend_resource:
      limits: null
      requests:
        cpu: '0.5'
        memory: 1Gi
    gle_resource:
      limits: null
      requests:
        cpu: '0.2'
        memory: 0.2Gi
    node_selector: null
    preemptive: true
  image:
    pull_policy: IfNotPresent
    pull_secrets: []
    registry: registry.cn-hongkong.aliyuncs.com
    repository: graphscope
    tag: 0.27.0
  mars:
    enable: false
    scheduler_resource:
      limits: null
      requests:
        cpu: 0.2
        memory: 4Mi
    worker_resource:
      limits: null
      requests:
        cpu: 0.2
        memory: 4Mi
  namespace: default
  service_type: NodePort
  volumes: null
  waiting_for_delete: false
launcher_type: k8s
log_level: info
operator_launcher:
  gae_endpoint: ''
  hosts: []
  namespace: default
session:
  dangling_timeout_seconds: -1
  default_local_num_workers: 1
  execution_mode: eager
  instance_id: graphscope
  num_workers: 1
  reconnect: false
  retry_time_seconds: 1
  timeout_seconds: 1200
show_log: false
solution: GraphScope One
vineyard:
  deployment_name: null
  image: vineyardcloudnative/vineyardd:latest
  resource:
    limits:
      cpu: '0.5'
      memory: 512Mi
    requests:
      cpu: '0.5'
      memory: 512Mi
  rpc_port: 9600
  socket: null

2024-04-19 16:13:22,047 [INFO][launcher:44]: Failed to resolve the openmpi path, moving towards the system-wide one
2024-04-19 16:13:22,060 [INFO][kubernetes_launcher:816]: Creating engine pods...
2024-04-19 16:13:23,321 [ERROR][kubernetes_launcher:1216]: Error when launching GraphScope on kubernetes cluster
Traceback (most recent call last):
  File "/home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/kubernetes_launcher.py", line 1210, in start
    self._create_services()
  File "/home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/kubernetes_launcher.py", line 871, in _create_services
    self._create_engine_stateful_set()
  File "/home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/kubernetes_launcher.py", line 827, in _create_engine_stateful_set
    response = self._apps_api.create_namespaced_stateful_set(
  File "/home/graphscope/.local/lib/python3.10/site-packages/kubernetes/client/api/apps_v1_api.py", line 639, in create_namespaced_stateful_set
    return self.create_namespaced_stateful_set_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/home/graphscope/.local/lib/python3.10/site-packages/kubernetes/client/api/apps_v1_api.py", line 738, in create_namespaced_stateful_set_with_http_info
    return self.api_client.call_api(
  File "/home/graphscope/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/graphscope/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/graphscope/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/home/graphscope/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/home/graphscope/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '736263cd-0bc2-4de6-970e-849608c02d8b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'e749963a-6c93-4734-ac08-fec4051f6564', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'fcaae050-bc69-4fdd-a56b-6fb27f4774b8', 'Date': 'Fri, 19 Apr 2024 16:13:22 GMT', 'Content-Length': '254'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"statefulsets.apps \"gs-engine-graphscope\" already exists","reason":"AlreadyExists","details":{"name":"gs-engine-graphscope","group":"apps","kind":"statefulsets"},"code":409}

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/lib/python3.10/runpy.py:196 in _run_module_as_main                      │
│                                                                              │
│   193 │   main_globals = sys.modules["__main__"].__dict__                    │
│   194 │   if alter_argv:                                                     │
│   195 │   │   sys.argv[0] = mod_spec.origin                                  │
│ ❱ 196 │   return _run_code(code, main_globals, None,                         │
│   197 │   │   │   │   │    "__main__", mod_spec)                             │
│   198                                                                        │
│   199 def run_module(mod_name, init_globals=None,                            │
│                                                                              │
│ /usr/lib/python3.10/runpy.py:86 in _run_code                                 │
│                                                                              │
│    83 │   │   │   │   │      __loader__ = loader,                            │
│    84 │   │   │   │   │      __package__ = pkg_name,                         │
│    85 │   │   │   │   │      __spec__ = mod_spec)                            │
│ ❱  86 │   exec(code, run_globals)                                            │
│    87 │   return run_globals                                                 │
│    88                                                                        │
│    89 def _run_module_code(code, init_globals=None,                          │
│                                                                              │
│ /home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/__main__. │
│ py:3 in <module>                                                             │
│                                                                              │
│   1 from gscoordinator.coordinator import launch_graphscope                  │
│   2                                                                          │
│ ❱ 3 launch_graphscope()                                                      │
│   4                                                                          │
│                                                                              │
│ /home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/coordinat │
│ or.py:91 in launch_graphscope                                                │
│                                                                              │
│    88 │   config_logging(config.log_level)                                   │
│    89 │   logger.info("Start server with args \n%s", config.dumps_yaml())    │
│    90 │                                                                      │
│ ❱  91 │   servicer = get_servicer(config)                                    │
│    92 │   start_server(servicer, config)                                     │
│    93                                                                        │
│    94                                                                        │
│                                                                              │
│ /home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/coordinat │
│ or.py:121 in get_servicer                                                    │
│                                                                              │
│   118 │   │   │   f"Expect {service_initializers.keys()} of solution paramet │
│   119 │   │   )                                                              │
│   120 │                                                                      │
│ ❱ 121 │   return initializer(config)                                         │
│   122                                                                        │
│   123                                                                        │
│   124 def start_server(                                                      │
│                                                                              │
│ /home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/servicer/ │
│ graphscope_one/service.py:591 in init_graphscope_one_service_servicer        │
│                                                                              │
│   588 │   if launcher is None:                                               │
│   589 │   │   raise RuntimeError(f"Expect {type2launcher.keys()} of launcher │
│   590 │                                                                      │
│ ❱ 591 │   return GraphScopeOneServiceServicer(                               │
│   592 │   │   launcher=launcher(config),                                     │
│   593 │   │   dangling_timeout_seconds=config.session.dangling_timeout_secon │
│   594 │   │   log_level=config.log_level,                                    │
│                                                                              │
│ /home/graphscope/.local/lib/python3.10/site-packages/gscoordinator/servicer/ │
│ graphscope_one/service.py:117 in __init__                                    │
│                                                                              │
│   114 │   │   self._launcher = launcher                                      │
│   115 │   │   self._launcher.set_session_workspace(self._session_id)         │
│   116 │   │   if not self._launcher.start():                                 │
│ ❱ 117 │   │   │   raise RuntimeError("Coordinator launching instance failed. │
│   118 │   │                                                                  │
│   119 │   │   self._operation_executor: OperationExecutor = OperationExecuto │
│   120 │   │   │   self._session_id, self._launcher, self._object_manager     │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Coordinator launching instance failed.
2024-04-19 16:13:23,399 [INFO][service:518]: Clean up resources, cleanup_instance: True, is_dangling: False
2024-04-19 16:13:23,434 [INFO][service:518]: Clean up resources, cleanup_instance: True, is_dangling: False

welcome[bot] commented 5 months ago

Thanks for opening your first issue here! Be sure to follow the issue template, and a maintainer will get back to you shortly. Please feel free to contact us on DingTalk, WeChat (account: graphscope), or Slack. We are happy to answer your questions promptly.

github-actions[bot] commented 5 months ago

/cc @yecol @sighingnow, this issue/PR has had no activity for a long time. Please help review its status and assign people to work on it.