AliyunContainerService / gpushare-device-plugin

GPU Sharing Device Plugin for Kubernetes Cluster
Apache License 2.0

Question: request gpushare on the same GPU #43

Open xhejtman opened 3 years ago

xhejtman commented 3 years ago

Hello,

Is it possible for several pods (or containers) to request a GPU share on any card, as long as they all get the same card? For example, if you have a StatefulSet consisting of an Xserver container and an application container, those two need to share the same GPU card. I could request e.g. 1 GiB of memory for each of the containers; however, if I have more than one GPU per node, I have no guarantee they use the same device, right?
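
For concreteness, a hedged sketch of the kind of spec this question describes. The names and images here are hypothetical and only the aliyun.com/gpu-mem resource name comes from this plugin; whether the two requests end up on the same physical card is exactly what is being asked:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: xserver-app               # hypothetical name
spec:
  serviceName: xserver-app
  replicas: 1
  selector:
    matchLabels:
      app: xserver-app
  template:
    metadata:
      labels:
        app: xserver-app
    spec:
      containers:
      - name: xserver
        image: registry.example.com/xserver:latest       # hypothetical image
        resources:
          limits:
            aliyun.com/gpu-mem: 1   # GiB
      - name: application
        image: registry.example.com/application:latest   # hypothetical image
        resources:
          limits:
            aliyun.com/gpu-mem: 1   # GiB; needs to land on the same card as the xserver container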

wsxiaozhang commented 2 years ago

By default, the gpushare scheduler tries to allocate pods to GPU cards with a "binpack first" policy. Binpack means that multiple pods requesting GPU memory are placed on the same card of the same node, in order to leave as many free GPU cards as possible for "big" jobs. In that case, yes, your two pods could well share the same GPU card. However, this is a best-effort policy: the two pods can still be allocated to different cards if the card where pod 1 was placed does not have enough memory left for pod 2; pod 2 is then placed on another card, as illustrated below.
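
As a hedged illustration of that fallback: assume 14 GiB cards (the size reported further down in this thread) and two hypothetical pods that each request 8 GiB. The first request fits on card 0; the second no longer does (only 6 GiB are left), so it is placed on another card even under binpack. The pod names and the 8 GiB figure are made up for the example; only the aliyun.com/gpu-mem resource name is from this thread:

apiVersion: v1
kind: Pod
metadata:
  name: sharer-a            # hypothetical
spec:
  containers:
  - name: worker
    image: cheyang/gpu-player:v2
    resources:
      limits:
        aliyun.com/gpu-mem: 8   # GiB; lands on card 0 of a 14 GiB card
---
apiVersion: v1
kind: Pod
metadata:
  name: sharer-b            # hypothetical
spec:
  containers:
  - name: worker
    image: cheyang/gpu-player:v2
    resources:
      limits:
        aliyun.com/gpu-mem: 8   # GiB; card 0 only has 6 GiB left, so this pod
                                # ends up on a different card despite binpack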

xhejtman commented 2 years ago

Is there any chance of extending this plugin so that it would be possible to request allocation from the same physical card? It would be useful for StatefulSet deployments where you might need to share the same physical GPU among all containers.

swood commented 2 years ago

I'm trying to use this plugin with your example code, and it does not seem to work as declared.

ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-11-11 22:22:59.635156: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-11-11 22:22:59.675793: E tensorflow/core/common_runtime/direct_session.cc:170] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
/usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
0.1
Traceback (most recent call last):
  File "/app/main.py", line 40, in <module>
    train(fraction)
  File "/app/main.py", line 23, in train
    sess = tf.Session(config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1482, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 622, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

This situation occurs when I increase the number of replicas to two:

apiVersion: apps/v1
kind: Deployment

metadata:
  name: binpack-1
  labels:
    app: binpack-1

spec:
  replicas: 2

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1

  template: # define the pod specification
    metadata:
      labels:
        app: binpack-1

    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 2

I have three cards with 14 GiB each. However, I am not able to run two copies of this software. Why?
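
For context, a minimal sketch of how an example like this typically turns the gpushare environment variables into the memory fraction printed above (the "0.1" line). The structure and the rounding are assumptions, not the actual source of cheyang/gpu-player; only the ALIYUN_COM_GPU_MEM_* variables, the train(fraction) call, and tf.Session(config=config) appear in the log:

import os
import tensorflow as tf  # TensorFlow 1.x API, matching the traceback above

def train(fraction):
    # Cap this process at the requested share of the card's memory.
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
    config = tf.ConfigProto(gpu_options=gpu_options)
    sess = tf.Session(config=config)  # the call that fails above with CUDA_ERROR_INVALID_DEVICE
    # ... model definition and training would follow here ...
    sess.close()

if __name__ == '__main__':
    # e.g. ALIYUN_COM_GPU_MEM_CONTAINER=2 and ALIYUN_COM_GPU_MEM_DEV=14 give roughly 0.14,
    # printed as 0.1 after rounding (the rounding is an assumption).
    allocated = float(os.environ.get('ALIYUN_COM_GPU_MEM_CONTAINER', '0'))
    total = float(os.environ.get('ALIYUN_COM_GPU_MEM_DEV', '1'))
    fraction = round(allocated / total, 1)
    print(fraction)
    train(fraction)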