Open xhejtman opened 3 years ago
By default, the gpushare scheduler allocates pods to GPU cards with a "binpack-first" policy: multiple pods that request GPU memory are placed on the same card of the same node, in order to leave as many GPU cards free as possible for "big" jobs. In that case, yes, your two pods could end up sharing the same GPU card. However, this is a best-effort policy: the two pods can still land on different cards if the card where pod1 was placed does not have enough memory for pod2; pod2 is then placed on another card.
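To make the policy concrete, here is a minimal sketch of the binpack idea in Python. This is an illustration of the placement heuristic described above, not the actual gpushare-scheduler-extender source; the function name and data layout are made up for the example.

```python
def pick_card(free_mem_per_card, request):
    """Return the index of the card to place `request` GiB on, or None.

    Binpack: among cards with enough free memory, prefer the one with
    the LEAST free memory, so that emptier cards stay free for big jobs.
    """
    candidates = [(free, i) for i, free in enumerate(free_mem_per_card)
                  if free >= request]
    if not candidates:
        return None  # no card can satisfy the request
    _, idx = min(candidates)  # least free memory that still fits
    return idx

# Three 14 GiB cards; two pods requesting 2 GiB each.
cards = [14, 14, 14]
first = pick_card(cards, 2)
cards[first] -= 2
second = pick_card(cards, 2)
print(first, second)  # both requests land on the same card
```

Note the best-effort caveat from above: if the partially used card had lacked 2 GiB of free memory, `pick_card` would have fallen through to a different card.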
Is there any chance of extending this plugin so that it is possible to request allocation from the same physical card? This would be useful for StatefulSet deployments where all containers need to share the same physical GPU.
I'm trying to use this plugin with your example code, and it seems it does not work as documented.
ALIYUN_COM_GPU_MEM_DEV=14
ALIYUN_COM_GPU_MEM_CONTAINER=2
2021-11-11 22:22:59.635156: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-11-11 22:22:59.675793: E tensorflow/core/common_runtime/direct_session.cc:170] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
/usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
0.1
Traceback (most recent call last):
File "/app/main.py", line 40, in <module>
train(fraction)
File "/app/main.py", line 23, in train
sess = tf.Session(config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1482, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 622, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
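For reference, the printed `0.1` before the traceback is the GPU memory fraction the example derives from the two environment variables shown above. The snippet below is an assumed reconstruction of that computation (the real cheyang/gpu-player image may differ in details); the `CUDA_ERROR_INVALID_DEVICE` failure happens before this fraction ever takes effect, i.e. the container cannot obtain a usable CUDA device at all.

```python
import os

# Env vars as injected by the gpushare device plugin (values from the log above).
os.environ["ALIYUN_COM_GPU_MEM_DEV"] = "14"        # total GiB on the card
os.environ["ALIYUN_COM_GPU_MEM_CONTAINER"] = "2"   # GiB granted to this pod

total = float(os.environ["ALIYUN_COM_GPU_MEM_DEV"])
granted = float(os.environ["ALIYUN_COM_GPU_MEM_CONTAINER"])
fraction = round(granted / total, 1)  # 2/14 -> 0.1, consistent with the log
print(fraction)

# The fraction would then be handed to TensorFlow 1.x, e.g.:
# config = tf.ConfigProto(
#     gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=fraction))
# sess = tf.Session(config=config)
```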
This situation occurs when I increase number of replicas to two:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 2
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pod specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 2
I have three cards with 14 GiB each. However, I am not able to run two copies of this software. Why?
Hello,
Is it possible for several pods to request a GPU share on any card, as long as it is the same card for all of them? E.g., if you have a StatefulSet consisting of an X server container and an application container, those two need to share the same GPU card. I would request about 1 GiB of memory for each of the containers; however, if I have more than one GPU per node, I have no guarantee that they use the same device, right?