[Discussion] Using the GPU in Docker fails; can I use the --gpus param?

m986883511 commented 3 years ago

This command works well:

docker run --rm -v $(pwd):/output deezer/spleeter-gpu:3.8-2stems separate -o /output /output/3t.mp3

But these commands fail:

docker run --rm -v $(pwd):/output --gpus all deezer/spleeter-gpu:3.8-2stems separate -o /output /output/3t.mp3

docker run --rm -v $(pwd):/output --runtime=nvidia deezer/spleeter-gpu:3.8-2stems separate -o /output /output/3t.mp3

The error is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[51,16,256,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node conv2d_transpose_4/conv2d_transpose}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_23/_907]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[51,16,256,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node conv2d_transpose_4/conv2d_transpose}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/spleeter", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/site-packages/spleeter/__main__.py", line 256, in entrypoint
    spleeter()
  File "/usr/local/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/spleeter/__main__.py", line 128, in separate
    separator.separate_to_file(
  File "/usr/local/lib/python3.8/site-packages/spleeter/separator.py", line 382, in separate_to_file
    sources = self.separate(waveform, audio_descriptor)
  File "/usr/local/lib/python3.8/site-packages/spleeter/separator.py", line 323, in separate
    return self._separate_tensorflow(waveform, audio_descriptor)
  File "/usr/local/lib/python3.8/site-packages/spleeter/separator.py", line 305, in _separate_tensorflow
    prediction = next(prediction_generator)
  File "/usr/local/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 631, in predict
    preds_evaluated = mon_sess.run(predictions)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 774, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1279, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1384, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.8/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1369, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1437, in run
    outputs = _WrappedSession.run(
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1200, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 957, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1180, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[51,16,256,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node conv2d_transpose_4/conv2d_transpose (defined at /lib/python3.8/site-packages/spleeter/model/functions/unet.py:164) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[strided_slice_23/_907]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[51,16,256,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node conv2d_transpose_4/conv2d_transpose (defined at /lib/python3.8/site-packages/spleeter/model/functions/unet.py:164) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node conv2d_transpose_4/conv2d_transpose:
 concatenate_3/concat (defined at /lib/python3.8/site-packages/spleeter/model/functions/unet.py:162)

Input Source operations connected to node conv2d_transpose_4/conv2d_transpose:
 concatenate_3/concat (defined at /lib/python3.8/site-packages/spleeter/model/functions/unet.py:162)

Original stack trace for 'conv2d_transpose_4/conv2d_transpose':
  File "/bin/spleeter", line 8, in <module>
    sys.exit(entrypoint())
  File "/lib/python3.8/site-packages/spleeter/__main__.py", line 256, in entrypoint
    spleeter()
  File "/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/lib/python3.8/site-packages/spleeter/__main__.py", line 128, in separate
    separator.separate_to_file(
  File "/lib/python3.8/site-packages/spleeter/separator.py", line 382, in separate_to_file
    sources = self.separate(waveform, audio_descriptor)
  File "/lib/python3.8/site-packages/spleeter/separator.py", line 323, in separate
    return self._separate_tensorflow(waveform, audio_descriptor)
  File "/lib/python3.8/site-packages/spleeter/separator.py", line 305, in _separate_tensorflow
    prediction = next(prediction_generator)
  File "/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 612, in predict
    estimator_spec = self._call_model_fn(features, None, ModeKeys.PREDICT,
  File "/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 568, in model_fn
    return builder.build_predict_model()
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 516, in build_predict_model
    tf.estimator.ModeKeys.PREDICT, predictions=self.outputs
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 318, in outputs
    self._build_outputs()
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 499, in _build_outputs
    self._outputs = self._build_output_waveform(self.masked_stfts)
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 342, in masked_stfts
    self._build_masked_stfts()
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 465, in _build_masked_stfts
    for instrument, mask in self.masks.items():
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 336, in masks
    self._build_masks()
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 432, in _build_masks
    output_dict = self.model_outputs
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 312, in model_outputs
    self._build_model_outputs()
  File "/lib/python3.8/site-packages/spleeter/model/__init__.py", line 211, in _build_model_outputs
    self._model_outputs = apply_model(
  File "/lib/python3.8/site-packages/spleeter/model/functions/unet.py", line 197, in unet
    return apply(apply_unet, input_tensor, instruments, params)
  File "/lib/python3.8/site-packages/spleeter/model/functions/__init__.py", line 44, in apply
    output_dict[out_name] = function(
  File "/lib/python3.8/site-packages/spleeter/model/functions/unet.py", line 164, in apply_unet
    up5 = conv2d_transpose_factory(conv_n_filters[0], (5, 5))((merge4))
  File "/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 776, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/lib/python3.8/site-packages/tensorflow/python/keras/layers/convolutional.py", line 1291, in call
    outputs = backend.conv2d_transpose(
  File "/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/lib/python3.8/site-packages/tensorflow/python/keras/backend.py", line 5177, in conv2d_transpose
    x = nn.conv2d_transpose(x, kernel, output_shape, strides,
  File "/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 2482, in conv2d_transpose
    return conv2d_transpose_v2(
  File "/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 2560, in conv2d_transpose_v2
    return gen_nn_ops.conv2d_backprop_input(
  File "/lib/python3.8/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1293, in conv2d_backprop_input
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 742, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3477, in _create_op_internal
    ret = Operation(
  File "/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
    self._traceback = tf_stack.extract_stack()

My system: Ubuntu 18.04.5 LTS. CUDA: NVIDIA-SMI 460.80, Driver Version: 460.80, CUDA Version: 11.2
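For a sense of scale, the size of the tensor named in the OOM message can be worked out by hand. The sketch below only does that arithmetic; it assumes "type float" means 32-bit floats (4 bytes per element), which is how TensorFlow normally reports float32.

```python
from math import prod

# Shape copied from the OOM message: shape[51,16,256,512]
shape = (51, 16, 256, 512)
# Assumption: "type float" is float32, i.e. 4 bytes per element
bytes_per_element = 4

num_elements = prod(shape)
size_bytes = num_elements * bytes_per_element
size_mib = size_bytes / 2**20

print(f"{num_elements} elements, {size_bytes} bytes, {size_mib:.0f} MiB")
```

So the allocation that failed is roughly 408 MiB for this one tensor. That does not mean the GPU has less than 408 MiB in total, only that the allocator could not find that much free memory at that point in the run.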

romi1502 commented 3 years ago

Hi @m986883511, I cannot reproduce your bug on a similar setup (Debian 10, NVIDIA driver version 460.84, CUDA version 11.2), where the provided commands run smoothly. I can reproduce a similar (but not identical) error when TensorFlow runs out of GPU RAM (with more than 8 GB of GPU RAM available, I don't have any issue). So maybe you don't have enough RAM available on the GPU you're using, either because the GPU's total RAM is too small or because other jobs are using some of it concurrently, leaving too little available memory.
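One way to check how much GPU memory is actually free before launching the container is to query nvidia-smi on the host. The following is a minimal sketch, not part of Spleeter: the helper names parse_gpu_memory and query_gpu_memory are made up for illustration, the sample row is invented, and it assumes nvidia-smi is on the PATH and reports plain integers in MiB.

```python
import subprocess

# nvidia-smi query flags; with "nounits" the memory values are printed in MiB
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into a list of dicts with MiB values."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, total, used, free = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"index": index, "total_mib": total, "used_mib": used, "free_mib": free}
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi on the host and return the parsed memory figures."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_memory(result.stdout)


# Invented sample row (not real output), used here so the parser can be tried
# on a machine without a GPU.
sample = "0, 8192, 6500, 1692\n"
parsed = parse_gpu_memory(sample)
print(parsed)
```

On a machine with an NVIDIA driver installed, calling query_gpu_memory() returns the same structure for the real GPUs. If free_mib is far below the total, another process is probably holding memory on that GPU.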

m986883511 commented 3 years ago

I fixed it. You can use my container, available on my Docker Hub page: https://hub.docker.com/u/m986883511

m986883511 commented 3 years ago

I solved this problem by using the tensorflow/tensorflow:2.3.0-gpu base image. Thank you very much.

deezer / spleeter

[Discussion] Using the GPU in Docker fails; can I use the --gpus param? #647