deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License

Tensorflow crashes due to abnormally high GPU RAM allocation #681

Closed Tetsujinfr closed 3 years ago

Tetsujinfr commented 3 years ago

I installed spleeter through poetry.

I activated the poetry virtual env where all the dependencies are installed.

When I run the spleeter test command, I get the long traceback below, and I can see my GPU RAM usage climb and max out at 23.7GB (I have a 24GB GPU card, which I think should be enough):

(spleeter-iG7E_J6Q-py3.8) tetsfr@tetsfr:~/spleeter$ spleeter separate -o output audio_example.mp3 
Traceback (most recent call last):
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1375, in _do_call
    return fn(*args)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1359, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1451, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node conv2d/Conv2D}}]]
     [[strided_slice_23/_907]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node conv2d/Conv2D}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/bin/spleeter", line 5, in <module>
    entrypoint()
  File "/home/tetsfr/spleeter/spleeter/__main__.py", line 256, in entrypoint
    spleeter()
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/tetsfr/spleeter/spleeter/__main__.py", line 128, in separate
    separator.separate_to_file(
  File "/home/tetsfr/spleeter/spleeter/separator.py", line 378, in separate_to_file
    sources = self.separate(waveform, audio_descriptor)
  File "/home/tetsfr/spleeter/spleeter/separator.py", line 319, in separate
    return self._separate_tensorflow(waveform, audio_descriptor)
  File "/home/tetsfr/spleeter/spleeter/separator.py", line 301, in _separate_tensorflow
    prediction = next(prediction_generator)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 631, in predict
    preds_evaluated = mon_sess.run(predictions)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 775, in run
    return self._sess.run(
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1280, in run
    return self._sess.run(
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1385, in run
    raise six.reraise(*original_exc_info)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1370, in run
    return self._sess.run(*args, **kwargs)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1438, in run
    outputs = _WrappedSession.run(
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
    return self._sess.run(*args, **kwargs)
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1368, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/tetsfr/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node conv2d/Conv2D (defined at /spleeter/spleeter/model/functions/unet.py:109) ]]
     [[strided_slice_23/_907]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node conv2d/Conv2D (defined at /spleeter/spleeter/model/functions/unet.py:109) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node conv2d/Conv2D:
 strided_slice_3 (defined at /spleeter/spleeter/model/__init__.py:305)

Input Source operations connected to node conv2d/Conv2D:
 strided_slice_3 (defined at /spleeter/spleeter/model/__init__.py:305)

Original stack trace for 'conv2d/Conv2D':
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/bin/spleeter", line 5, in <module>
    entrypoint()
  File "/spleeter/spleeter/__main__.py", line 256, in entrypoint
    spleeter()
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/spleeter/spleeter/__main__.py", line 128, in separate
    separator.separate_to_file(
  File "/spleeter/spleeter/separator.py", line 378, in separate_to_file
    sources = self.separate(waveform, audio_descriptor)
  File "/spleeter/spleeter/separator.py", line 319, in separate
    return self._separate_tensorflow(waveform, audio_descriptor)
  File "/spleeter/spleeter/separator.py", line 301, in _separate_tensorflow
    prediction = next(prediction_generator)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 612, in predict
    estimator_spec = self._call_model_fn(features, None, ModeKeys.PREDICT,
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/spleeter/spleeter/model/__init__.py", line 568, in model_fn
    return builder.build_predict_model()
  File "/spleeter/spleeter/model/__init__.py", line 516, in build_predict_model
    tf.estimator.ModeKeys.PREDICT, predictions=self.outputs
  File "/spleeter/spleeter/model/__init__.py", line 318, in outputs
    self._build_outputs()
  File "/spleeter/spleeter/model/__init__.py", line 499, in _build_outputs
    self._outputs = self._build_output_waveform(self.masked_stfts)
  File "/spleeter/spleeter/model/__init__.py", line 342, in masked_stfts
    self._build_masked_stfts()
  File "/spleeter/spleeter/model/__init__.py", line 465, in _build_masked_stfts
    for instrument, mask in self.masks.items():
  File "/spleeter/spleeter/model/__init__.py", line 336, in masks
    self._build_masks()
  File "/spleeter/spleeter/model/__init__.py", line 432, in _build_masks
    output_dict = self.model_outputs
  File "/spleeter/spleeter/model/__init__.py", line 312, in model_outputs
    self._build_model_outputs()
  File "/spleeter/spleeter/model/__init__.py", line 211, in _build_model_outputs
    self._model_outputs = apply_model(
  File "/spleeter/spleeter/model/functions/unet.py", line 197, in unet
    return apply(apply_unet, input_tensor, instruments, params)
  File "/spleeter/spleeter/model/functions/__init__.py", line 44, in apply
    output_dict[out_name] = function(
  File "/spleeter/spleeter/model/functions/unet.py", line 109, in apply_unet
    conv1 = conv2d_factory(conv_n_filters[0], (5, 5))(input_tensor)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 783, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/keras/layers/convolutional.py", line 249, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 1012, in convolution_v2
    return convolution_internal(
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 1142, in convolution_internal
    return op(
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 2596, in _conv2d_expanded_batch
    return gen_nn_ops.conv2d(
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 969, in conv2d
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3557, in _create_op_internal
    ret = Operation(
  File "/.cache/pypoetry/virtualenvs/spleeter-iG7E_J6Q-py3.8/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

I have read that this may come from TF not seeing cuDNN, but when I check the python3 environment inside the virtual env, I do not see any TF issue at load time:

(spleeter-iG7E_J6Q-py3.8) tetsfr@tetsfr:~/spleeter$ python3
Python 3.8.10 (default, Sep 28 2021, 16:10:42) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-11-15 02:02:06.197757: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.compat.v1.Session()
2021-11-15 02:03:04.095697: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-11-15 02:03:04.135577: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.135973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.785GHz coreCount: 82 deviceMemorySize: 23.69GiB deviceMemoryBandwidth: 871.81GiB/s
2021-11-15 02:03:04.135990: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-11-15 02:03:04.138273: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-11-15 02:03:04.138298: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-11-15 02:03:04.139412: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-11-15 02:03:04.139556: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-11-15 02:03:04.139846: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-11-15 02:03:04.140356: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-11-15 02:03:04.140426: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-11-15 02:03:04.140478: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.140889: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.141260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-11-15 02:03:04.141474: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-15 02:03:04.142074: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.142526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.785GHz coreCount: 82 deviceMemorySize: 23.69GiB deviceMemoryBandwidth: 871.81GiB/s
2021-11-15 02:03:04.142569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.142973: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.143415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-11-15 02:03:04.143435: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-11-15 02:03:04.443087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-15 02:03:04.443111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2021-11-15 02:03:04.443116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2021-11-15 02:03:04.443216: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.443612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.443977: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-15 02:03:04.444332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15645 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
<tensorflow.python.client.session.Session object at 0x7faa7a0e1e50>

Any idea on what is wrong?

Could this come from my CUDA v11.3 installation and some issue with TF2.5 or with Spleeter?

In my regular python3 environment, I have TF2.8 installed and it works fine with the other repos I have tried, but from experience TF is a backward-compatibility nightmare, so, hum, yeah.

Thanks for your help

Tetsujinfr commented 3 years ago

OK,

so after some more searching I found a solution that worked like magic. I needed to set an environment variable to true, and then no more TF memory crash.

export TF_FORCE_GPU_ALLOW_GROWTH=true
(spleeter-iG7E_J6Q-py3.8) tetsfr@tetsfr:~/spleeter$ spleeter separate -o output audio_example.mp3 
INFO:spleeter:File output/audio_example/vocals.wav written succesfully
INFO:spleeter:File output/audio_example/accompaniment.wav written succesfully

Oh boy, I hate TF so much. Problem solved anyway, closing this.
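For anyone who wants to apply the same fix from a script instead of the shell, here is a minimal sketch. By default TensorFlow tries to reserve nearly all GPU memory up front, and setting `TF_FORCE_GPU_ALLOW_GROWTH=true` asks it to allocate memory on demand instead. A plausible reading of the error above is that this left no room for cuDNN to initialize, but that is an inference and was not confirmed in this thread. The helper names `growth_env` and `run_spleeter` are hypothetical, not part of Spleeter; the command line is the one used in the issue.

```python
import os
import subprocess


def growth_env(base=None):
    """Return a copy of the environment with GPU memory growth enabled.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    # Same variable as the `export` in the comment above.
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def run_spleeter(audio_path, output_dir="output"):
    """Run the CLI command from the issue with the variable set.

    Assumes the `spleeter` executable is on PATH (for example, inside
    the activated poetry virtual env).
    """
    cmd = ["spleeter", "separate", "-o", output_dir, audio_path]
    return subprocess.run(cmd, env=growth_env(), check=True)


# Example call (not executed here):
# run_spleeter("audio_example.mp3")
```

Passing the variable only to the child process keeps the setting scoped to that one spleeter run, whereas `export` affects every later command in the shell session.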