coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Error building extension 'transformer_inference' when using `use_deepspeed` #3612

Open weijia-yu opened 8 months ago

weijia-yu commented 8 months ago

Describe the bug

I am using manual streaming mode in Colab, and it raises the following error:

CalledProcessError                        Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   2099         stdout_fileno = 1
-> 2100         subprocess.run(
   2101             command,

22 frames
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   2114         if hasattr(error, 'output') and error.output:  # type: ignore[union-attr]
   2115             message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}"  # type: ignore[union-attr]
-> 2116         raise RuntimeError(message) from e
   2117 
   2118 

RuntimeError: Error building extension 'transformer_inference'

To Reproduce

  1. Go to this colab project https://colab.research.google.com/drive/145YOi_cNbs9nvk4mow-kqh2nAdKcHhYA?usp=sharing
  2. Choose T4 GPU
  3. Run all and it will show the error

If I comment out `use_deepspeed=True`, it runs successfully.
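For reference, the workaround amounts to toggling one flag in the XTTS manual-loading call. The snippet below is a minimal sketch of that setup (the checkpoint paths are placeholders, not the paths from the Colab notebook):

```python
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Placeholder paths: point these at your downloaded XTTS checkpoint directory.
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")

model = Xtts.init_from_config(config)
# use_deepspeed=False skips JIT-compiling DeepSpeed's 'transformer_inference'
# CUDA op, which is what fails here, at the cost of slower streaming inference.
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=False)
model.cuda()
```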

Expected behavior

It should produce the audio with no build error.

Logs

Loading model...
[2024-02-28 06:10:51,383] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-28 06:10:51,730] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.10.3, git-hash=unknown, git-branch=unknown
[2024-02-28 06:10:51,733] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2024-02-28 06:10:51,734] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2024-02-28 06:10:51,736] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   2099         stdout_fileno = 1
-> 2100         subprocess.run(
   2101             command,

22 frames
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
   2114         if hasattr(error, 'output') and error.output:  # type: ignore[union-attr]
   2115             message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}"  # type: ignore[union-attr]
-> 2116         raise RuntimeError(message) from e
   2117 
   2118 

RuntimeError: Error building extension 'transformer_inference'

### Environment

```shell
{
    "CUDA": {
        "GPU": [
            "Tesla T4"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023"
    }
}
```

Additional context

No response

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

Airgods commented 3 months ago

This problem is hard to solve; it seems to be a CUDA version issue.
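The comment above suspects a CUDA version mismatch. A quick sanity check is whether the CUDA toolkit on the VM (which ninja/nvcc uses to build DeepSpeed's op) matches the CUDA version PyTorch was built against (12.1 in the environment above). The helper below is a hypothetical illustration of that major-version check, not part of TTS or DeepSpeed:

```python
def cuda_majors_match(toolkit_version: str, torch_cuda_version: str) -> bool:
    """Return True when the system CUDA toolkit and PyTorch's CUDA build agree
    on the major version; DeepSpeed's JIT-compiled ops typically fail to build
    when they diverge."""
    return toolkit_version.split(".")[0] == torch_cuda_version.split(".")[0]

# On the failing runtime, compare the two versions:
#   python -c "import torch; print(torch.version.cuda)"   # CUDA PyTorch was built with
#   nvcc --version                                        # toolkit installed on the VM
print(cuda_majors_match("12.1", "12.1"))  # True: build should work
print(cuda_majors_match("11.8", "12.1"))  # False: expect build failures
```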

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

JUN-ZZ commented 3 weeks ago

@weijia-yu @Airgods How did you solve it?