SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Limited GPU Utilization with NVIDIA RTX 4000 Ada Gen #844

Open · James-Shared-Studios opened this issue 1 month ago

James-Shared-Studios commented 1 month ago

I am experiencing limited GPU utilization with an NVIDIA RTX 4000 Ada Generation card. System details:

- OS: Windows 10 1809
- CPU: AMD EPYC 3251 8-Core Processor @ 2.5 GHz
- RAM: 32 GB
- GPU: NVIDIA RTX 4000 Ada Generation, 20 GB
- CUDA Toolkit: 12.3
- GPU driver: 546.12

Python code:

    import os
    import time

    from faster_whisper import WhisperModel

    device = 'cuda'
    compute_type = 'int8_float16'
    model_size = 'medium.en'

    print("Loading model...")

    start_time = time.time()
    model = WhisperModel(model_size, device=device,
                         compute_type=compute_type)
    execution_time = time.time() - start_time
    print(f"Model loading time: {execution_time:.2f} seconds")

    folder_path = r"C:\Users\XYZ\Downloads\AI voice"
    max_new_tokens = 10
    beam_size = 10
    total_processing_time = 0.0  # was referenced below but never initialized

    for filename in os.listdir(folder_path):
        if filename.endswith((".mp3", ".m4a", ".mp4", ".wav")):
            file_path = os.path.join(folder_path, filename)
            print(f"Transcribing file: {file_path}")
            start_time = time.time()
            segments, _ = model.transcribe(file_path,
                                           beam_size=beam_size,
                                           max_new_tokens=max_new_tokens,
                                           word_timestamps=False,
                                           prepend_punctuations="",
                                           append_punctuations="",
                                           language="en",
                                           condition_on_previous_text=False)
            for segment in segments:
                print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
            execution_time = time.time() - start_time
            print(f"Execution time: {execution_time:.2f} seconds")
            total_processing_time += execution_time

While running my code, I'm only observing around 10% GPU utilization.


However, the same code achieves 100% utilization on an NVIDIA GeForce RTX 4070.

Napuh commented 1 month ago

Try repeating the test, but this time watch the CUDA graph, which shows actual CUDA utilization.

To do that, open one of the engine drop-downs in Task Manager's GPU tab and select "Cuda".
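As an alternative to eyeballing Task Manager, utilization can be sampled programmatically with NVML. A minimal sketch, assuming the `nvidia-ml-py` package (import name `pynvml`) and an NVIDIA driver are installed:

```python
# Sample GPU utilization via NVML while a transcription job runs elsewhere.
import time

def summarize(samples):
    """Reduce a list of utilization percentages to (mean, peak)."""
    if not samples:
        return 0.0, 0
    return sum(samples) / len(samples), max(samples)

def sample_gpu_utilization(seconds=10, interval=0.5, device_index=0):
    """Poll NVML for `seconds`, one utilization reading per interval."""
    import pynvml  # requires the nvidia-ml-py package and an NVIDIA GPU
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    deadline = time.time() + seconds
    while time.time() < deadline:
        samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(interval)
    pynvml.nvmlShutdown()
    return samples

# Usage (run while transcribing in another process):
# mean, peak = summarize(sample_gpu_utilization(seconds=10))
# print(f"mean {mean:.1f}% / peak {peak}%")
```

Note that NVML's "utilization" counts the fraction of time any kernel was running, so it can still overstate how busy the GPU really is.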

James-Shared-Studios commented 1 month ago

For CUDA, it barely reaches 70% utilization.

Napuh commented 1 month ago

How does it compare with a bigger model?

phineas-pta commented 1 month ago

You should compare speed; utilization matters less.

James-Shared-Studios commented 1 month ago

> You should compare speed; utilization matters less.

The average processing time per file with the GeForce RTX 4070 is 0.16 seconds, compared to 0.51 seconds with the RTX 4000 Ada. I would expect faster performance from the RTX 4000 Ada, which is why I was wondering whether it has been limited in some way.
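Those averages put the gap at roughly 3x; a quick check using the numbers from this thread:

```python
# Per-file averages reported above (short clips, so they include overhead).
geforce_4070 = 0.16   # seconds per file
rtx_4000_ada = 0.51   # seconds per file
slowdown = rtx_4000_ada / geforce_4070
print(f"RTX 4000 Ada is about {slowdown:.1f}x slower on these clips")
```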

James-Shared-Studios commented 1 month ago

> How does it compare with a bigger model?

The results are the same for large-v1, large-v2, and large-v3.

phineas-pta commented 1 month ago

> I would expect faster performance from RTX 4000 Ada

No, you should expect the opposite: the 4070 is faster.

James-Shared-Studios commented 1 month ago

> > I would expect faster performance from RTX 4000 Ada
>
> No, you should expect the opposite: the 4070 is faster.

Why is that? Could you provide more context, please? Thank you.

phineas-pta commented 1 month ago

Since the model fits in GPU memory, VRAM capacity is not a factor; it comes down to memory bandwidth (which matters more when the CUDA core counts are not very different).

You can take a look at their theoretical FP32 & FP16 performance:
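A quick sanity check of the bandwidth argument, using approximate spec-sheet figures (these numbers are not from the thread; verify them against NVIDIA's datasheets):

```python
# Approximate memory bandwidth from public spec sheets:
# GeForce RTX 4070:        192-bit GDDR6X, ~504 GB/s
# RTX 4000 Ada Generation: 160-bit GDDR6,  ~360 GB/s
bw_4070 = 504.0
bw_4000_ada = 360.0
ratio = bw_4070 / bw_4000_ada
print(f"Bandwidth advantage of the 4070: {ratio:.2f}x")
```

So bandwidth alone predicts the 4070 being around 1.4x faster, not 3x, which points at something else (such as per-file overhead) inflating the measured gap.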

James-Shared-Studios commented 1 month ago

The 4070 does 29.15 TFLOPS FP16 (half) vs. 26.73 TFLOPS (1:1 ratio) for the RTX 4000 Ada, so the RTX 4000 Ada should not be three times slower than the 4070, correct?

phineas-pta commented 1 month ago

The execution time is too short; there is additional I/O overhead on top of it.

For a better benchmark, use longer audio/video so the overhead is a smaller share of the total time.
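A minimal sketch of such a benchmark, reporting a real-time factor instead of raw seconds (the model name, file path, and settings are placeholders; note that faster-whisper's `transcribe` returns a generator, so the segments must be consumed before stopping the timer):

```python
# Benchmark sketch: time one long file, report audio seconds per wall second.
import time

def real_time_factor(audio_seconds, wall_seconds):
    """RTF > 1 means faster than real time; higher is better."""
    return audio_seconds / wall_seconds

# Usage (requires faster-whisper and a CUDA GPU):
# from faster_whisper import WhisperModel
# model = WhisperModel("medium.en", device="cuda", compute_type="int8_float16")
# start = time.time()
# segments, info = model.transcribe("long_audio.wav", language="en")
# text = " ".join(s.text for s in segments)  # consume the generator
# wall = time.time() - start
# print(f"RTF: {real_time_factor(info.duration, wall):.1f}x")
```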

James-Shared-Studios commented 1 month ago

> The execution time is too short; there is additional I/O overhead on top of it.
>
> For a better benchmark, use longer audio/video so the overhead is a smaller share of the total time.

That makes sense. I will try a longer audio file and see if it improves the results. Thank you so much for your help.

Napuh commented 4 weeks ago

What's the conclusion? @James-Shared-Studios