Const-me / Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Mozilla Public License 2.0

Multiple GPUs of Same Name #72

Open adamreed90 opened 1 year ago

adamreed90 commented 1 year ago

I'm doing some testing with 2 x RTX 3070s, loaded via `using iModel model = Library.loadModel(cla.model);`, and both GPUs show up with the same name. I think it would be helpful to support an integer-based index for selecting the GPU in cases like this. 👍

adamreed90 commented 1 year ago

I modified selectAdapter to take either an index or an adapter name, and this worked out great!

(Full Disclosure, I used ChatGPT for this ... :( )

listGPUs.cpp:

```cpp
CComPtr<IDXGIAdapter1> selectAdapter(const std::wstring& requestedName)
{
    if (requestedName.empty())
        return nullptr;

    CComPtr<IDXGIFactory1> dxgi;
    HRESULT hr = createFactory(dxgi);
    if (FAILED(hr))
    {
        logWarningHr(hr, u8"CreateDXGIFactory1 failed");
        return nullptr;
    }

    std::wstring name;
    UINT index = UINT_MAX;

    // Check whether the requested name is a number, i.e. an adapter index.
    // std::stoi throws std::invalid_argument for non-numeric strings,
    // and std::out_of_range for numbers which don't fit in an int.
    try
    {
        size_t pos = 0;
        const int parsed = std::stoi(requestedName, &pos);
        // Only treat the string as an index when it parsed completely
        // and the value is non-negative
        if (pos == requestedName.length() && parsed >= 0)
            index = (UINT)parsed;
    }
    catch (const std::invalid_argument&) {}
    catch (const std::out_of_range&) {}

    for (UINT i = 0; true; i++)
    {
        CComPtr<IDXGIAdapter1> adapter;
        hr = dxgi->EnumAdapters1(i, &adapter);
        if (hr == DXGI_ERROR_NOT_FOUND)
        {
            // Enumerated all adapters without finding a match
            logWarning16(L"Requested GPU not found: \"%s\"", requestedName.c_str());
            return nullptr;
        }
        if (FAILED(hr))
        {
            logErrorHr(hr, u8"IDXGIFactory1.EnumAdapters1 failed");
            return nullptr;
        }

        // Index-based selection: return the adapter at the requested position
        if (index != UINT_MAX)
        {
            if (index == i)
                return adapter;
            continue;
        }

        // Name-based selection: compare against the adapter's description
        DXGI_ADAPTER_DESC1 desc;
        hr = adapter->GetDesc1(&desc);
        if (FAILED(hr))
        {
            logErrorHr(hr, u8"IDXGIAdapter1.GetDesc1 failed");
            return nullptr;
        }
        setName(name, desc);
        if (name == requestedName)
            return adapter;
    }
}
```
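With this change the same string parameter accepts either form. For example (hypothetical call sites; the index follows DXGI enumeration order):

```cpp
// Select the second GPU by index; both RTX 3070s report the same name,
// so "NVIDIA GeForce RTX 3070" alone cannot distinguish between them
CComPtr<IDXGIAdapter1> second = selectAdapter(L"1");

// Name-based selection still works when the name is unambiguous
CComPtr<IDXGIAdapter1> byName = selectAdapter(L"NVIDIA GeForce RTX 3070");
```
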
adamreed90 commented 1 year ago

Using these changes I was able to fit 3 ggml-medium.en.bin models into 2 x RTX 3070s, and handle 6 simultaneous transcriptions with an ASP.NET Core API I built.

Thank you so much @Const-me for your work on this project, it's quite impressive!

maxaki commented 1 year ago

> Using these changes I was able to fit 3 ggml-medium.en.bin models into 2 x RTX 3070s, and handle 6 simultaneous transcriptions with an ASP.NET Core API I built.
>
> Thank you so much @Const-me for your work on this project, it's quite impressive!

I implemented something similar, but with 4 x RTX 3070s. However, it wasn't stable at all under the same OS: even though the contexts are assigned to different Direct3D devices, they somehow share some resources anyway. If you'd like to squeeze more performance out of your GPUs, I recommend running them under two virtual machines. We're using four and can run the GPUs at 100% load, with minor throttling due to temperature.

adamreed90 commented 1 year ago

> Using these changes I was able to fit 3 ggml-medium.en.bin models into 2 x RTX 3070s, and handle 6 simultaneous transcriptions with an ASP.NET Core API I built. Thank you so much @Const-me for your work on this project, it's quite impressive!
>
> I implemented something similar, but with 4 x RTX 3070s. However, it wasn't stable at all under the same OS: even though the contexts are assigned to different Direct3D devices, they somehow share some resources anyway. If you'd like to squeeze more performance out of your GPUs, I recommend running them under two virtual machines. We're using four and can run the GPUs at 100% load, with minor throttling due to temperature.

I noticed something strangely similar: the first GPU would use 5.9 GB/8 GB of VRAM and the second only 2.9 GB/8 GB, yet both would handle 3 simultaneous transcriptions.

Virtualization unfortunately isn't an option with the intended hardware setup I have available.

maxaki commented 1 year ago

> Using these changes I was able to fit 3 ggml-medium.en.bin models into 2 x RTX 3070s, and handle 6 simultaneous transcriptions with an ASP.NET Core API I built. Thank you so much @Const-me for your work on this project, it's quite impressive!
>
> I implemented something similar, but with 4 x RTX 3070s. However, it wasn't stable at all under the same OS: even though the contexts are assigned to different Direct3D devices, they somehow share some resources anyway. If you'd like to squeeze more performance out of your GPUs, I recommend running them under two virtual machines. We're using four and can run the GPUs at 100% load, with minor throttling due to temperature.
>
> I noticed something strangely similar: the first GPU would use 5.9 GB/8 GB of VRAM and the second only 2.9 GB/8 GB, yet both would handle 3 simultaneous transcriptions.
>
> Virtualization unfortunately isn't an option with the intended hardware setup I have available.

It shouldn't be an issue with Hyper-V PCI passthrough, installing the NVIDIA drivers natively in the VM. The GeForce series isn't officially supported for passthrough by Microsoft and NVIDIA, but there are easy workarounds.

adamreed90 commented 1 year ago

> Using these changes I was able to fit 3 ggml-medium.en.bin models into 2 x RTX 3070s, and handle 6 simultaneous transcriptions with an ASP.NET Core API I built. Thank you so much @Const-me for your work on this project, it's quite impressive!
>
> I implemented something similar, but with 4 x RTX 3070s. However, it wasn't stable at all under the same OS: even though the contexts are assigned to different Direct3D devices, they somehow share some resources anyway. If you'd like to squeeze more performance out of your GPUs, I recommend running them under two virtual machines. We're using four and can run the GPUs at 100% load, with minor throttling due to temperature.
>
> I noticed something strangely similar: the first GPU would use 5.9 GB/8 GB of VRAM and the second only 2.9 GB/8 GB, yet both would handle 3 simultaneous transcriptions. Virtualization unfortunately isn't an option with the intended hardware setup I have available.
>
> It shouldn't be an issue with Hyper-V PCI passthrough, installing the NVIDIA drivers natively in the VM. The GeForce series isn't officially supported for passthrough by Microsoft and NVIDIA, but there are easy workarounds.

Unfortunately I'm using a special-purpose SBC with very limited resources and capabilities, not a standard PC; it wouldn't handle multiple Windows VMs. My end goal is to get this working on Linux in containers.

Const-me commented 1 year ago

@adamreed90 I like the idea of an integer index in that string: it's simple, and it works. It will be fixed in the next version; in the meantime, update from master and build.

One more thing: if you create multiple contexts to run on the same GPU, try the clone() workflow: https://github.com/Const-me/Whisper/issues/49#issuecomment-1474915688 It should help with VRAM use, because it causes the model's tensors to be shared instead of copied.
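For reference, a minimal sketch of that workflow; the interface and method names below (iModel, iContext, loadModel, createContext, clone) are assumptions based on this thread and the linked issue, not verified against the library's actual headers:

```cpp
// Hypothetical sketch of the clone() workflow described above.
// iModel, iContext, loadModel, createContext and clone are assumed names;
// check the library's headers for the real signatures.

// Load the model once; its tensors are uploaded to VRAM a single time
iModel* model = nullptr;
loadModel( L"ggml-medium.en.bin", &model );

// Create the first context on the GPU
iContext* ctx1 = nullptr;
model->createContext( &ctx1 );

// Instead of loading the model a second time for another context,
// clone the first one: the clone shares the model's tensors, so VRAM
// grows only by the per-context state, not by a full copy of the model
iContext* ctx2 = nullptr;
ctx1->clone( &ctx2 );
```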

maxaki commented 1 year ago

> @adamreed90 I like the idea of an integer index in that string: it's simple, and it works. It will be fixed in the next version; in the meantime, update from master and build.
>
> One more thing: if you create multiple contexts to run on the same GPU, try the clone() workflow #49 (comment) It should help with VRAM use, because it causes the model's tensors to be shared instead of copied.

I did some tests with that on an RTX 3070 (8 GB VRAM): the combined relative speed stayed the same whether I ran 1 or 2 instances, and per-instance performance just got cut in half when cloning. I've tried most things; currently the only speed increase I can find is combining multiple audio buffers into one, rather than repeating runFull for each file.
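To illustrate that batching idea, a sketch assuming the audio is already decoded to mono float PCM; the helper below is hypothetical, and runFull refers to the transcription entry point mentioned above:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical helper: concatenate several decoded mono PCM buffers into one,
// recording where each source file starts so the combined transcript's
// segment timestamps can be mapped back to the individual files afterwards.
std::vector<float> combineBuffers(
    const std::vector<std::vector<float>>& buffers,
    std::vector<size_t>& startOffsets )
{
    size_t total = 0;
    for( const auto& b : buffers )
        total += b.size();

    std::vector<float> combined;
    combined.reserve( total );
    for( const auto& b : buffers )
    {
        startOffsets.push_back( combined.size() );
        combined.insert( combined.end(), b.begin(), b.end() );
    }
    // The combined buffer is then transcribed with a single runFull call,
    // instead of one call per file
    return combined;
}
```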

adamreed90 commented 1 year ago

> @adamreed90 I like the idea of an integer index in that string: it's simple, and it works. It will be fixed in the next version; in the meantime, update from master and build.
>
> One more thing: if you create multiple contexts to run on the same GPU, try the clone() workflow #49 (comment) It should help with VRAM use, because it causes the model's tensors to be shared instead of copied.

I've been having issues getting this to work via .NET; I'll try a bit more, then post back with the specific errors.

adamreed90 commented 1 year ago

> @adamreed90 I like the idea of an integer index in that string: it's simple, and it works. It will be fixed in the next version; in the meantime, update from master and build. One more thing: if you create multiple contexts to run on the same GPU, try the clone() workflow #49 (comment) It should help with VRAM use, because it causes the model's tensors to be shared instead of copied.
>
> I did some tests with that on an RTX 3070 (8 GB VRAM): the combined relative speed stayed the same whether I ran 1 or 2 instances, and per-instance performance just got cut in half when cloning. I've tried most things; currently the only speed increase I can find is combining multiple audio buffers into one, rather than repeating runFull for each file.

@maxaki Did you manage to get any improved performance out of concurrent transcriptions?