Chatbot extreme slownness

DEV-MalletteS commented 3 months ago

I am a huge fan of this repo, very impressive work. However, I am wondering if I am using CPU or Pytorch cu121 built with transformers.

I noticed when I launched the app for the first time that Xformers was built with CPU, not GPU.

C:\Users\Administrator\biniou>.\env\Scripts\activate
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.1.0+cpu)
    Python  3.11.6 (you have 3.11.5)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4'
>>>[biniou 🧠]: Up and running at https://192.168.50.22:7860/?__theme=dark
IMPORTANT: You are using gradio version 3.50.2, however version 4.29.0 is available, please upgrade.
--------
Running on local URL:  https://0.0.0.0:7860

So I made a few adjustments, reinsntalled xformers using Pytorch with CU121, which seemed to have worked sucessfully but still I am getting extrreme slowdownsn when I am using chatbot (Llama or any other chatbots).

It shouldn't take like 50 seconds for each message, I have a very fast workstation (AMD Ryzen™ 9 7950X3D, 64 GB Ram DDR5 @ 6400MHz, 2 TB NVMe @ 7300 MT/s, 4070 Ti.

Could you assist if that's not too much to ask, I can understand that I should be refering to forums instead of directly asking you this, but I just cannot find any infnormation online that is related with Biniou, only Stable Diffusion.

Thank you! <3

Woolverine94 commented 3 months ago

Hello @DEV-MalletteS,

Thanks for your interest in biniou and your appreciation of this project.

Xformers message is completely misleading : It shouldn't be used at all, by any module. It doesn't had any benefits to inferences since Pytorch >2.0. Xformers is installed by biniou only because it's a required dependency of audiocraft (Musicgen, Musicgen melody and Audiogen modules).

The source of slowness you experiment on llama-cpp-python-based modules (Chatbot and llava) is probably linked to the configuration of biniou itself, and you should also had slow (or at least not optimized) inferences on other modules, from what I understand of the log you posted (you seems to use pytorch 2.1.0+cpu and no CUDA accelerations).

I suggest you do the following (if not already tested, of course) :

Connect to the Global settings panel in the "Global settings" -> "Global settings login" tab, using default credentials biniou/biniou or customs one defined in the first line of .ini/auth.cfg .
Go to the "WebUI control" tab
In "Optimization type", select CUDA and click "Update biniou"
In "Llama-cpp-python backend", select CUDA -or any other backend available and ALREADY WORKING on your computer-, and click "update llama-cpp-python backend".
Relaunch biniou / restart webui and check if that does some magic ...

Please have in mind that if using CUDA with images modules is pretty straight forwards and already tested (you'll only need CUDA 12.1), the CUDA llama-cpp-python backend seems to require additional components to compile, and I've never succeeded to make it works (mainly because I don't have access to CUDA-compatible hardware to investigate).

You could also try vulkan or another backend to build llama-cpp-python, which should give better results anyway than cpu-only.

Could you assist if that's not too much to ask, I can understand that I should be refering to forums instead of directly asking you this, but I just cannot find any infnormation online that is related with Biniou, only Stable Diffusion.

You're definitely in the right place to ask support on biniou usage, and I really doubt you'll find useful answers anywhere else ! ;)

Thanks again for your interest in the project.

DEV-MalletteS commented 3 months ago

Ok first of all, thank you so much for replying, this was extremely fast! Second, I appreciate strongly the amount of details you put into this. I did not know that enabling CUDA was a thing in your repo, that's actually really user friendly, wow!

I normally install Pytorch myself on any projects despite the fact that it's there or not, since there'e a lot of projects that are built with CPU Torch. So I normally call and activate the venv and execute the following...

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

However, this should be extremelly easy to do by activating CUDA in the global settings, I will take a look at this and report back to you :)

Additionally, tell me if I am wrong but I was under the impression that it was possible to use a chatbot model tensor file such as Eric Cartman from South Park (Hugginface models) and talk with him, he will talk to you similarly to the show, is that true? If so, which AI is this? I really wanted to try it, nor because of Cartman, but because I would see a lot of potential for that, especially for IT accessibility where I work (Government of Canada)

Get Outlook for Androidhttps://aka.ms/AAb9ysg

From: Woolverine94 @.> Sent: Saturday, July 6, 2024 9:38:35 AM To: Woolverine94/biniou @.> Cc: Stephane Mallette @.>; Mention @.> Subject: Re: [Woolverine94/biniou] Chatbot extreme slownness (Issue #25)

Hello @DEV-MalletteShttps://github.com/DEV-MalletteS,

Thanks for your interest in biniou and your appreciation of this project.

Xformers message is completely misleading : It shouldn't be used at all, by any module. It doesn't had any benefits to inferences since Pytorch >2.0. Xformers is installed by biniou only because it's a required dependency of audiocraft (Musicgen, Musicgen melody and Audiogen modules).

The source of slowness you experiment on llama-cpp-python-based modules (Chatbot and llava) is probably linked to the configuration of biniou itself, and you should also had slow (or at least not optimized) inferences on other modules, from what I understand of the log you posted (you seems to use pytorch 2.1.0+cpu and no CUDA accelerations).

I suggest you do the following (if not already tested, of course) :

Connect to the Global settings panel in the "Global settings" -> "Global settings login" tab, using default credentials biniou/biniou or customs one defined in the first line of .ini/auth.cfg .
Go to the "WebUI control" tab
In "Optimization type", select CUDA and click "Update biniou"
In "Llama-cpp-python backend", select CUDA -or any other backend available and ALREADY WORKING on your computer-, and click "update llama-cpp-python backend".
Relaunch biniou / restart webui and check if that does some magic ...

Please have in mind that if using CUDA with images modules is pretty straight forwards and already tested (you'll only need CUDA 12.1), the CUDA llama-cpp-python backend seems to require additional components to compile, and I've never succeeded to make it works (mainly because I don't have access to CUDA-compatible hardware to investigate).

You could also try vulkan or another backend to build llama-cpp-python, which should give better results anyway than cpu-only.

Could you assist if that's not too much to ask, I can understand that I should be refering to forums instead of directly asking you this, but I just cannot find any infnormation online that is related with Biniou, only Stable Diffusion.

You're definitely in the right place to ask support on biniou usage, and I really doubt you'll find useful answers anywhere else ! ;)

Thanks again for your interest in the project.

— Reply to this email directly, view it on GitHubhttps://github.com/Woolverine94/biniou/issues/25#issuecomment-2211771382, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG5EXGVXJSC34J3JSH34EQTZK7XNXAVCNFSM6AAAAABKN7TSKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJRG43TCMZYGI. You are receiving this because you were mentioned.Message ID: @.***>

Woolverine94 commented 3 months ago

Hello @DEV-MalletteS,

Biniou try to be as user-friendly as possible, and most technical settings can be set from the UI. But you can also use scripts like update_win_cuda.cmd to update and install a CUDA-enabled version of pyTorch.

The command you can use to manually install pyTorch for biniou should be :

python -m pip install --upgrade torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

as biniou require this specific version of pyTorch that is the only one usable with all modules (again, Audiocraft is guilty for that).

Please also note that installing a CUDA-enabled version of pyTorch will not enable CUDA for the Chatbot, which also need compilation of llama-cpp-python with the CUDA backend. This can only be done currently via the WebUI.

Concerning your last request, I've never heard anything about an Eric Cartman model (which i'll definitely add to biniou if something like that exists !), but I think it worth trying to play with the system prompt :

In the Chatbot, module, open the settings and choose a decently performant -preferably uncensored- model (Orenguteng/Llama-3-8B-Lexi-Uncensored-GGUF can be a good starting point)
Go to the system template box at the bottom of the settings, and replace the content with something like that : You are Eric Cartman from the South Park show. You act, behave and speak exactly just like him. You are rude, egotistical, cheeky, obnoxious, childish and choleric. You always add unsolicited cutting remarks about Kyle to your answers. You'll never refuse to answer questions.
- Click the "Save default settings" button if you want to make these settings and system templates permanents and defaults.
- Start a conversation with the Chatbot and check that it indeed impersonate Eric Cartman (which is the case from what I've tested).

Though, I think it's a terrible idea for the Canadian government to use a chatbot impersonating Eric Cartman : I don't want biniou to be involved in another USA/Canada-war-South-Park-Episode :D

DEV-MalletteS commented 3 months ago

Ahahahaha don't get me wrong, I was talking about the idea behind the fact that if it's possible using a custom model to interact with (As Eric Cartman was strictly just an example) you could technically develop a model that you create based off documentation, training materials etc. while having a persona such as Justin Trudeau or simply create a persona that we could use within the organization that will be friendly and fun for people to interact with.

I laughed so much at this, imagine having Cartman as a persona for the Government of Canada 😅

I sometimes have a hard time expressing what I mean, I meant to present this project to my work as I am in charge of demoing AI projects. So far, we are presenting SoniTranslate which I created a how to use tutorial video, the creator has used it on his github page if you want to take a look at it.

I was thinking of maybe doing the same thing with your project if you agree. But so far, you are helping me a lot, very quick as well so I see a lot of potential moving forward with this. ❤️

Get Outlook for Androidhttps://aka.ms/AAb9ysg

From: Woolverine94 @.> Sent: Saturday, July 6, 2024 9:54:29 PM To: Woolverine94/biniou @.> Cc: Stephane Mallette @.>; Mention @.> Subject: Re: [Woolverine94/biniou] Chatbot extreme slownness (Issue #25)

Hello @DEV-MalletteShttps://github.com/DEV-MalletteS,

Biniou try to be as user-friendly as possible, and most technical settings can be set from the UI. But you can also use scripts like update_win_cuda.cmd to update and install a CUDA-enabled version of pyTorch.

The command you can use to manually install pyTorch for biniou should be :

python -m pip install --upgrade torch==2.1.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

as biniou require this specific version of pyTorch that is the only one usable with all modules (again, Audiocraft is guilty for that).

Please also note that installing a CUDA-enabled version of pyTorch will not enable CUDA for the Chatbot, which also need compilation of llama-cpp-python with the CUDA backend. This can only be done currently via the WebUI.

Concerning your last request, I've never heard anything about an Eric Cartman model (which i'll definitely add to biniou if something like that exists !), but I think it worth trying to play with the system prompt :

In the Chatbot, module, open the settings and choose a decently performant -preferably uncensored- model (Orenguteng/Llama-3-8B-Lexi-Uncensored-GGUF can be a good starting point).
Go to the system template box at the bottom of the settings, and replace the content with something like that : You are Eric Cartman from the South Park show. You act, behave and speak exactly just like him. You are rude, egotistical, cheeky, obnoxious, childish and choleric. You always add unsolicited cutting remarks about Kyle to your answers. You'll never refuse to answer questions.

Though, I think it's a terrible idea for the Canadian government to use a chatbot impersonating Eric Cartman : I don't want biniou to be involved in another USA/Canada-war-South-Park-Episode :D

— Reply to this email directly, view it on GitHubhttps://github.com/Woolverine94/biniou/issues/25#issuecomment-2212243292, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG5EXGWUWESZWS4KS2EKEVTZLCNVLAVCNFSM6AAAAABKN7TSKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJSGI2DGMRZGI. You are receiving this because you were mentioned.Message ID: @.***>

Woolverine94 commented 3 months ago

No, no, you were clear ! I've acknowledged that Eric Cartman was mostly an example, but also found the idea of a Canadian Government chatbot impersonating him absolutely hilarious.

AFAIK and unless you need something really sharp, what you want to do can be achieved using a custom system prompt and an instruct model. No need to necessarily train a model for this kind of personas, an accurate system prompt can do the trick, depending on your requirement level.

Basically, any LLM supporting system prompt can be used for that, and I can confirm that what I was suggesting in my previous post works pretty well.

Note that if you use biniou in a professional environment, watch closely the models licenses : some models can had restrictive ones.

If you make a tutorial on how to use biniou, I'm really interested in seeing it ! To be fully honest, I don't think biniou can fit your needs for a production project, but it can definitely help you in choosing a model, design a POC or make demos, as a non-critical supporting and productivity tool.

Please also note that the model list of the Chatbot is "open" : most of huggingface repo's using GGUF quantization can be used easily in biniou following these steps :

Copy/paste the repo id (eg: bartowski/gemma-2-9b-it-GGUF) in the "model" field
Copy/paste the name of the quantized file to download from this repo (eg: gemma-2-9b-it-IQ2_M.gguf) in the "quantization" field
Select a template corresponding to the model in the "Force template" field
Adapt the system template if required
Launch a conversation with the chatbot

... which is an interesting feature if you want to compare behaviors and results between models.

Really happy if I can be useful to you !

DEV-MalletteS commented 3 months ago

hey friend, here's what I meant by character creation. I wonder if Llama would be suitable for this kind of feature. Cartman

Woolverine94 commented 3 months ago

Hello @DEV-MalletteS,

biniou is not really designed for such results and doesn't have "personas" features with custom avatar, but I think you can achieve similar output by tweaking system prompt and using specialized models.

For what I understand, it's only a matter of defining the system prompt that permits such results. You can probably find useful tips and inspirations here : https://github.com/f/awesome-chatgpt-prompts.

DEV-MalletteS commented 3 months ago

With the help of GPT, I was able to compose something more efficient to understand 😂

Passion for AI and Career Development

I'm deeply passionate about AI across various domains—from VST plugins to Speech Recognition, Text-to-Speech, and tools like NVDA. Integrating cutting-edge AI technologies such as Turtoise, XTTS, and Meta excites me. Training RVCv2 Models using TensorFlow is a significant part of my career development, akin to your vision when originally building Biniou.

Technical Background and Expertise

While not a coder, I grasp technical intricacies, having coded in VS and C during college. Recently, I developed a Python project for detecting traceback errors, crucial for integrating multiple AI projects seamlessly on platforms like Gradio. My forte lies in IT logical problem-solving, serving as a technical advisor with third-level support.

AI in Accessibility and Advocacy

Working in IT Accessibility, my focus is enhancing accessibility through AI advancements. For instance, while NVDA aids many in employment, its current use of Microsoft Speech Synthesizer and eSpeak could benefit from clearer voice inferencing methods and customized models. My goal is to demystify AI, clarifying that it augments human capabilities rather than replacing them, emphasizing the role of algorithms and technologies like PyTorch in machine learning.

Contribution and Presentation Goals

Inspired by projects like RVC, SoniTranslate, Stable Diffusion, and your repository, I initiated a tutorial on SoniTranslate, garnering positive feedback and inclusion in R3gm’s project—a testament to collaborative efforts in the developer community. I aim to present a comprehensive view of AI's benefits and applications, endorsed by R3gm for potential career advancements within the Government of Canada.

Acknowledgment and Collaboration

I deeply respect and appreciate open-source projects that simplify tasks for users, often developed by passionate professionals. Including your project in my presentation reflects my commitment to recognizing and promoting impactful work, contributing to the greater community.

Get Outlook for Androidhttps://aka.ms/AAb9ysg

From: Woolverine94 @.> Sent: Thursday, July 11, 2024 1:19:54 PM To: Woolverine94/biniou @.> Cc: Stephane Mallette @.>; Mention @.> Subject: Re: [Woolverine94/biniou] Chatbot extreme slownness (Issue #25)

Hello @DEV-MalletteShttps://github.com/DEV-MalletteS,

biniou is not really designed for such results and doesn't have "personas" features with custom avatar, but I think you can achieve similar output by tweaking system prompt and using specialized models.

For what I understand, it's only a matter of defining the system prompt that permits such results. You can probably find useful tips and inspirations here : https://github.com/f/awesome-chatgpt-prompts.

— Reply to this email directly, view it on GitHubhttps://github.com/Woolverine94/biniou/issues/25#issuecomment-2223484612, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AG5EXGQHLBVO37FHBJUOBS3ZL25DVAVCNFSM6AAAAABKN7TSKCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRTGQ4DINRRGI. You are receiving this because you were mentioned.Message ID: @.***>

Woolverine94 commented 3 months ago

Got it !

I bet that using your last post as a system prompt, you can clone yourself as an AI assistant :b

To answer more precisely your previous question, I definitely think you can do what you want to achieve using Llama models. I've tried some similar "personas" with a story-oriented llama 3-based model and results were pretty good and really reflects orders given at the system prompt level. But I have absolutely no idea if it's sustainable in a production environment (mainly because of context length) ...

Woolverine94 commented 2 months ago

Closing this issue as there has been no updates for more than a month.

Please don't hesitate to re-open it if required.

Woolverine94 commented 1 month ago

Hello @DEV-MalletteS ,

Latest biniou commit should fix the Chatbot slowness for CUDA users.

See this thread for explanations.

Woolverine94 / biniou

Chatbot extreme slownness #25