The space is moving quite quickly, and at this point I'd actually recommend loading your dataset into the superbooga extension and checking whether that matches your use case. I actually get really good results using that with my unrealdocs.txt, and I can use it with any model, which is a nice bonus. Otherwise you may need to tweak the layout of your dataset and play around with the cutoff length. Also, if you're using a 4bit model, make sure you're following all the best practices for 4bit LoRA training.
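For reference, here is a minimal sketch of what a 4bit LoRA (QLoRA-style) setup looks like with the underlying Hugging Face libraries (transformers + peft + bitsandbytes), rather than the webui's training tab. The model name, rank, and alpha below are placeholders, not recommendations from this thread:

```python
# Minimal QLoRA-style sketch using transformers + peft + bitsandbytes.
# Model name and hyperparameters are placeholders; adjust for your setup.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 quantization, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",               # example model; use whichever you train
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # standard prep step for 4bit training

lora = LoraConfig(
    r=256,                  # LoRA rank; the discussion below is about pushing this higher
    lora_alpha=512,         # often set to roughly 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```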
Thanks, mate!
In fact, I did use both langchain + openai and superbooga, without success. When you have similar pieces of information in the vector store, everything depends on the quality of the embeddings. In my case, the wrong chunks are put in the context most of the time, and I get responses like "IDK what product X is, although I can share information regarding products Y and Z".
I'll push the rank in the LoRA training higher and see what happens. The dataset contains a whole bunch of details, so maybe it needs an extremely high rank. Otherwise I'll have to experiment more with other embedding types. There's not a lot of information on that topic (with comparisons), so it's going to be a real pain.
Did you end up getting good results when you pushed rank high enough?
Vicuna has been the best model for this, and llama2 is even better; I've experimented with various 7B and 13B models from TheBloke.
If you are utilizing Superbooga, here's what has worked for me:

1. Add an ending code: after each paragraph or sentence in your text file, add an ending code like `</S>`. You can do this easily by opening the text file in a code editor and using regex: find the sentence endings with Ctrl+F and replace all with your special code (see the sketch after this list).
2. Chunk the content: instruct Superbooga to chunk the content at each ending code.
3. Load your model: if you haven't already, load your model.
4. Match the template: ensure that the instruct template matches what the model was trained on. For Vicuna 1.5, use the Vicuna-v1.1 template.
5. Check your parameters: the simple-1 preset is adequate, but I recommend lowering the temperature slider to 0.4, extending max_new_tokens, setting auto_max_new_tokens = true, and setting "Truncate the prompt up to this length" to the maximum the model can handle (I use 13k as the max). These are critical.
6. Test in instruct mode: make sure you use INSTRUCT only.
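A hypothetical one-off script for step 1, if you'd rather do it in Python than in an editor. The file names are placeholders, and the pattern is a rough sentence-boundary heuristic, not a full sentence splitter:

```python
# Append an ending code like </S> after each sentence so Superbooga
# can chunk on it. File names are placeholders.
import re

with open("your_data.txt", encoding="utf-8") as f:
    text = f.read()

# After '.', '!' or '?' followed by whitespace (or end of file), insert </S>.
chunked = re.sub(r"([.!?])(\s+|$)", r"\1</S>\2", text)

with open("your_data_chunked.txt", "w", encoding="utf-8") as f:
    f.write(chunked)
```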
If you want to elevate your efforts and don't mind some coding, you can enable the OpenAI extension and use its embeddings endpoint to convert your data. Check the text generation webui readme to read more about the extension. I found it easiest to chunk my data myself, run it through the embeddings API, and save the output embeddings in JSON along with the text. Later, when a user asks a question, I embed it, compare it against my JSON vector database using cosine similarity, return the top 3 text results, and feed them to the model with the system prompt (the task, and the results) and the user prompt. This method has proven highly effective for me on closed-domain data that the model hasn't seen before. I've spent considerable time perfecting the system prompt to ensure that it's both specific and to the point, unlike ChatGPT. Spend a little more time with the prompt. Use ChatGPT to help you.
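A minimal sketch of that flow, assuming an OpenAI-compatible /v1/embeddings endpoint such as the one the openai extension serves. The URL, the model field, and the example chunks are placeholders for your own setup:

```python
# Sketch of the JSON "vector database" flow described above.
# Assumes an OpenAI-compatible /v1/embeddings endpoint; adjust API below.
import json
import math
import requests

API = "http://127.0.0.1:5000/v1"  # placeholder; point at your server

def embed(text: str) -> list[float]:
    # Some servers ignore or don't require the "model" field; included for
    # OpenAI compatibility.
    r = requests.post(f"{API}/embeddings",
                      json={"input": text, "model": "text-embedding-ada-002"})
    r.raise_for_status()
    return r.json()["data"][0]["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# 1) Build: chunk your data yourself, embed each chunk, store text + vector.
chunks = ["Product X specification: ...", "Product Y specification: ..."]
db = [{"text": c, "embedding": embed(c)} for c in chunks]
with open("vectors.json", "w", encoding="utf-8") as f:
    json.dump(db, f)

# 2) Query: embed the question, rank by cosine similarity, take the top 3.
question = "What are the dimensions of product X?"
q = embed(question)
top3 = sorted(db, key=lambda row: cosine(q, row["embedding"]), reverse=True)[:3]
context = "\n\n".join(row["text"] for row in top3)
# Feed `context` plus your system prompt and the user question to the model.
```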
Isn't Superbooga just vector-DB Q&A? The data I'm working with is too large to merely chunk and vectorize; it demands fine-tuning.
Can you show me a sample? Maybe I can help you.
I have a large number of earnings calls that I want the model to be able to extract conclusions from without constantly having to do retrieval augmentation, especially when people ask questions that require something on the order of 15+ documents.
The thing is that every use case is a little different, and the default approach doesn't always work. At the end of the day, it's better to have structured data. With context lengths reaching 32k (e.g. lmsys's Vicuna based on llama2), you can use the LLM itself to structure it for you.
Fine-tuning is not what you need unless you have your data in question-answer form as a dataset. You can build one by collecting queries and responses from your model, as in the sketch below.
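For illustration, here is a hypothetical way to turn collected query/response pairs into alpaca-style JSON, which is the instruction/input/output question-answer form the webui's training tab understands. The pairs and file name are placeholders:

```python
# Hypothetical sketch: save collected query/response pairs in alpaca-style
# JSON (instruction/input/output), the question-answer form referred to above.
import json

pairs = [
    {"query": "What is the specification of product X?",
     "response": "Product X specification: ..."},
    # ... more pairs collected from your model or your users
]

dataset = [
    {"instruction": p["query"], "input": "", "output": p["response"]}
    for p in pairs
]

with open("my_qa_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```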
To use Superbooga with your corpus you'll need to tweak the prompt Superbooga uses. Go to the Superbooga extension folder and edit the .py file that contains the prompt telling the model what to do with the results. Think of this prompt like a system prompt.
Also, reloading data files whenever you reload a model is not ideal in Superbooga, so you might want to start thinking about creating your own vector database.
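A minimal sketch of a persistent store using chromadb (the library Superbooga itself builds on, as I understand it), so the database survives model reloads. The path, collection name, and documents are placeholders:

```python
# Minimal persistent vector DB sketch with chromadb. Path, collection name,
# and documents are placeholders.
import chromadb

client = chromadb.PersistentClient(path="./my_vector_db")  # persists to disk
collection = client.get_or_create_collection("earnings_calls")

# Ingest once; chroma embeds with its default embedding function unless you
# pass your own.
collection.add(
    ids=["doc1", "doc2"],
    documents=["Q2 earnings call transcript ...", "Q3 earnings call transcript ..."],
)

# Later, retrieve the closest chunks for a question.
results = collection.query(
    query_texts=["What did management say about margins?"],
    n_results=3,
)
print(results["documents"][0])
```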
If all of this is above your knowledge, give Claude.ai a look. It's free, has a 100k context length, and in my opinion it's a tad better than ChatGPT-4.
> Vicuna has been the best model for this, and llama2 is even better; I've experimented with various 7B and 13B models from TheBloke.
Thanks for your hints. I am only starting with all this and downloaded the "wrong" Vicuna models, I assume.
Could you share the exact model name(s) and which loader you use? Thanks! For context: I test with an RTX 2060 Super with 8GB VRAM, so I have to stay with rather small models, I assume.
Hello!
Thanks for sharing your code! I'm struggling a bit with a dataset containing hundreds of product specifications, including attributes, descriptions, etc. There are some similar products in the dataset, e.g. different sizes of the same product, although the quality of the dataset is rather high. But I'm getting wrong responses from the model (Wizard-Vicuna GPTQ 4bit). For instance, I ask about product X and get the description of another product. I even redesigned the dataset so that it contains the product name in every piece of information; for example, the specification section says "Product X specification:". This way I made sure the model would be aware of which product each specification belongs to. It didn't help at all.
I'm a bit disappointed. I have tried training with both LoRA alpha = 64 and 256. Similar outcome: rubbish. Previously I've done some training on other datasets like sensmaking.json (which worked properly) and alpaca_pl.json (to train the Polish language; it resulted in some dual-language responses, which was somewhat disturbing), but this time I have a high-quality database/documentation and I'm getting really poor results.
Maybe you have some observations or ideas on how I could train on such a set? Thanks!