[BUG]: User Query embeddings are being chunked per character when using LM Studio embedding models.

frost19k commented 3 months ago

How are you running AnythingLLM?

AnythingLLM desktop app

What happened?

Document embeddings are chunked properly, but when LM Studio is the embedding provider, user queries are split into individual characters and each character is embedded separately, which produces useless embeddings. The default built-in embedder works properly.

LM Studio Logs

```Bash
[2024-04-29 22:21:19.847] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "D" }
[2024-04-29 22:21:19.855] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.857] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "e" }
[2024-04-29 22:21:19.865] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.866] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "f" }
[2024-04-29 22:21:19.874] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.876] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "i" }
[2024-04-29 22:21:19.884] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.886] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "n" }
[2024-04-29 22:21:19.893] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.895] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "e" }
[2024-04-29 22:21:19.903] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.904] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": " " }
[2024-04-29 22:21:19.912] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.913] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "X" }
[2024-04-29 22:21:19.921] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.923] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "S" }
[2024-04-29 22:21:19.931] [INFO] Returning embeddings (not shown in logs)
[2024-04-29 22:21:19.932] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "S" }
[2024-04-29 22:21:19.940] [INFO] Returning embeddings (not shown in logs)
```
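One request per character is the pattern you would expect if a bare string were handed to code that loops over a list of chunks. The sketch below illustrates that failure mode only; it is not AnythingLLM's actual code. `embed_chunks` and `fake_post` are hypothetical names, `fake_post` records request bodies instead of calling LM Studio, and the query text is inferred from the ten characters visible in the logs above.

```python
from typing import Callable, Iterable, List

# Model name copied from the logs in this issue.
MODEL = "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf"


def embed_chunks(chunks: Iterable[str], post: Callable[[dict], None]) -> int:
    """Send one embedding request per chunk and return the request count."""
    count = 0
    for chunk in chunks:  # iterating over a str yields single characters
        post({"model": MODEL, "input": chunk})
        count += 1
    return count


# Hypothetical stand-in for the HTTP call: it only records request bodies.
sent: List[dict] = []


def fake_post(body: dict) -> None:
    sent.append(body)


# Assumption: the logged characters "D", "e", "f", ... spell this query.
query = "Define XSS"

# Passing the string itself, rather than [query], triggers the behaviour.
n_requests = embed_chunks(query, fake_post)
inputs = [body["input"] for body in sent]
```

Here `n_requests` equals the number of characters in the query, and `inputs` holds one character per request, matching the shape of the logged bodies. A Python type checker would not flag the call either, since a `str` is itself an iterable of `str`.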

Are there known steps to reproduce?

No response

timothycarambat commented 3 months ago

What did you set as Max embedding chunk length when configuring the connection?

frost19k commented 3 months ago

1000

I don't understand. Documents are chunked appropriately; this happens only with the user query.

Docs chunking logs

```bash
[2024-04-30 01:39:11.237] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "$ wfuzz -w wordlist.txt -v –-follow http://example.com?redirect=FUZZ \nFinally, test for vulnerabilities such as XSS and SQL injection by fuzzing \nURL parameters, POST parameters, or other user input locations with com - \nmon payload lists. \nWhen testing for XSS by using Wfuzz, try creating a list of scripts that \nredirect the user to your page, and then turn on the verbose option to \nmonitor for any redirects. Alternatively, you can use Wfuzz content filters to \ncheck for XSS payloads reflected. The --filter flag lets you set a result filter. \nAn especially useful filter is content~ STRING , which returns responses that \ncontain whatever STRING is: \n$ wfuzz -w xss.txt --filter \"content~FUZZ\" http://example.com/get_user?user_id=FUZZ378 Chapter 25 \nFor SQL injection vulnerabilities, try using a premade SQL injection \nwordlist and monitor for anomalies in the response time, response code, \nor response length of each payload. If you use SQL injection payloads that" }
[2024-04-30 01:39:11.264] [INFO] Returning embeddings (not shown in logs)
[2024-04-30 01:39:11.266] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "include time delays, look for long response times. If most payloads return a \ncertain response code but one does not, investigate that response further to \nsee if there’s a SQL injection there. A longer response length might also be \nan indication that you were able to extract data from the database. \nThe following command tests for SQL injection using the wordlist sqli.txt . \nYou can specify POST body data with the -d flag: \n$ wfuzz -w sqli.txt -d \"user_id=FUZZ\" http://example.com/get_user \nMore About Wfuzz \nWfuzz has many more advanced options, filters, and customizations that you \ncan take advantage of. Used to its full potential, Wfuzz can automate the \nmost tedious parts of your workflow and help you find more bugs. For more \ncool Wfuzz tricks, read its documentation at https://wfuzz.readthedocs.io/ . \nFuzzing vs. Static Analysis \nIn Chapter 22 , I discussed the effectiveness of source code review for dis -" }
[2024-04-30 01:39:11.293] [INFO] Returning embeddings (not shown in logs)
[2024-04-30 01:39:11.295] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "covering web vulnerabilities. You might now be wondering: why not just \nperform a static analysis of the code? Why conduct fuzz testing at all? \nStatic code analysis is an invaluable tool for identifying bugs and improper \nprogramming practices that attackers can exploit. However, static analysis has \nits limitations. \nFirst, it evaluates an application in a non-live state. Performing code \nreview on an application won’t let you simulate how the application will \nreact when it’s running live and clients are interacting with it, and it’s very \ndifficult to predict all the possible malicious inputs an attacker can provide. \nStatic code analysis also requires access to the application’s source code. \nWhen you’re doing a black-box test, as in a bug bounty scenario, you probably \nwon’t be able to obtain the source code unless you can leak the application’s \nsource code or identify the open source components the application is using." }
[2024-04-30 01:39:11.319] [INFO] Returning embeddings (not shown in logs)
[2024-04-30 01:39:11.321] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "This makes fuzzing a great way of adding to your testing methodology, since \nyou won’t need the source code to fuzz an application. \nPitfalls of Fuzzing \nOf course, fuzzing isn’t a magic cure-all solution for all bug detection. This \ntechnique has certain limitations, one of which is rate-limiting by the server. \nDuring a remote, black-box engagement, you might not be able to send in \nlarge numbers of payloads to the application without the server detecting \nyour activity, or you hitting some kind of rate limit. This can cause your test - \ning to slow down or the server might ban you from the service.Automatic Vulnerability Discovery Using Fuzzers 379 \nIn a black-box test, it can also be difficult to accurately evaluate the \nimpact of the bug found through fuzzing, since you don’t have access to the \ncode and so are getting a limited sample of the application’s behavior. You’ll \noften need to conduct further manual testing to classify the bug’s validity" }
[2024-04-30 01:39:11.347] [INFO] Returning embeddings (not shown in logs)
[2024-04-30 01:39:11.349] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "and significance. Think of fuzzing as a metal detector: it merely points you \nto the suspicious spots. In the end, you need to inspect more closely to see if \nyou have found something of value. \nAnother limitation involves the classes of bugs that fuzzing can find. \nAlthough fuzzing is good at finding certain basic vulnerabilities like XSS \nand SQL injection, and can sometimes aid in the discovery of new bug \ntypes, it isn’t much help in detecting business logic errors, or bugs that \nrequire multiple steps to exploit. These complex bugs are a big source of \npotential attacks and still need to be teased out manually. While fuzzing \nshould be an essential part of your testing process, it should by no means be \nthe only part of it. \nAdding to Your Automated Testing Toolkit \nAutomated testing tools like fuzzers or scanners can help you discover \nsome bugs, but they often hinder your learning progress if you don’t take" }
[2024-04-30 01:39:11.376] [INFO] Returning embeddings (not shown in logs)
[2024-04-30 01:39:11.378] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "the time to understand how each tool in your testing toolkit works. Thus, \nbefore adding a tool to your workflow, be sure to take time to read the \ntool’s documentation and understand how it works. You should do this for \nall the recon and testing tools you use. \nBesides reading the tool’s documentation, I also recommend reading \nits source code if it’s open source. This can teach you about the methodolo - \ngies of other hackers and provide insight into how the best hackers in the \nfield approach their testing. Finally, by learning how others automate hack - \ning, you’ll begin learning how to write your own tools as well. \nHere’s a challenge for you: read the source code of the tools Sublist3r \n( https://github.com/aboul3la/Sublist3r/ ) and Wfuzz ( https://github.com/xmendez/ \nwfuzz/ ). These are both easy-to-understand tools written in Python. Sublist3r \nis a subdomain enumeration tool, while Wfuzz is a web application fuzzer." }
[2024-04-30 01:39:11.404] [INFO] Returning embeddings (not shown in logs)
[2024-04-30 01:39:11.405] [INFO] Received POST request to /v1/embeddings with body: { "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q8_0.gguf", "input": "How does Sublist3r approach subdomain enumeration? How does Wfuzz \nfuzz web applications? Can you write down their application logic, starting \nfrom the point at which they receive an input target and ending when they \noutput their results? Can you rewrite the functionalities they implement \nusing a different approach? \nOnce you’ve gained a solid understanding of how your tools work, try to \nmodify them to add new features! If you think others would find your feature \nuseful, you could contribute to the open source project: propose that your \nfeature be added to the official version of the tool. \nUnderstanding how your tools and exploits work is the key to becoming \na master hacker. Good luck and happy hacking!" }
[2024-04-30 01:39:11.431] [INFO] Returning embeddings (not shown in logs)
```

Suedzucka commented 3 months ago

Same issue here. The problem is independent of the embedding model chosen in LM Studio (tested with LM Studio's standard embedder, nomic-embed-text-v1.5-GGUF, and with all-MiniLM-L6-v2). The user prompt is sent to the LM Studio embedder one letter at a time, each in its own POST request, rather than as a whole expression.

frost19k commented 3 months ago

Thank you so much for fixing this so quickly. I've tried out the Docker image and it works great. I was wondering, though: how long before this change is reflected in the AppImage?

timothycarambat commented 3 months ago

@frost19k Thank you for pointing out this error!

As for desktop, we typically try to push at least one patch version per week, usually on Wed/Thurs. Any updates merged to master since the preceding patch will appear in the next version.

Propheticus commented 3 months ago

Confirmed working on v1.5.2! The query is now sent to the embedding endpoint as a single sequence.
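The thread does not show the actual patch. As a hedged sketch of one common guard against this class of bug, a caller can normalize its input so that a single query string is always wrapped in a one-item list before any per-chunk loop runs. `as_chunk_list`, `build_requests`, and the model name below are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Union


def as_chunk_list(text_input: Union[str, Sequence[str]]) -> List[str]:
    """Return a list of chunks, wrapping a bare string as a one-item list."""
    if isinstance(text_input, str):
        return [text_input]
    return list(text_input)


def build_requests(text_input: Union[str, Sequence[str]], model: str) -> List[dict]:
    """Build one /v1/embeddings request body per chunk (no network calls)."""
    return [{"model": model, "input": chunk} for chunk in as_chunk_list(text_input)]


# A user query arrives as a plain string; documents arrive as a list of chunks.
requests_for_query = build_requests("Define XSS", "example-embedding-model")
requests_for_docs = build_requests(["chunk one", "chunk two"], "example-embedding-model")
```

With this guard, the query produces a single request body whose `input` is the whole string, while document chunks still produce one request each.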

Mintplex-Labs / anything-llm