FlowiseAI / Flowise

Drag & drop UI to build your customized LLM flow
https://flowiseai.com
Apache License 2.0

"exceeds maximum batch" size when storing vector db #1994

Closed TivonJJ closed 2 months ago

TivonJJ commented 8 months ago

Describe the bug
I'm using the Cheerio Web Scraper to fetch a website's content, and I get an "exceeds maximum batch" size error from Chroma when storing it into the vector store.

I looked into it and found that vector stores all have batch limits (Chroma, Pinecone, ...). I tried reducing the number of chunks by increasing the size of each chunk, but then I got a new error from the vector DB: "413 Request Entity Too Large".
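Side note: the exact limit a Chroma server enforces can be queried directly. A minimal sketch, assuming a default local deployment on http://localhost:8000 (the pre-flight endpoint is what Chroma's own clients use to discover max_batch_size):

    // Minimal sketch: ask a Chroma server for its batch limit.
    // Assumes a default local deployment; adjust the URL for your setup.
    const res = await fetch('http://localhost:8000/api/v1/pre-flight-checks')
    const { max_batch_size } = await res.json()
    console.log(`Chroma max batch size: ${max_batch_size}`)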

Then I checked the code in packages/components/nodes/vectorstores/Chroma/Chroma.ts:

    const flattenDocs = docs && docs.length ? flatten(docs) : []
    const finalDocs = []
    for (let i = 0; i < flattenDocs.length; i += 1) {
        if (flattenDocs[i] && flattenDocs[i].pageContent) {
            finalDocs.push(new Document(flattenDocs[i]))
        }
    }

    const obj: {
        collectionName: string
        url?: string
        chromaApiKey?: string
    } = { collectionName }
    if (chromaURL) obj.url = chromaURL
    if (chromaApiKey) obj.chromaApiKey = chromaApiKey

    try {
        await ChromaExtended.fromDocuments(finalDocs, embeddings, obj)
    } catch (e) {
        throw new Error(e)
    }
I found that finalDocs is too big, since a website's knowledge base can be very large. Would it be possible to split it into multiple batches here and store them in a loop? Or is there another way to solve this problem?

Screenshots

The flow: [screenshot]

Chroma logs: [screenshot]
HenryHengZJ commented 8 months ago

But finalDocs already contains the split chunks; it shouldn't be the culprit causing the 413 Request Entity Too Large

TivonJJ commented 8 months ago

> But finalDocs already contains the split chunks; it shouldn't be the culprit causing the 413 Request Entity Too Large

But the number of chunks is still a problem. For example, Chroma has a maximum batch size of 166, while finalDocs here has a length of 419 (I printed it).
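At that limit, storing the 419 documents takes at least ceil(419 / 166) = 3 separate requests, so some form of batching is unavoidable.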

I changed the code as follows and the problem was solved, but I don't know whether this fix has any downsides.

packages/components/nodes/vectorstores/Chroma/Chroma.ts > upsert


    // try {
    //     await ChromaExtended.fromDocuments(finalDocs, embeddings, obj)
    // } catch (e) {
    //     throw new Error(e)
    // }

    // Modified to the following:

    // Split an array into batches of `size`
    function chunk(array: any[], size: number) {
        const chunkedArray: any[] = []
        for (let i = 0; i < array.length; i += size) {
            chunkedArray.push(array.slice(i, i + size))
        }
        return chunkedArray
    }

    // Upsert the documents batch by batch to stay under the limit
    const chunkedArray = chunk(finalDocs, 100)
    for (const arr of chunkedArray) {
        try {
            await ChromaExtended.fromDocuments(arr, embeddings, obj)
        } catch (e) {
            throw new Error(e)
        }
    }

Storing in a loop like this avoids both the batch-size limit and oversized requests, since the documents are uploaded in smaller pieces.
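An alternative sketch, assuming LangChain's Chroma vector store (the import path varies by LangChain version, and the batch size of 100 is an arbitrary value below the server limit): instead of calling fromDocuments() once per batch, create the store once and append batches with addDocuments():

    // Hedged alternative: build the store once, then add batches to it.
    // finalDocs, embeddings, collectionName and chromaURL are the variables
    // from the snippet above; the import path may differ in your version.
    import { Chroma } from 'langchain/vectorstores/chroma'

    const store = new Chroma(embeddings, { collectionName, url: chromaURL })
    for (let i = 0; i < finalDocs.length; i += 100) {
        await store.addDocuments(finalDocs.slice(i, i + 100))
    }

This avoids re-resolving the collection on every iteration, though note that Flowise's ChromaExtended wrapper also forwards chromaApiKey, which this plain Chroma sketch does not handle.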