Elliott-Chong / chatpdf-yt

https://chatpdf-elliott.vercel.app

Error inserting vectors into pinecone #32

Open joepds opened 10 months ago

joepds commented 10 months ago

Hi, I got an error when inserting vectors into Pinecone. The error output is below. If anyone has had the same problem and fixed it, could you share a solution? Thank you.

inserting vectors into pinecone PineconeBadRequestError: The requested feature 'Namespaces' is not supported by the current index type 'Starter'.
    at mapHttpStatusError (webpack-internal:///(rsc)/./node_modules/@pinecone-database/pinecone/dist/errors/http.js:179:20)
    at eval (webpack-internal:///(rsc)/./node_modules/@pinecone-database/pinecone/dist/errors/handling.js:170:55)
    at step (webpack-internal:///(rsc)/./node_modules/@pinecone-database/pinecone/dist/errors/handling.js:107:23)
    at Object.eval [as next] (webpack-internal:///(rsc)/./node_modules/@pinecone-database/pinecone/dist/errors/handling.js:48:20)
    at fulfilled (webpack-internal:///(rsc)/./node_modules/@pinecone-database/pinecone/dist/errors/handling.js:11:32)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  cause: undefined
}
 ⨯ node_modules\@pinecone-database\pinecone\dist\errors\http.js (179:19) @ mapHttpStatusError
 ⨯ unhandledRejection: PineconeBadRequestError: The requested feature 'Namespaces' is not supported by the current index type 'Starter'.

joepds commented 10 months ago

Is it because the environment is Starter? If so, how do I change it to ap-southeast-asia-1, like Elliott has?


maximBurinskyi commented 10 months ago

You probably have to pay $70. I have the same problem.

joepds commented 10 months ago

Probably. I read in the Pinecone documentation that we can use metadata filtering instead, but I don't know how to implement it in the code.

CodeOfMugiwara commented 10 months ago

Well, I'm facing this issue too, and I've found two ways around it:

  1. Upgrade the current Pinecone plan to the Standard plan and use namespaces from there. (This is costly.)
  2. Use another database similar to Pinecone. Chroma looks like the best option here, and it's free and open source, so I suggest using ChromaDB. The problem is that I don't yet know the exact process for using it or how to write the code for it, but I'm trying to figure that out.

    I'll definitely share the code and the process once I find a working solution.

tzechong94 commented 10 months ago

Hi all, I managed to solve this with "filtering with metadata", as proposed in Pinecone's documentation. What I did:

  1. In the loadS3IntoPinecone function of the pinecone.ts file, add fileKey as one of the metadata fields.
import { Pinecone, PineconeRecord } from "@pinecone-database/pinecone";
import { downloadFromS3 } from "./s3-server";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import md5 from "md5";
import {
  Document,
  RecursiveCharacterTextSplitter,
} from "@pinecone-database/doc-splitter";
import { getEmbeddings } from "./embeddings";
import { convertToAscii } from "./utils";

export const getPineconeClient = () => {
  return new Pinecone({
    environment: process.env.PINECONE_ENVIRONMENT!,
    apiKey: process.env.PINECONE_API_KEY!,
  });
};

type PDFPage = {
  pageContent: string;
  metadata: {
    // PDFLoader supplies only the page number here; fileKey is added later in prepareDocument.
    loc: { pageNumber: number };
  };
};

export async function loadS3IntoPinecone(fileKey: string) {
  console.log("downloading s3 into file system");
  const file_name = await downloadFromS3(fileKey);
  if (!file_name) throw new Error("could not download file from s3");
  const loader = new PDFLoader(file_name);
  console.log(loader, "loader");
  const pages = (await loader.load()) as PDFPage[];
  console.log(pages, "pages");
  const documents = await Promise.all(
    pages.map((page) => prepareDocument(page, fileKey))
  );

  const vectors = await Promise.all(
    documents.flat().map((doc) => embedDocument(doc, fileKey))
  );

  const client = await getPineconeClient();
  const pineconeIndex = await client.index("chatpdf");

  console.log("Inserting vectors into pinecone");
  const request = vectors;
  await pineconeIndex.upsert(request);
  console.log("Inserted vectors into pinecone");

  return documents[0];
}

async function embedDocument(doc: Document, fileKey: string) {
  try {
    const embeddings = await getEmbeddings(doc.pageContent);
    const hash = md5(doc.pageContent);

    return {
      id: hash,
      values: embeddings,
      metadata: {
        text: doc.metadata.text,
        pageNumber: doc.metadata.pageNumber,
        fileKey,
      },
    } as PineconeRecord;
  } catch (error) {
    console.log("error embedding document", error);
    throw error;
  }
}

export const truncateStringByBytes = (str: string, bytes: number) => {
  const enc = new TextEncoder();
  return new TextDecoder("utf-8").decode(enc.encode(str).slice(0, bytes));
};

async function prepareDocument(page: PDFPage, fileKey: string) {
  console.log(page, "page in preparedoc");
  let { pageContent, metadata } = page;
  pageContent = pageContent.replace(/\n/g, "");
  // split the docs
  const splitter = new RecursiveCharacterTextSplitter();
  const docs = await splitter.splitDocuments([
    new Document({
      pageContent,
      metadata: {
        pageNumber: metadata.loc.pageNumber,
        text: truncateStringByBytes(pageContent, 36000),
        fileKey,
      },
    }),
  ]);
  return docs;
}
  2. In context.js, use query() with a metadata filter on fileKey instead of namespaces.
export async function getMatchesFromEmbeddings(
  embeddings: number[],
  fileKey: string
) {
  try {
    const client = new Pinecone({
      environment: process.env.PINECONE_ENVIRONMENT!,
      apiKey: process.env.PINECONE_API_KEY!,
    });
    const pineconeIndex = await client.index("chatpdf");
    const queryResponse = await pineconeIndex.query({
      vector: embeddings,
      filter: { fileKey: { $eq: fileKey } },
      topK: 5,
      includeMetadata: true,
    });

    return queryResponse.matches || [];
  } catch (error) {
    console.log("error querying embeddings", error);
    throw error;
  }
}
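
To illustrate why the filter can stand in for a namespace, here is a toy in-memory sketch of a metadata-filtered query. It is not Pinecone's implementation: `ToyRecord`, `cosine`, and `toyQuery` are hypothetical names made up for this example, and the file keys and vectors are invented. It only shows the idea that records from every PDF share one index and the `fileKey` equality check limits the matches to a single file before ranking.

```typescript
// Toy model of a metadata-filtered vector query (illustration only, not Pinecone code).
type ToyRecord = {
  id: string;
  values: number[];
  metadata: { fileKey: string; text: string };
};

// Cosine similarity between two equal-length, non-zero vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: keep only records whose fileKey matches (the role of
// { fileKey: { $eq: fileKey } }), then return the topK most similar ones.
function toyQuery(
  records: ToyRecord[],
  vector: number[],
  fileKey: string,
  topK: number
): ToyRecord[] {
  return records
    .filter((r) => r.metadata.fileKey === fileKey)
    .map((r) => ({ record: r, score: cosine(r.values, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((x) => x.record);
}

// Invented sample data: two chunks from one PDF, one chunk from another.
const records: ToyRecord[] = [
  { id: "a1", values: [1, 0], metadata: { fileKey: "uploads/a.pdf", text: "chunk from A" } },
  { id: "a2", values: [0, 1], metadata: { fileKey: "uploads/a.pdf", text: "another chunk from A" } },
  { id: "b1", values: [1, 0], metadata: { fileKey: "uploads/b.pdf", text: "chunk from B" } },
];

const matches = toyQuery(records, [1, 0], "uploads/a.pdf", 5);
console.log(matches.map((m) => m.id)); // prints [ 'a1', 'a2' ]
```

Record b1 is just as similar to the query vector as a1, but it never appears in the result because its fileKey differs.
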
CodeOfMugiwara commented 10 months ago

@tzechong94 It really works, man. I spent so much time trying to figure this out. Thank you very much 😁😁

joepds commented 10 months ago

@tzechong94 Thank you for the code! It works perfectly.

CodeOfMugiwara commented 10 months ago

After the vectors are inserted into Pinecone, the page is redirected, but I can't tell whether the chat is being saved to the database or not. The URL is http://localhost:3000/chat/[object%20Object] instead of http://localhost:3000/chat/1 as in Elliott-Chong's video tutorial.

I'd like to confirm whether the chat is being saved to the database, and why the URL is different for me.

joepds commented 10 months ago

@CodeOfMugiwara Have you checked Drizzle? Try opening Drizzle Studio at 127.0.0.1:4983 if the link that Drizzle Studio prints can't be accessed.

CodeOfMugiwara commented 10 months ago

@joepds Thank you for the URL. The schema is created successfully and the chat IDs are generated correctly, but I'm still facing the URL issue: I'm redirected to http://localhost:3000/chat/[object%20Object] instead of http://localhost:3000/chat/1. How do I solve this?
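
A path ending in [object%20Object] generally means that an object, rather than a number or string, was interpolated into the URL. The sketch below reproduces that effect under an assumption that is not confirmed anywhere in this thread: that the insert returns a row object shaped like `{ insertedId: 1 }` and the whole row, not its id field, ends up in the redirect. The names `rowFromInsert` and `chatUrl` are hypothetical.

```typescript
// Assumed shape of the row returned by the insert (hypothetical, for illustration).
const rowFromInsert = { insertedId: 1 };

// Hypothetical helper that only accepts a numeric id, so passing the whole row
// would be rejected by the type checker.
function chatUrl(chatId: number): string {
  return `/chat/${chatId}`;
}

// Interpolating the whole object calls its default toString().
const buggyUrl = `/chat/${rowFromInsert}`;

// Interpolating the id field gives the expected path.
const fixedUrl = chatUrl(rowFromInsert.insertedId);

console.log(buggyUrl); // prints /chat/[object Object]
console.log(fixedUrl); // prints /chat/1
```

If this is the cause, checking what the chat-creation response actually contains (for example by logging it just before the redirect) should show an object where a plain id is expected.
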

adityatejas562 commented 2 months ago

@tzechong94 What should be used instead of PDFLoader, since it is deprecated now?