dothe-Best opened 6 days ago
Can you share a bit more of what you are doing and what you want to achieve?
I want to build a network for network analysis with the dataset. I want GPTR to read the dataset correctly and give me some ideas in conjunction with it.
@assafelovic - I was out for a while. Do we expect GPTR to deal with numerical data sets like this?
Welcome @dothe-Best
It sounds like you'll want to set up a separate process for data ingestion.
GPTR is using Langchain Documents and Langchain VectorStores under the hood.
The flow would be:
Step 1: transform your content into Langchain Documents
Step 2: Insert your Langchain Documents into your Langchain VectorStore
Step 3: Pass your Langchain VectorStore into your GPTR report run (more examples here and below)
Note: if your embedding model is hitting API limits, or the DB behind your Langchain VectorStore needs to pace itself, you can handle that in your Python code.
In the example below, we're splitting the documents list into chunks of 100 and then inserting one chunk at a time into the vector store; a variation that also pauses between batches is sketched after Step 2.
Code samples below:
Assuming your .env variables are like so:
OPENAI_API_KEY={Your OpenAI API Key here}
TAVILY_API_KEY={Your Tavily API Key here}
PGVECTOR_CONNECTION_STRING=postgresql://username:password...
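If you keep those in a local .env file, here's a minimal sketch for loading them into the process environment (this assumes the python-dotenv package, which is just one way to do it):

import os
from dotenv import load_dotenv

# Load the .env file from the current working directory
load_dotenv()

openai_key = os.environ["OPENAI_API_KEY"]
tavily_key = os.environ["TAVILY_API_KEY"]
pgvector_connection_string = os.environ["PGVECTOR_CONNECTION_STRING"]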
Step 1:
from datetime import datetime
from uuid import uuid4
import base64
import os

from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

async def transform_to_langchain_docs(self, directory_structure):
    documents = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
    run_timestamp = datetime.utcnow().strftime('%Y%m%d%H%M%S')
    for file_name in directory_structure:
        if not file_name.endswith('/'):
            try:
                content = self.repo.get_contents(file_name, ref=self.branch_name)
                try:
                    decoded_content = base64.b64decode(content.content).decode()
                except Exception as e:
                    print(f"Error decoding content: {e}")
                    print("the problematic file_name is", file_name)
                    continue
                print("file_name", file_name)
                print("content", decoded_content)

                # Split each document into smaller chunks
                chunks = splitter.split_text(decoded_content)

                # Attach metadata to each chunk
                for index, chunk in enumerate(chunks):
                    metadata = {
                        "id": f"{run_timestamp}_{uuid4()}",  # unique ID for each chunk
                        "source": file_name,
                        "title": file_name,
                        "extension": os.path.splitext(file_name)[1],
                        "file_path": file_name
                    }
                    document = Document(
                        page_content=chunk,
                        metadata=metadata
                    )
                    documents.append(document)
            except Exception as e:
                print(f"Error processing {file_name}: {e}")
                return None

    await self.save_to_vector_store(documents)  # defined in Step 2
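Side note for numerical/tabular data like the dataset mentioned above: the same Step 1 idea applies, you just build the Documents from rows instead of repo files. Rough sketch below, assuming a hypothetical local file dataset.csv; adjust the row-to-text formatting for your data.

import csv
from uuid import uuid4

from langchain_core.documents import Document

def csv_rows_to_documents(csv_path):
    # Turn each row into a small "column: value" text blob so the
    # embedding model gets some context around each number.
    documents = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        for row_index, row in enumerate(reader):
            text = "; ".join(f"{column}: {value}" for column, value in row.items())
            documents.append(Document(
                page_content=text,
                metadata={
                    "id": str(uuid4()),
                    "source": csv_path,
                    "row": row_index,
                },
            ))
    return documents

# documents = csv_rows_to_documents("dataset.csv")  # hypothetical path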
Step 2:
import os

from langchain_postgres import PGVector
from sqlalchemy.ext.asyncio import create_async_engine
# newer LangChain versions expose this as langchain_openai.OpenAIEmbeddings
from langchain_community.embeddings import OpenAIEmbeddings

async def save_to_vector_store(self, documents):
    # The documents are already Document objects, so no conversion is needed
    embeddings = OpenAIEmbeddings()
    # self.vector_store = FAISS.from_documents(documents, embeddings)
    pgvector_connection_string = os.environ["PGVECTOR_CONNECTION_STRING"]
    collection_name = "my_docs"
    vector_store = PGVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=pgvector_connection_string,
        use_jsonb=True
    )

    # for FAISS:
    # self.vector_store = vector_store.add_documents(documents, ids=[doc.metadata["id"] for doc in documents])

    # Split the documents list into chunks of 100 and insert one chunk at a time
    for i in range(0, len(documents), 100):
        chunk = documents[i:i + 100]
        vector_store.add_documents(chunk, ids=[doc.metadata["id"] for doc in chunk])
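If the embedding API starts rate-limiting you, you can also pause between batches. Minimal sketch (the batch size and delay are just example values, tune them for your provider):

import asyncio

async def save_in_batches(vector_store, documents, batch_size=100, delay_seconds=1.0):
    # Insert one batch at a time and sleep between batches so the
    # embedding API and the database can keep up.
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        vector_store.add_documents(batch, ids=[doc.metadata["id"] for doc in batch])
        await asyncio.sleep(delay_seconds)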
Step 3:
from gpt_researcher import GPTResearcher

# Reuse the same connection string, but with the async psycopg3 driver
async_connection_string = pgvector_connection_string.replace("postgresql://", "postgresql+psycopg://")

# Initialize the async engine with the psycopg3 driver
async_engine = create_async_engine(
    async_connection_string,
    echo=True
)

async_vector_store = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=async_engine,
    use_jsonb=True
)

researcher = GPTResearcher(
    query=query,  # your research question string
    report_type="research_report",
    report_source="langchain_vectorstore",
    vector_store=async_vector_store,
)
await researcher.conduct_research()
report = await researcher.write_report()
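Since conduct_research and write_report are coroutines, the Step 3 snippet needs to run inside an event loop. A minimal wrapper, assuming you move that snippet into a main() coroutine:

import asyncio

async def main():
    # ... Step 3 code from above goes here ...
    print(report)

if __name__ == "__main__":
    asyncio.run(main())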
When I use GPTR in hybrid mode to read a large local data set of several hundred MB, it does start reading: Activity Monitor on my Mac shows the python3.13 process using more than 60 GB of memory, so the data set is clearly being ingested. After about twenty minutes that memory is released, which I take to mean the data set has been read, but the UI stays stuck on the "reading the data set" screen and never progresses, no matter how many hours I wait. What could be the reason for this?