feature/weaviate-text2vec-openai

JedWalton commented 11 months ago

Verify with integration tests that documents are indexed as:

Where the documentId is the uuid in the documents schema.

-- Documents table
CREATE TABLE documents (
    id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    user_id VARCHAR(255) NOT NULL REFERENCES users(user_id) ON DELETE CASCADE, -- Changed data type to VARCHAR(255)
    document_name VARCHAR(255) NOT NULL,
    content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(user_id, document_name) -- Composite unique constraint remains unchanged
);

func createWeaviateDocumentsClass(client *weaviate.Client) {
    if client == nil {
        log.Println("Client is nil in createWeaviateDocumentsClass")
        return
    }

    // Check if the class already exists
    if classExists(client, "Documents") {
        log.Println("Class 'Documents' already exists")
        return
    }

    classObj := &models.Class{
        Class:       "Documents",
        Description: "A document with associated metadata",
        Vectorizer:  "text2vec-openai",
        Properties: []*models.Property{
            {
                DataType:    []string{"string"},
                Description: "Unique identifier of the document",
                Name:        "documentId",
            },
            {
                DataType:    []string{"string"},
                Description: "User identifier associated with the document",
                Name:        "userId",
            },
            {
                DataType:    []string{"string"},
                Description: "Name of the document",
                Name:        "documentName",
            },
            {
                DataType:    []string{"text"},
                Description: "Content of the document",
                Name:        "content",
            },
            {
                DataType:    []string{"date"},
                Description: "Creation timestamp of the document",
                Name:        "createdAt",
            },
            {
                DataType:    []string{"date"},
                Description: "Update timestamp of the document",
                Name:        "updatedAt",
            },
        },
    }

    err := client.Schema().ClassCreator().WithClass(classObj).Do(context.Background())
    if err != nil {
        panic(err)
    }
}

JedWalton commented 11 months ago

Adjust so that it returns data only for the given userId.

JedWalton commented 11 months ago

jed@debian:~/Git/lucidify/backend/lucidify-api$ gip
? lucidify-api [no test files]
? lucidify-api/api/chat [no test files]
ok lucidify-api/api/chat/test (cached)
ok lucidify-api/api/clerkapi (cached)
ok lucidify-api/api/documentsapi (cached)
ok lucidify-api/modules/clerkclient (cached)
? lucidify-api/modules/config [no test files]
? lucidify-api/modules/middleware [no test files]
ok lucidify-api/modules/store/postgresqlclient (cached)
ok lucidify-api/modules/store/store (cached)
ok lucidify-api/modules/store/weaviateclient (cached)
? lucidify-api/server [no test files]
jed@debian:~/Git/lucidify/backend/lucidify-api$ gip -count=1
? lucidify-api [no test files]
? lucidify-api/api/chat [no test files]
ok lucidify-api/api/chat/test 0.009s
2023/10/04 03:10:52 Defaulting to port 8080
2023/10/04 03:10:52 Defaulting to port 8080
2023/10/04 03:11:13 Cleaning up test user: user_2WHNGzzpXfrXMiYLhjqH6iqboZh
--- FAIL: TestIntegration_clerk_handlers (20.92s)
    clerk_handlers_integration_test.go:47: User not found after creation: User not found after 10 retries
    clerk_handlers_integration_test.go:59: User first name and last name not updated in users table: User not updated correctly after 10 retries
FAIL
FAIL lucidify-api/api/clerkapi 20.923s
ok lucidify-api/api/documentsapi 3.111s
ok lucidify-api/modules/clerkclient 1.187s
? lucidify-api/modules/config [no test files]
? lucidify-api/modules/middleware [no test files]
ok lucidify-api/modules/store/postgresqlclient 0.085s
ok lucidify-api/modules/store/store 2.726s
ok lucidify-api/modules/store/weaviateclient 1.104s
? lucidify-api/server [no test files]
FAIL
jed@debian:~/Git/lucidify/backend/lucidify-api$

A previously passing test (TestIntegration_clerk_handlers) now fails.

I have sent curl requests to this endpoint and it behaves as expected.

Not sure why the webhook event is not being logged.

JedWalton commented 11 months ago

Switched to ngrok, which resolved the issue.

JedWalton commented 11 months ago

Alright, let's break down how your database is currently being indexed and how the CRUD operations are being handled:

Initialization:

When you create a new Weaviate client using NewWeaviateClient(), you check if the class Documents exists. If it doesn't, you create it using createWeaviateDocumentsClass(client). The Documents class has properties like documentId, userId, documentName, chunk, chunkId, createdAt, and updatedAt.

UploadDocument:

You split the content into chunks using the splitContentIntoChunks(content) function. This function makes an API call to split the text into chunks. For each chunk, you create a new document in Weaviate with properties like documentId, userId, documentName, chunk, and chunkId. You're not explicitly setting an ID for each chunk, so Weaviate will auto-generate one for you.

GetDocument:

You retrieve a document based on its documentId. You then combine all the chunks associated with that documentId to form the complete document content.

UpdateDocument:

You first delete all existing chunks for the document using its documentId. Then, you upload the new content using the UploadDocument function.

DeleteDocument:

You delete a document and all its associated chunks using its documentId.

SearchDocumentsByText:

You search for documents based on certain concepts and a user ID. The search is limited by the limit parameter.

From the code, here are some observations and potential improvements:

Indexing:

Each chunk of a document is indexed as a separate object in Weaviate. This means that if a document has 10 chunks, there will be 10 separate objects in Weaviate, all with the same documentId but different content in the chunk property. This approach allows you to leverage Weaviate's vector search capabilities on each chunk. However, it also means that operations like GetDocument and UpdateDocument need to handle multiple chunks.

CRUD Operations:

The documentId is crucial for CRUD operations. It's used to identify a document and all its associated chunks. When you upload a document, you're not setting a unique ID for each chunk. Instead, Weaviate auto-generates one. This is fine, but it means that the documentId property is essential for grouping chunks together. The GetDocument function retrieves all chunks associated with a documentId and then combines them to return the full document content. The UpdateDocument function deletes all chunks associated with a documentId and then re-uploads the new content.

Potential Improvements:

Consider adding error handling in the UploadDocument function. If uploading one chunk fails, you might end up with a partially uploaded document.

In the GetDocument function, you're assuming that the chunks are returned in the correct order. This might not always be the case. Consider sorting the chunks based on the chunkId before combining them.

The UpdateDocument function deletes and then re-uploads the content. This might be inefficient for large documents. Consider updating only the chunks that have changed.

Overall, your approach of splitting documents into chunks and indexing each chunk separately in Weaviate allows you to leverage Weaviate's vector search capabilities. However, it also adds complexity to CRUD operations, as each operation needs to handle multiple chunks.
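The ordering concern raised for GetDocument can be illustrated with a minimal sketch. It assumes chunkId is an integer position within the document, which the thread does not confirm; the Chunk type and joinChunks function are hypothetical names, not code from this branch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Chunk is a hypothetical in-memory view of one chunk object returned by the
// vector store. The fields mirror the documentId, chunkId and chunk
// properties described above; chunkId is ASSUMED to be an integer position.
type Chunk struct {
	DocumentID string
	ChunkID    int
	Text       string
}

// joinChunks sorts a copy of the chunks by ChunkID and concatenates their
// text, so the result does not depend on the order the store returned them in.
func joinChunks(chunks []Chunk) string {
	ordered := make([]Chunk, len(chunks))
	copy(ordered, chunks)
	sort.Slice(ordered, func(i, j int) bool {
		return ordered[i].ChunkID < ordered[j].ChunkID
	})

	var b strings.Builder
	for _, c := range ordered {
		b.WriteString(c.Text)
	}
	return b.String()
}

func main() {
	// Chunks deliberately out of order, as a query result might return them.
	chunks := []Chunk{
		{DocumentID: "doc-1", ChunkID: 2, Text: "world"},
		{DocumentID: "doc-1", ChunkID: 0, Text: "hello"},
		{DocumentID: "doc-1", ChunkID: 1, Text: ", "},
	}
	fmt.Println(joinChunks(chunks)) // prints "hello, world"
}
```

Sorting a copy keeps the caller's slice untouched. If chunkId turns out to be a string or a UUID, a separate integer index property would be needed for this to work.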

JedWalton commented 11 months ago

Manage the chunking in the documents service. Store the chunks in a Postgres table.

Something like:

-- Chunks table
CREATE TABLE document_chunks (
    chunk_id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_content TEXT NOT NULL,
    chunk_index INT NOT NULL, -- To keep track of the order of chunks for a document
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Index on document_id for the document_chunks table
CREATE INDEX idx_document_chunks_document_id ON document_chunks(document_id);
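A rough sketch of how the documents service might produce rows for this table. Fixed-size splitting by rune count is an assumption made purely for illustration (the existing code is described as calling an API to split text); ChunkRow and splitIntoChunkRows are hypothetical names.

```go
package main

import "fmt"

// ChunkRow is a hypothetical Go counterpart of one document_chunks row.
// chunk_id and the timestamps are left to the database defaults.
type ChunkRow struct {
	DocumentID   string
	ChunkContent string
	ChunkIndex   int
}

// splitIntoChunkRows cuts content into pieces of at most maxRunes runes and
// numbers them from 0, so the original order can be restored from chunk_index.
// Fixed-size splitting is an ASSUMPTION, not the splitter used in this branch.
func splitIntoChunkRows(documentID, content string, maxRunes int) []ChunkRow {
	if maxRunes <= 0 {
		return nil
	}
	runes := []rune(content) // split on runes so multi-byte characters stay whole
	var rows []ChunkRow
	for start := 0; start < len(runes); start += maxRunes {
		end := start + maxRunes
		if end > len(runes) {
			end = len(runes)
		}
		rows = append(rows, ChunkRow{
			DocumentID:   documentID,
			ChunkContent: string(runes[start:end]),
			ChunkIndex:   len(rows),
		})
	}
	return rows
}

func main() {
	for _, r := range splitIntoChunkRows("doc-1", "abcdefgh", 3) {
		fmt.Printf("%d %q\n", r.ChunkIndex, r.ChunkContent)
	}
}
```

Because chunk_index is assigned at split time, reading the chunks back with ORDER BY chunk_index would reconstruct the document regardless of insertion order.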

Then we want to store these individual chunks in the vector DB.

Update the document service to handle this accordingly...

type DocumentService interface {
    UploadDocument(userID, name, content string) (postgresqlclient.Document, error)
    GetDocument(userID, name string) (postgresqlclient.Document, error)
    GetAllDocuments(userID string) ([]postgresqlclient.Document, error)
    DeleteDocument(userID, name, documentUUID string) error
    UpdateDocumentName(documentUUID, name string) error
    UpdateDocumentContent(documentUUID, content string) error
}

The DocumentService is where we interface with our databases.

The postgresqlclient and weaviateclient packages are wrappers around their specific databases.

Our app uses the store package.
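Since the document service now writes to two stores, one possible shape for an upload that does not leave a partial document behind is sketched below. RelationalStore, VectorStore, uploadDocument and the in-memory fakes are all hypothetical stand-ins; the real wrappers and the DocumentService signature above differ (for instance, this sketch returns an ID string rather than a postgresqlclient.Document).

```go
package main

import (
	"errors"
	"fmt"
)

// RelationalStore and VectorStore are HYPOTHETICAL, narrowed-down stand-ins
// for the PostgreSQL and Weaviate wrappers.
type RelationalStore interface {
	InsertDocument(userID, name, content string) (documentID string, err error)
	DeleteDocument(documentID string) error
}

type VectorStore interface {
	IndexChunk(documentID string, chunkIndex int, text string) error
	DeleteChunks(documentID string) error
}

// uploadDocument stores the document, then indexes each chunk. If indexing
// fails part-way, it removes what was written so no partial document remains.
func uploadDocument(rel RelationalStore, vec VectorStore, userID, name, content string, chunks []string) (string, error) {
	docID, err := rel.InsertDocument(userID, name, content)
	if err != nil {
		return "", fmt.Errorf("insert document: %w", err)
	}
	for i, chunk := range chunks {
		if err := vec.IndexChunk(docID, i, chunk); err != nil {
			// Best-effort cleanup; any cleanup errors are joined onto the cause.
			cleanupErr := errors.Join(vec.DeleteChunks(docID), rel.DeleteDocument(docID))
			return "", errors.Join(fmt.Errorf("index chunk %d: %w", i, err), cleanupErr)
		}
	}
	return docID, nil
}

// In-memory fakes used only to exercise the sketch.
type memRel struct{ docs map[string]string }

func (m *memRel) InsertDocument(userID, name, content string) (string, error) {
	id := userID + "/" + name // stand-in for the UUID the database would generate
	m.docs[id] = content
	return id, nil
}
func (m *memRel) DeleteDocument(id string) error { delete(m.docs, id); return nil }

type memVec struct {
	chunks map[string][]string
	failAt int // chunk index at which IndexChunk fails; -1 never fails
}

func (m *memVec) IndexChunk(id string, i int, text string) error {
	if i == m.failAt {
		return errors.New("simulated vector store failure")
	}
	m.chunks[id] = append(m.chunks[id], text)
	return nil
}
func (m *memVec) DeleteChunks(id string) error { delete(m.chunks, id); return nil }

func main() {
	rel := &memRel{docs: map[string]string{}}
	vec := &memVec{chunks: map[string][]string{}, failAt: 1}
	_, err := uploadDocument(rel, vec, "user-1", "notes", "ab", []string{"a", "b"})
	fmt.Println(err != nil, len(rel.docs), len(vec.chunks)) // prints "true 0 0"
}
```

The sketch uses errors.Join, which requires Go 1.20 or later. Compensating deletes are only one option; writing the document and its document_chunks rows in a single Postgres transaction and indexing in Weaviate afterwards would be another.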

JedWalton commented 11 months ago

Ensure that when a user deletes their account, all associated files are wiped from the system: documents in PostgreSQL, document_chunks, and Weaviate.
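Given the ON DELETE CASCADE clauses in the schemas earlier in this thread, deleting the user row should remove that user's documents and document_chunks rows in Postgres, but nothing cascades into Weaviate. A minimal sketch of the ordering this implies follows; UserStore, ChunkIndex and deleteUserData are hypothetical names, and the fakes exist only to exercise the flow.

```go
package main

import "fmt"

// UserStore and ChunkIndex are HYPOTHETICAL interfaces standing in for the
// PostgreSQL and Weaviate wrappers.
type UserStore interface {
	DocumentIDs(userID string) ([]string, error)
	// DeleteUser is ASSUMED to delete the users row, which cascades to
	// documents and document_chunks via the foreign keys.
	DeleteUser(userID string) error
}

type ChunkIndex interface {
	DeleteChunks(documentID string) error
}

// deleteUserData collects the user's document IDs first, removes their chunks
// from the vector store, and only then deletes the user. Deleting the user
// first would cascade away the IDs needed to find the vector objects.
func deleteUserData(users UserStore, index ChunkIndex, userID string) error {
	ids, err := users.DocumentIDs(userID)
	if err != nil {
		return fmt.Errorf("list documents: %w", err)
	}
	for _, id := range ids {
		if err := index.DeleteChunks(id); err != nil {
			return fmt.Errorf("delete vector chunks for %s: %w", id, err)
		}
	}
	if err := users.DeleteUser(userID); err != nil {
		return fmt.Errorf("delete user: %w", err)
	}
	return nil
}

// In-memory fakes used only to exercise the sketch.
type memUsers struct{ docs map[string][]string } // userID -> document IDs

func (m *memUsers) DocumentIDs(u string) ([]string, error) { return m.docs[u], nil }
func (m *memUsers) DeleteUser(u string) error              { delete(m.docs, u); return nil }

type memIndex struct{ chunks map[string]int } // documentID -> chunk count

func (m *memIndex) DeleteChunks(id string) error { delete(m.chunks, id); return nil }

func main() {
	users := &memUsers{docs: map[string][]string{"u1": {"d1", "d2"}, "u2": {"d3"}}}
	index := &memIndex{chunks: map[string]int{"d1": 2, "d2": 1, "d3": 4}}
	if err := deleteUserData(users, index, "u1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(users.docs), len(index.chunks)) // prints "1 1"
}
```

An integration test for this could assert that, after deletion, neither store returns anything for the deleted user while other users' data is untouched.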

JedWalton commented 11 months ago

[removed]

JedWalton commented 11 months ago

Trace through tests to ensure full test cleanup.

JedWalton / lucidify

feature/weaviate-text2vec-openai #14