Auto-vectorization of Files

AN-BC commented 5 months ago

@cadam11 I think we should use the vectorization functionality now in the core MJ framework to offer people the ability to auto-vectorize a file or all files in a category. My thought is that:

File Categories: each category can have an optional Vector Index linked to it. When one is linked, any file added to, updated in, or deleted from that category has its embeddings created, updated, or deleted in the vector store as the file changes.
Files: perhaps we also offer a Vector Index on a per-file basis, or perhaps that waits until a need arises.
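
To make the category-level idea concrete, here is a minimal sketch of keeping a category's vector index in step with file changes. Everything in it (`FileRecord`, `VectorIndex`, `InMemoryVectorIndex`, `syncFileVector`, the `Embedder` type) is a hypothetical name for illustration, not the actual MJ entity or vectorization API, and the embedding call is kept synchronous for brevity where a real one would likely be async.

```typescript
// Hypothetical shapes for illustration only; not the real MJ entities or vector APIs.
type FileChange = "added" | "updated" | "deleted";

interface FileRecord {
  id: string;
  categoryId: string;
  text: string; // extracted file contents (extraction is out of scope here)
}

interface VectorIndex {
  upsert(id: string, vector: number[]): void;
  remove(id: string): void;
}

// Simple in-memory index standing in for a real vector store.
class InMemoryVectorIndex implements VectorIndex {
  readonly vectors = new Map<string, number[]>();
  upsert(id: string, vector: number[]): void {
    this.vectors.set(id, vector);
  }
  remove(id: string): void {
    this.vectors.delete(id);
  }
}

// Assumed embedding function; synchronous here to keep the sketch short.
type Embedder = (text: string) => number[];

// Mirror a file change into the Vector Index linked to the file's category.
// Returns false when the category has no linked index, so nothing was done.
function syncFileVector(
  file: FileRecord,
  change: FileChange,
  indexByCategory: Map<string, VectorIndex>,
  embed: Embedder
): boolean {
  const index = indexByCategory.get(file.categoryId);
  if (!index) return false; // Vector Index is optional per category
  if (change === "deleted") {
    index.remove(file.id);
  } else {
    // "added" and "updated" both (re)compute the embedding and upsert it
    index.upsert(file.id, embed(file.text));
  }
  return true;
}
```

A per-file Vector Index could reuse the same function by looking up the index from the file instead of from its category.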

Probably will need a chunking strategy to make this work and @dray-mc could probably provide some ideas on this.
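
As a starting point for that discussion, one common and simple option is fixed-size chunks with some overlap, so text near a boundary appears in both neighboring chunks. The sketch below is only an assumption about what a first pass could look like; `chunkText` and its parameters are hypothetical, and it counts UTF-16 code units rather than tokens or sentences.

```typescript
// Split text into fixed-size chunks that overlap by `overlap` characters.
// Hypothetical helper for discussion; not an existing MJ function.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be at least 0 and less than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far each chunk's start advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}
```

Each chunk would then get its own embedding, so the vector store would need a key that combines the file ID with the chunk position.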

The biz benefit is that if we auto-vectorize a given file category, a user could have a category for "articles" or whatever, and whenever they upload files there, programmatically or via the MJ Explorer UI, the vector db would be maintained in parallel automatically.

If the embeddings were stored in the same vector index as entity record vectors, this would then allow comparison of unstructured/semi-structured data in files against structured data in the relational DB, which could be very powerful.
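
A rough sketch of what that comparison could look like, assuming file vectors and entity record vectors sit side by side in one index and are compared by cosine similarity. The `IndexedVector` shape and both function names are hypothetical, and a real vector database would run this search server-side rather than in application code.

```typescript
// Hypothetical entry shape: file and entity record vectors share one index.
interface IndexedVector {
  id: string;
  source: "file" | "entityRecord";
  vector: number[];
}

// Cosine similarity of two equal-length vectors; 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Given a file's vector, return the topK most similar entity record entries.
function nearestEntityRecords(
  fileVector: number[],
  index: IndexedVector[],
  topK: number
): IndexedVector[] {
  return index
    .filter((entry) => entry.source === "entityRecord")
    .map((entry) => ({ entry, score: cosineSimilarity(fileVector, entry.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, topK)
    .map((scored) => scored.entry);
}
```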

Given the infra we already have for vectorization, I think this might be pretty easy to do. @JS-BC?

cadam11 commented 5 months ago

Vectorization should probably be handled in an async process downstream from the actual upload.

sequenceDiagram
    User->>+API: Create File
    API->>User: Pre-auth upload URL
    User->>Cloud: Upload file
    User->>API: Confirm upload
    API-->>Queue: Vectorization request
    API->>-User: Updated File record
    Note left of API: End of request<br />lifecycle

    Queue-->>API: Dequeue vectorization request
    Note right of API: Vectorization<br/>goes here

We could add a flag to the File entity to indicate whether the vector is available (maybe even a more detailed status to indicate different new/refreshing/ready states).
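
A minimal sketch of how the queue and status flag could fit together, assuming the three states mentioned above. The names (`VectorStatus`, `TrackedFile`, `VectorizationQueue`, `confirmUpload`, `processNext`) are hypothetical, the queue is a plain in-memory array standing in for a real message queue, and the retry on failure is deliberately naive.

```typescript
// Assumed status values, taken from the new/refreshing/ready idea above.
type VectorStatus = "new" | "refreshing" | "ready";

interface TrackedFile {
  id: string;
  vectorStatus: VectorStatus;
}

// In-memory stand-in for a real queue; ignores duplicate requests for a file.
class VectorizationQueue {
  private pending: string[] = [];
  enqueue(fileId: string): void {
    if (!this.pending.includes(fileId)) this.pending.push(fileId);
  }
  dequeue(): string | undefined {
    return this.pending.shift();
  }
  get size(): number {
    return this.pending.length;
  }
}

// End of the "Confirm upload" request: queue the work and return right away.
function confirmUpload(file: TrackedFile, queue: VectorizationQueue): TrackedFile {
  // A file that already had a vector is being refreshed; a first upload stays "new".
  if (file.vectorStatus === "ready") file.vectorStatus = "refreshing";
  queue.enqueue(file.id);
  return file;
}

// Downstream worker step: handle one queued request.
// Returns false when the queue was empty.
function processNext(
  queue: VectorizationQueue,
  files: Map<string, TrackedFile>,
  vectorize: (fileId: string) => void
): boolean {
  const fileId = queue.dequeue();
  if (fileId === undefined) return false;
  const file = files.get(fileId);
  if (!file) return true; // file was deleted before the worker got to it
  try {
    vectorize(fileId);
    file.vectorStatus = "ready";
  } catch {
    // Naive retry: leave the status as-is and put the request back.
    queue.enqueue(fileId);
  }
  return true;
}
```

Because the status only flips to "ready" after vectorization succeeds, a failed or interrupted run leaves the file visibly not ready instead of silently stale. A real worker would also want a retry limit or backoff.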

AN-BC commented 5 months ago

Agreed. As long as we have a reasonable sense that this is robust, that is fine. It doesn't need to be atomic, for sure, but it should be pretty reliable.

JS-BC commented 5 months ago

@AN-BC Yes, once we figure out the chunking strategy, this should be easy to implement.

AN-BC commented 4 months ago

@JS-BC and @cadam11, is this ready to be closed?

JS-BC commented 4 months ago

@AN-BC The infrastructure is in place for this, but I have not done any work in vectorizing files. I know @nico-oz has been working on vectorizing files, so maybe he can chime in here.

MemberJunction / MJ

Auto-vectorization of Files #237