Open AN-BC opened 5 months ago
Vectorization should probably be handled in an async process downstream from the actual upload.
sequenceDiagram
User->>+API: Create File
API->>User: Pre-auth upload URL
User->>Cloud: Upload file
User->>API: Confirm upload
API-->>Queue: Vectorization request
API->>-User: Updated File record
Note left of API: End of request<br />lifecycle
Queue-->>API: Dequeue vectorization request
Note right of API: Vectorization<br/>goes here
We could add a flag to the File entity to indicate whether the vector is available (maybe even a more detailed status to indicate different new/refreshing/ready states).
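A minimal sketch of the flow in the diagram, with the status flag tracked on the file record. Everything here is an assumption for illustration: the `VectorStatus` values (including the extra `failed` state), the in-memory `files` map and `queue` array, and the `confirmUpload` / `processNext` names are hypothetical and are not MJ APIs. A real worker would call an async embedding service; the placeholder is synchronous to keep the sketch short.

```typescript
// Hypothetical status values; the thread only suggests "new/refreshing/ready".
// "failed" is an extra state added here for illustration.
type VectorStatus = "new" | "refreshing" | "ready" | "failed";

interface FileRecord {
  id: string;
  name: string;
  vectorStatus: VectorStatus;
}

interface VectorizationRequest {
  fileId: string;
}

// In-memory stand-ins for the File table and the queue (assumptions).
const files = new Map<string, FileRecord>();
const queue: VectorizationRequest[] = [];

// Called when the user confirms the upload: enqueue and return immediately.
function confirmUpload(fileId: string): FileRecord {
  const file = files.get(fileId);
  if (!file) throw new Error(`Unknown file: ${fileId}`);
  // A file that already has a vector is being refreshed; otherwise it is new.
  file.vectorStatus = file.vectorStatus === "ready" ? "refreshing" : "new";
  queue.push({ fileId });
  return file; // the request lifecycle ends here
}

// Runs later, outside the request lifecycle. `vectorize` is a placeholder.
// Returns false when the queue is empty, true when a request was consumed.
function processNext(vectorize: (file: FileRecord) => void): boolean {
  const request = queue.shift();
  if (!request) return false;
  const file = files.get(request.fileId);
  if (!file) return true; // file was deleted after enqueue; nothing to do
  try {
    vectorize(file);
    file.vectorStatus = "ready";
  } catch {
    file.vectorStatus = "failed";
  }
  return true;
}
```

With this shape, the API response carries the `new` or `refreshing` status, and clients can check the record later to see when the vector becomes `ready`.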
Agreed. As long as we have reasonable confidence that this is robust, it's fine. It certainly doesn't need to be atomic, just pretty reliable.
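One way to get "pretty reliable" without atomicity is at-least-once processing with a bounded number of retries. The sketch below is only an illustration of that idea: `Job`, `MAX_ATTEMPTS`, `runJob`, and the dead-letter list are hypothetical names, not part of MJ. Because a job can run more than once, the `vectorize` callback should be idempotent (for example, by overwriting the file's existing vectors rather than appending).

```typescript
interface Job {
  fileId: string;
  attempts: number; // how many times this job has already failed
}

// Assumed retry limit, chosen arbitrarily for the sketch.
const MAX_ATTEMPTS = 3;

// Runs one job. On failure the job is re-queued until it has failed
// MAX_ATTEMPTS times, after which it is parked in a dead-letter list
// for manual inspection instead of being silently dropped.
function runJob(
  job: Job,
  pending: Job[],
  deadLetter: Job[],
  vectorize: (fileId: string) => void
): void {
  try {
    vectorize(job.fileId);
  } catch {
    const next: Job = { ...job, attempts: job.attempts + 1 };
    if (next.attempts < MAX_ATTEMPTS) {
      pending.push(next);
    } else {
      deadLetter.push(next);
    }
  }
}
```

A production queue would normally add a delay between attempts; that is left out here to keep the example synchronous.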
From: Craig Adam; Sent: Tuesday, May 28, 2024 10:40 AM; To: MemberJunction/MJ; Cc: Amith Nagarajan; Author; Subject: Re: [MemberJunction/MJ] Auto-vectorization of Files (Issue #237)
@AN-BC Yes, once we figure out the chunking strategy, this should be easy to implement.
@JS-BC and @cadam11, is this ready to be closed?
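For reference, the simplest chunking strategy is a fixed-size window with overlap. The sketch below is just a baseline to discuss, not the strategy chosen in this thread: `chunkText` is a hypothetical helper, and the default sizes are arbitrary and measured in characters rather than tokens.

```typescript
// Splits text into fixed-size chunks where each chunk repeats the last
// `overlap` characters of the previous one, so content that straddles a
// boundary still appears whole in at least one chunk.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be at least 0 and less than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text, so no trailing
    // chunk made up only of overlap is emitted.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

For example, `chunkText("abcdefghij", 4, 2)` returns `["abcd", "cdef", "efgh", "ghij"]`. Splitting on sentence or paragraph boundaries would likely give better embeddings, at the cost of more logic.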
@AN-BC The infrastructure is in place for this, but I have not done any work in vectorizing files. I know @nico-oz has been working on vectorizing files, so maybe he can chime in here.
@cadam11 I think we should use the vectorization functionality now in the core MJ framework to let people auto-vectorize a single file or all files in a category. My thoughts:
We will probably need a chunking strategy to make this work, and @dray-mc could probably provide some ideas on this.
The business benefit is that if we auto-vectorize a given file category, a user could have a category for "articles" or whatever, and whenever they upload files there, programmatically or via the MJ Explorer UI, the vector DB would be maintained in parallel automatically.
If the embeddings were stored in the same vector index as the entity record vectors, this would also allow comparison of unstructured/semi-structured data in files against structured data in the relational DB, which could be very powerful.
Given the infra we already have for vectorization, I think this might be pretty easy to do, @JS-BC?
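To illustrate the shared-index idea, here is a small sketch that compares a file chunk's embedding against entity record embeddings held in the same collection. The `IndexedVector` shape, the `source` tag, and the `nearest` helper are assumptions made for this example; a real vector DB would do the similarity search itself, and the MJ vectorization API is not shown.

```typescript
type Source = "file" | "entity";

// Hypothetical shape of one entry in a shared vector index.
interface IndexedVector {
  id: string;
  source: Source;
  values: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0; // lengths match, so the fallback is never used
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / Math.sqrt(normA * normB);
}

// Returns the k entries from the given source most similar to the query,
// for example the entity records closest to a file chunk.
function nearest(
  query: IndexedVector,
  index: IndexedVector[],
  source: Source,
  k: number
): Array<{ id: string; score: number }> {
  return index
    .filter((v) => v.source === source && v.id !== query.id)
    .map((v) => ({ id: v.id, score: cosineSimilarity(query.values, v.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

This only works if file chunks and entity records are embedded with the same model, so that both sets of vectors have the same dimensions and live in the same embedding space.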