Open stevenbaert opened 5 months ago
Of course you can do that! For me, there's just too much low-value data in all my emails, and it contaminates the vector search.
An LLM should be able to help you clean up the data during preprocessing. There are probably already GitHub projects that do this. Also, the "noise" is the same for one mail as for a million mails, right?
The only relevant approach for me seems to be putting all the data into the vector DB, then optimizing that process and doing delta updates. Especially since with open source there is no cost involved in the attempt(s). Querying the full mailbox potentially gives much more valuable insights.
Just thinking out loud: for each mail you could make a JSON object with type:mail, timestamp, subject, from, to, att, bodytoplaintext, etc. and feed that into the vector DB. That way you can ask for recent mails from person X. It will take some experimentation, but it is of so much more value, and once fine-tuned, only delta updates are needed and the interactions get way more interesting. Same approach for calendar items.
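To make the per-mail JSON idea a bit more concrete, here is a rough sketch using only Python's standard `email` module. The field names come from the comment above; the function name `mail_to_record` and the sample message are made up for illustration, and actually feeding the records into a vector DB is left out because that depends on which store you use.

```python
import json
from email import message_from_string, policy
from email.utils import parsedate_to_datetime


def mail_to_record(raw: str) -> dict:
    """Turn one raw RFC 5322 message into a flat dict (hypothetical helper)."""
    msg = message_from_string(raw, policy=policy.default)

    # Prefer the plain-text part; fall back to an empty body if there is none.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content().strip() if body_part is not None else ""

    # Normalize the Date header to ISO 8601 so records can be filtered by time.
    date_header = msg["Date"]
    timestamp = (
        parsedate_to_datetime(str(date_header)).isoformat() if date_header else None
    )

    return {
        "type": "mail",
        "timestamp": timestamp,
        "subject": str(msg["Subject"] or ""),
        "from": str(msg["From"] or ""),
        "to": str(msg["To"] or ""),
        # Attachment file names only; the attachment contents are not embedded here.
        "att": [part.get_filename() for part in msg.iter_attachments()],
        "bodytoplaintext": body,
    }


# Made-up sample message, purely for illustration.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "Date: Mon, 01 Jan 2024 12:00:00 +0000\n"
    "\n"
    "See you at noon.\n"
)

record = mail_to_record(SAMPLE)
print(json.dumps(record, indent=2))
```

Keeping `timestamp` and `from` as separate fields is what would let you filter on "recent mails from person X" as metadata before or after the vector search, instead of hoping the embedding captures it.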
Why would you embed every search separately? Doesn't make sense to me. Since you use free embeddings and free Llama 3, you could just as well vectorize your full mailbox and run a (daily) scheduled job that embeds the diff.
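A minimal sketch of what such a "embed the diff" job could look like, assuming you keep a small index of content hashes per mail ID between runs. Everything here is hypothetical: `embed_diff`, the in-memory `seen` and `store` dicts, and the `embed` callable are stand-ins for your real embedding model and vector DB, and the scheduling itself (cron or similar) is not shown.

```python
import hashlib
from typing import Callable, Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint used to detect new or changed mails."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_diff(
    mailbox: Dict[str, str],             # mail ID -> text to embed
    seen: Dict[str, str],                # mail ID -> hash from the previous run
    embed: Callable[[str], List[float]], # placeholder for the real embedding model
    store: Dict[str, List[float]],       # placeholder for the real vector DB
) -> List[str]:
    """Embed only mails that are new or changed; return their IDs."""
    updated = []
    for mail_id, text in mailbox.items():
        digest = content_hash(text)
        if seen.get(mail_id) == digest:
            continue  # unchanged since the last run, skip the embedding call
        store[mail_id] = embed(text)
        seen[mail_id] = digest
        updated.append(mail_id)

    # Drop vectors for mails that no longer exist in the mailbox.
    for mail_id in list(seen):
        if mail_id not in mailbox:
            del seen[mail_id]
            store.pop(mail_id, None)
    return updated


# Toy run with a fake embedding function that records how often it is called.
calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]


seen: Dict[str, str] = {}
store: Dict[str, List[float]] = {}
first = embed_diff({"a": "hi", "b": "yo"}, seen, fake_embed, store)
second = embed_diff({"a": "hi", "b": "yo", "c": "new"}, seen, fake_embed, store)
```

In the toy run, the second call only embeds the one new mail, which is the whole point: the full mailbox is embedded once, and each scheduled run pays only for what changed.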