freedmand / semantra

Multi-tool for semantic search
MIT License

Fantastic. Enhancement ideas #24

Open jasmeetchadha opened 1 year ago

jasmeetchadha commented 1 year ago

I like the idea of semantic search. To expand on it (and I'm sure some of this has been said before), it would be good to add:

UI / Input

1) Option to interact from the web UI instead of the command line (adding files, etc.)
2) Increased file-type coverage: Office formats, Markdown, images (?)
3) Input a folder, with all files inside covered

Processing

4) Specify where embeddings are stored from the UI (or make it clear in the README)
5) Save embeddings to avoid repeat processing, and auto-update embeddings at startup for any new files added to the folder
6) Enable GPU mode

Summarization (additional)

7) Perhaps use Vicuna (quite good already) or another local LLM to process the results of the search query into a coherent reply (with sources and links to the source documents)
8) Show this reply adjacent to a "source window" (in the case of multiple documents), keeping full visibility into the source text for the user

freedmand commented 1 year ago

Thanks for this write-up. Pretty spot-on in terms of things I'd like to tackle soon. I'll share some thoughts on each:

  1. I'd like the web UI to be able to do everything too. This will take a bit of rearchitecting the system per #13 to support a separate indexing process.
  2. Agreed re: file types. I'm thinking about how to do this well with a modular system for file loaders; you don't want to reprocess things you don't have to. I'm looking into Apache Tika for this (it has Python bindings), but it requires Java. There are other solutions, but their dependencies seem worse. Open to ideas; a rough sketch of a Tika-based loader follows this list.
  3. Yep, covered by #6
  4. This is possible already, but I can make the README clearer. Run `semantra --show-semantra-dir` to see where embeddings are stored, or `semantra --semantra-dir [folder]` to change it.
  5. This is how Semantra works: nothing is re-embedded if it doesn't have to be (see the caching sketch after this list). Again, I could make this clearer in the README, since it's part of what makes Semantra different from some other tools.
  6. This is leveraged automatically when possible: Semantra will use the GPU via CUDA if it can (see the device-check sketch after this list). I should document this and add command-line options to control it.
  7 + 8. Summarization is currently not on my roadmap, since it's not clear how to do it in a non-problematic way. For instance, how do you force summarization to show sources in a robust way? How do you ensure it doesn't draw on outside knowledge? Avoiding summarization has been a guiding force in designing Semantra to be more useful with simpler, faster tools. See #17 and the README for more info. (I am open to revisiting this in the future if these concerns can be addressed.)
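
On 2, here's a minimal sketch of what a Tika-based loader could look like, assuming the `tika` PyPI package (which downloads and runs a local Java Tika server on first use). This is an illustration only, not Semantra's actual loader interface:

```python
# Sketch of a Tika-based text extractor (assumes `pip install tika`).
from tika import parser

def extract_text(path: str) -> str:
    """Extract plain text from nearly any file type via Apache Tika."""
    parsed = parser.from_file(path)
    # `content` can be None for files Tika cannot parse (e.g. raw binaries).
    return (parsed.get("content") or "").strip()

if __name__ == "__main__":
    print(extract_text("example.docx")[:500])
```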
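
On 5, the general idea is content-hash caching. A sketch only (the cache location and `embed_file` helper are hypothetical), not Semantra's actual implementation:

```python
# Sketch of content-hash caching to skip re-embedding unchanged files.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path.home() / ".semantra-cache"  # hypothetical location

def file_hash(path: Path) -> str:
    """Hash the file contents so renames/moves don't force re-embedding."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def get_embeddings(path: Path, embed_file) -> list:
    """Return cached embeddings if the content is unchanged, else compute."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / (file_hash(path) + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    embeddings = embed_file(path)  # stand-in for the real embedding step
    cache_file.write_text(json.dumps(embeddings))
    return embeddings
```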
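
And on 6, the device check is the standard PyTorch pattern (a sketch of the general approach, not Semantra's code):

```python
# Standard PyTorch pattern: use the GPU when CUDA is available.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)  # move the embedding model to the chosen device
print(f"Embedding on: {device}")
```
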
jasmeetchadha commented 1 year ago

I get your hesitance with summarization. I think highlighting source information alongside a well-written summary is pretty cool. My point is more in line with yours about keeping things local: getting that well-written "completion" from a local copy of Vicuna instead of OpenAI.

Funnily enough, I found a repo that does this; there's a demo at https://vault.pash.city/ and the source is at https://github.com/pashpashpash/vault-ai.

freedmand commented 1 year ago

The local LLMs idea is definitely intriguing. An additional layer you could add with local models: visually showing the model's attention over parts of the summary. You could click on a part of the summary and see which search-result context most informed it. I think I'd only be interested in summarization if we could take it a step further like this, with more explainability built in 😃

endolith commented 1 year ago

@jasmeetchadha You might be interested in LocalGPT, which also feeds embedding results to a local LLM. It requires more resources than my local machines can handle, though, and I don't trust the output of LLMs. Semantra looks more promising for the kinds of things I want to do: finding verbatim citations by concept, etc.

https://github.com/PromtEngineer/localGPT/discussions/186

I would sometimes want to feed the output to an LLM, though: for instance, to answer a question that requires combining multiple result chunks of technical text.

I also plan to use this to search documents in other languages (or multiple languages in one file), so it would be good if I could see an English translation of the citation alongside the quote, instead of having to copy and paste it into a translator. More generally, something that makes it easier to pipe the results into other tools would be nice. (But I haven't even tried Semantra yet.)