FocusedDiversity / synaptiq-hppo

The truth-aware document manager
GNU General Public License v3.0

Import Zohar's PDF library search work into HiPPO #19

Open erskine opened 1 year ago

erskine commented 1 year ago

As a user, I can ask a natural language question of the chatbot, and it will search a defined knowledge base, retrieve the best match, and summarize the specific section that triggered the match.

erskine commented 1 year ago

Start w/ Synaptiq employee handbook: https://docs.google.com/document/d/1PAtrP07KJQ0IWqNykedBUOkQ4Tc-uS3lEtRxBryH_IA/edit#heading=h.vos3aeas8reg

erskine commented 1 year ago

Will require a document/query encoder, likely SBERT.

erskine commented 1 year ago

For prior art, check out the Structural project: https://github.com/FocusedDiversity/structural_technologies-decision_support_poc/blob/main/structural/sentence_encoder.py

easel commented 12 months ago

> Will require a document/query encoder, likely SBERT.

I think https://www.sbert.net/docs/hugging_face.html is likely the best entry point for SBERT. The rest of the SBERT docs have tons of examples of semantic search, question answering, etc., and should provide a good bootstrap into the space for us. These models are also reasonably sized, so they shouldn't require the same horsepower as the generative ones.
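
For a concrete starting point, a minimal semantic-search sketch along the lines of the SBERT docs might look like the following. The model name and handbook chunks here are placeholders, not decisions:

```python
# Minimal semantic-search sketch with sentence-transformers (SBERT).
# "all-MiniLM-L6-v2" is just a small, CPU-friendly placeholder model;
# the corpus would be the chunked handbook text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Employees accrue PTO at a rate of ...",        # placeholder handbook chunks
    "Expense reports must be submitted by ...",
    "The company observes the following holidays ...",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "How much vacation do I get?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine-similarity search; top_k controls how many chunks we retrieve
# to feed into the generative summarizer.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
best = hits[0]
print(corpus[best["corpus_id"]], best["score"])
```

The retrieved chunk (plus its score) is what we'd hand to the summarization prompt downstream.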

easel commented 12 months ago

> For prior art, check out the Structural project: https://github.com/FocusedDiversity/structural_technologies-decision_support_poc/blob/main/structural/sentence_encoder.py

https://tfhub.dev/google/universal-sentence-encoder/4 appears to be the upstream reference for the "Universal Sentence Encoder". There's a link to the paper from 2018 at the bottom of the page, so I think it's roughly from the same generation as BERT/SBERT (2018/2019).
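
If we want to compare against the USE path from the Structural project, loading it from TF Hub is roughly this (the sentences are placeholders; USE returns 512-dim embeddings):

```python
# Rough sketch of the Universal Sentence Encoder via TF Hub, for comparison
# against SBERT on the same handbook chunks.
import tensorflow_hub as hub
import numpy as np

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How much vacation do I get?",          # placeholder query
    "Employees accrue PTO at a rate of ...",  # placeholder handbook chunk
]
embeddings = embed(sentences).numpy()

# Cosine similarity between the query and the chunk.
sim = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(sim)
```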

I'd say for now we need to get the simplest, fastest thing going for the encoding and context selection, and spend most of our energy on the chunking and prompting for the generative summary. It would be good to wire up llama-7b, gpt-3.5, and gpt-4 on the generative side so we can compare their performance for the same prompt. Obviously, using gpt-4 to maximize the quality of our summaries in demos won't hurt anything either.
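
A rough sketch of running the same retrieved chunk and prompt through the OpenAI models for comparison; the llama-7b path would go through whatever local serving layer we pick, so it's stubbed out here, and the model names and prompt wording are placeholders:

```python
# Sketch: same summarization prompt against multiple generative models.
# Uses the OpenAI Python client for the GPT models; assumes OPENAI_API_KEY
# is set in the environment. llama-7b is omitted until we pick a serving layer.
from openai import OpenAI

client = OpenAI()

def summarize(model: str, chunk: str, question: str) -> str:
    prompt = (
        "Answer the question using only the handbook excerpt below, "
        "then summarize the relevant section.\n\n"
        f"Excerpt:\n{chunk}\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

for model in ("gpt-3.5-turbo", "gpt-4"):
    answer = summarize(
        model,
        chunk="<retrieved handbook chunk>",   # placeholder
        question="How much vacation do I get?",
    )
    print(model, answer)
```

Keeping the prompt identical across models should make the quality comparison straightforward once the retrieval side is wired up.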