Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search
https://arxiv.org/abs/2202.08904
MIT License
823 stars, 51 forks

Usage for semantic search #47

Open rut00 opened 4 months ago

rut00 commented 4 months ago

Hello, I want to build a semantic search feature. The model is accurate at calculating the similarity between synonyms. Here are a few excerpts from my demo. Suppose my dataset has the following lines:

User: Who is the author of "Romeo and Juliet"?
Model: "Romeo and Juliet" was written by William Shakespeare.

User: Describe the water cycle.
Model: The water cycle is the continuous movement of water on, above, and below the surface of the Earth, involving processes like evaporation, condensation, precipitation, and runoff.

User: Describe the process of DNA replication.
Model: DNA replication is the process of copying DNA to ensure genetic continuity during cell division.

And my prompts:

Prompt 1: rameo and juliet
Output: "Romeo and Juliet" was written by William Shakespeare.

Prompt 2: Tell me about water cycle
Output: The water cycle is the continuous movement of water on, above, and below the surface of the Earth, involving processes like evaporation, condensation, precipitation, and runoff.

Prompt 3: Deoxyribonucleic acid
Output: DNA replication is the process of copying DNA to ensure genetic continuity during cell division.

Prompt 4: what is python language
Output: Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll pigments.

The confidence value for each prompt ranges between 0.25 and 0.4. The issue I am facing is that the model gives nearly the same confidence value for right and wrong outputs, as seen in the output for Prompt 4. I want to show "No results found" if the query has no match in the dataset.

How do I solve this issue and make it more efficient? Thank you in advance.

Muennighoff commented 4 months ago

You're using the Cross-Encoder, correct?

rut00 commented 4 months ago

No, I am using the Asymmetric Semantic Search Bi-Encoder.

Muennighoff commented 4 months ago

I see, so you're saying that the cosine similarity for "what is python language" and "Photosynthesis is the process by which green plants and s..." is as high as the other ones?
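
For reference, the cosine similarity between a query embedding and a document embedding can be sketched as below. This is a minimal illustration only: the vectors are made-up toy values, not real SGPT outputs, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy embeddings (assumed values, for illustration only)
query_vec = [1.0, 0.0, 1.0]
doc_vec = [1.0, 1.0, 0.0]
score = cosine_similarity(query_vec, doc_vec)
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are close to orthogonal.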

rut00 commented 4 months ago

Yes. The confidence levels are so similar that I cannot put a threshold level for differentiating them.

Muennighoff commented 4 months ago

Hm, what model are you using? I'd recommend switching to a bigger / better one, specifically this one: https://huggingface.co/GritLM/GritLM-7B

rut00 commented 4 months ago

I am using this model: SGPT-125M-weightedmean-msmarco-specb-bitfit. I will try the recommended model.
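
Once the scores for matching and non-matching queries are far enough apart, the "No results found" behaviour asked for above could be sketched roughly as follows. Everything here is an assumption for illustration: the toy embeddings, the hypothetical helpers `cosine_similarity` and `search`, and the threshold of 0.5, which would need tuning on real queries for whichever model is used.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query_vec, corpus, threshold):
    """Return (best text, score), or ("No results found", score) below the threshold."""
    best_text, best_score = None, -1.0
    for text, doc_vec in corpus:
        s = cosine_similarity(query_vec, doc_vec)
        if s > best_score:
            best_text, best_score = text, s
    if best_text is None or best_score < threshold:
        return "No results found", best_score
    return best_text, best_score

# Toy corpus: (answer text, assumed embedding). Real embeddings come from the model.
corpus = [
    ('"Romeo and Juliet" was written by William Shakespeare.', [0.9, 0.1, 0.0]),
    ("DNA replication is the process of copying DNA to ensure genetic "
     "continuity during cell division.", [0.0, 0.2, 0.9]),
]

THRESHOLD = 0.5  # assumed value; tune on held-out queries

in_domain_result, in_domain_score = search([0.8, 0.2, 0.1], corpus, THRESHOLD)
off_topic_result, off_topic_score = search([0.1, 0.9, 0.1], corpus, THRESHOLD)
```

A fixed cut-off like this only helps when relevant and irrelevant scores are clearly separated. With scores all falling between 0.25 and 0.4, as reported in this thread, no threshold can tell them apart, which is why trying a stronger model is the first step.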