dssjon / biblos

www.biblos.app

Symmetric vs. Asymmetric Prompts #13

Closed HanClinto closed 11 months ago

HanClinto commented 11 months ago

I'm still very new to LLMs and LangChain, so apologies for fumbling my way through all of this. If I'd known then what I know now, I wouldn't have done everything that I did in my earlier PRs. :)

Reading through the INSTRUCTOR documentation, Table 7 lists the queries they used in their datasets. It appears that in tasks where they searched across documents using natural language (as we are doing), most of the examples use asymmetric instructions:

MSMARCO

Query instruction: Represent the [domain] question for retrieving evidence documents:
Doc instruction: Represent the [domain] document for retrieval:

gooaq_pairs

Query instruction: Represent the Google question for retrieving answers:
Doc instruction: Represent the Google answer for retrieval:

yahoo_answers_title_answer

Query instruction: Represent the Yahoo question for retrieving answers:
Doc instruction: Represent the Yahoo answer for retrieval:

eli5_question_answer

Query instruction: Represent the ELI5 question for retrieving answers:
Doc instruction: Represent the ELI5 answer for retrieval:

squad_pairs

Query instruction: Represent the Squad question for retrieving evidence documents:
Doc instruction: Represent the Squad document for retrieval:

Natural Questions

Query instruction: Represent the Wikipedia question for retrieving supporting documents:
Doc instruction: Represent the Wikipedia document for retrieval:

amazon-qa

Query instruction: Represent the Amazon question for retrieving answers:
Doc instruction: Represent the Amazon answer for retrieval:

And so on.
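For context on how these pairs get consumed: INSTRUCTOR models take an (instruction, text) pair per input, so "asymmetric" just means passing a different instruction on each side of the retrieval task. A minimal sketch using the InstructorEmbedding package directly (the sample texts and the similarity check are mine, not from our code):

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR("hkunlp/instructor-large")

# Asymmetric: the query and the passage each get their own instruction.
query_emb = model.encode([
    ["Represent the Religious question for retrieving related passages: ",
     "Who wept at the tomb of Lazarus?"],
])
doc_emb = model.encode([
    ["Represent the Religious passage for retrieval: ",
     "Jesus wept."],
])

print(cosine_similarity(query_emb, doc_emb))  # higher = closer match
```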

Currently it appears that we're using the same instruction for both the query side and the document side:

Query: https://github.com/dssjon/biblos/blob/main/app.py#L26
Doc: https://github.com/dssjon/biblos/blob/main/data/create_db.py#L20
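If those lines construct LangChain's HuggingFaceInstructEmbeddings (which exposes separate embed_instruction and query_instruction parameters), the symmetric setup presumably looks something like the sketch below; the instruction string is a stand-in, not copied from our code:

```python
from langchain.embeddings import HuggingFaceInstructEmbeddings

# Symmetric: one instruction (placeholder text) reused on both sides,
# so queries and documents are embedded identically.
instruction = "Represent the Religious text for retrieval: "
embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    embed_instruction=instruction,   # applied to documents at index time
    query_instruction=instruction,   # applied to queries at search time
)
```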

So perhaps we should use something more along the lines of:

query_instruction = 'Represent the Religious question for retrieving related passages: '
doc_instruction = 'Represent the Religious passage for retrieval: '

We could try a few different ones -- searching for "related" passages vs. "supporting" passages, calling them "verses" instead of "passages", etc.
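One cheap way to compare candidates before wiring anything into the app: embed a few known query/verse pairs under each candidate instruction pair and see which ranks the intended verse highest. A rough sketch (the candidate strings and sample verses are just illustrative):

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR("hkunlp/instructor-large")

# Candidate (query_instruction, doc_instruction) pairs to compare.
candidates = [
    ("Represent the Religious question for retrieving related passages: ",
     "Represent the Religious passage for retrieval: "),
    ("Represent the Religious question for retrieving supporting verses: ",
     "Represent the Religious verse for retrieval: "),
]

query = "Who wept at the tomb of Lazarus?"
passages = [
    "Jesus wept.",
    "In the beginning God created the heaven and the earth.",
]

for q_inst, d_inst in candidates:
    q = model.encode([[q_inst, query]])
    d = model.encode([[d_inst, p] for p in passages])
    # First score should beat the second if the instructions work well.
    print(q_inst, cosine_similarity(q, d)[0])
```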

Any thoughts?

dssjon commented 11 months ago

Definitely, let's try them out and compare results!

dssjon commented 11 months ago

PR: https://github.com/dssjon/biblos/pull/15/files adds tests to compare different query instructions.