ersilia-os / ersilia-assistant

A chat assistant to help you design experimental workflows in drug discovery using Ersilia models with ease!
GNU General Public License v3.0

Thoughts about the output of the Ersilia Assistant #16

Open miquelduranfrigola opened 2 weeks ago

miquelduranfrigola commented 2 weeks ago

Hi @DhanshreeA, below are some thoughts about the output of the Ersilia Assistant. I hope they are useful... I will basically lay out a mock example and then we can think about how to get there.

Query

I have a library of 1000 natural product compounds and I want to predict their antimalarial activity. These compounds need to be soluble.

Desired output

Part 1: Query comprehension

Here is what I understood from your query:

- You have a library of 1000 natural product compounds.
- You want to predict their antimalarial activity.
- You want to keep only compounds that are soluble.

Part 2: Up to 10 suggested models, with a short explanation

To predict antimalarial activity, you can use the following models from the Ersilia Model Hub:

eos4zfy: Title

A 200-character explanation of how the model relates to the query and why it is important.

eos7kpb: Title

A 200-character explanation of how the model relates to the query and why it is important. In this case, it is important to emphasize that the model covers not only malaria but also other outputs such as tuberculosis.

To predict aqueous solubility, the following model is available:

eos74bo: Title

A 200-character explanation of how the model relates to the query and why it is important.

(The above could be laid out as a table, but I am not 100% convinced.)

Part 3: The commands

Make sure your list of 1000 natural product compounds is stored as a one-column CSV file (my_molecules.csv) in SMILES format, with a header.
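For instance, my_molecules.csv could start like this (hypothetical compounds; I am assuming a smiles header, to be adapted to whatever Ersilia expects):

smiles
CC(=O)OC1=CC=CC=C1C(=O)O
CN1C=NC2=C1C(=O)N(C)C(=O)N2C
...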

Make sure Ersilia is installed. Then, you can run the following commands:

# Fetch models locally

ersilia fetch eos4zfy
ersilia fetch ...

# Run predictions for malaria

ersilia serve eos4zfy
ersilia run -i my_molecules.csv -o my_molecules_eos4zfy.csv
ersilia close

ersilia serve ...

# Run predictions for aqueous solubility

ersilia serve ...

As a result of running these commands, you will have 3 files, namely my_molecules_eos4zfy.csv, ... Feel free to merge them to rank or filter your molecules!
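For instance, merging could look something like this (a sketch assuming the three output files are named after the models from Part 2 and share an input column holding the SMILES, which is an assumption about the Ersilia output format):

import pandas as pd

# Hypothetical merge of the three output files on a shared input column
files = ["my_molecules_eos4zfy.csv", "my_molecules_eos7kpb.csv", "my_molecules_eos74bo.csv"]
dfs = [pd.read_csv(f) for f in files]
merged = dfs[0]
for df in dfs[1:]:
    merged = merged.merge(df, on="input", suffixes=("", "_dup"))
merged.to_csv("my_molecules_merged.csv", index=False)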

Part 4: Final remarks

Here we just provided a suggested list of models. To explore the full list, please visit the Ersilia Model Hub browser. If you did not find what you are looking for, please let us know via GitHub Discussions!

How to achieve this outcome?

The above is just a suggestion to get started, and we can definitely improve it or build upon it.

In the backend, I suggest that the following happens:

  1. Query comprehension: We need to engineer a good prompt to translate any given query into a set of bullet points. Importantly, if there is more than one request inside the query (for example, malaria and solubility), the query comprehension needs to clearly segregate them into different bullet points.

  2. Embedding-based search on model metadata: Based on the processed query (point 1), we can do an embedding-based search to retrieve the most relevant model identifiers (up to 10). Note that, in the provided example, there are two semantically different areas of interest, namely malaria and solubility, so we need to check empirically whether one or several cosine-similarity searches should be done. In any case, the result of this step is a list of model identifiers (see the embedding-search sketch after this list).

  3. LLM-based re-ranking with explanations: Ideally, this should take each of the model identifiers from above, re-rank them if necessary (and if it works), and provide an explanation for each. The explanation should be directly related to the query, so it cannot be something that is pre-computed. However, for prompt-engineering purposes, we could have succinct summaries (TLDRs) of the models to aid the re-ranking and the explanation; these TLDRs can be produced offline with GPT-4, not a problem. Using the whole metadata here, for each of the 10 models or so, would be too much in terms of tokens. It will all depend on latency: we need to evaluate how many queries to the LLM we can afford within a reasonable time.

  4. Recipe: To produce the recipe, I think we need a mix of plain string concatenation (e.g. "ersilia fetch {0}".format(model_id)) and a bit of explanation from the LLM. The way I would approach this is as follows. First, we generate the script using string concatenation (see the recipe sketch after this list). Then, we give this script to the LLM and ask it to provide context, that is, add comments (e.g. # run predictions for malaria) plus a short paragraph at the beginning and a concluding paragraph at the end, perhaps including remarks such as recommendations to engage with Ersilia or ask for more in GitHub Discussions.
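As a minimal sketch of the embedding-based search from point 2 (assuming the query and model-metadata embeddings are already computed with whatever embedding model we end up choosing; the function name is just a placeholder):

import numpy as np

def top_k_models(query_embedding, model_embeddings, model_ids, k=10):
    # Cosine similarity via normalized dot products
    q = query_embedding / np.linalg.norm(query_embedding)
    m = model_embeddings / np.linalg.norm(model_embeddings, axis=1, keepdims=True)
    scores = m @ q
    # Return up to k model identifiers, best match first
    top = np.argsort(-scores)[:k]
    return [model_ids[i] for i in top]

If the comprehended query contains several bullet points (e.g. malaria and solubility), this could be run once per bullet point and the resulting lists merged.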
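And a rough sketch of the string-concatenation part of point 4 (the function name and the output-file naming convention are placeholders):

def compose_recipe(model_ids, input_file="my_molecules.csv"):
    # Deterministic part of the recipe, built by plain string formatting
    lines = ["# Fetch models locally"]
    for model_id in model_ids:
        lines.append("ersilia fetch {0}".format(model_id))
    for model_id in model_ids:
        output_file = input_file[:-4] + "_{0}.csv".format(model_id)
        lines += [
            "",
            "ersilia serve {0}".format(model_id),
            "ersilia run -i {0} -o {1}".format(input_file, output_file),
            "ersilia close",
        ]
    return "\n".join(lines)

The resulting script would then be handed to the LLM, which only needs to add comments and the surrounding context paragraphs.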

Some final thoughts

miquelduranfrigola commented 2 weeks ago

Hi again @DhanshreeA. Here is a mini-script I have drafted to make my ideas clearer. Take this as guidance: it is not supposed to work and it is not supposed to be efficient. I did not test it; I just wrote it without internet while we were experiencing a power cut.

The script assumes that a tldr folder exists containing short summaries of the models.
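For instance, a file like eos4zfy.json in that folder could contain something along these lines (hypothetical contents, matching the title and tldr keys the script reads):

{
    "title": "Antimalarial activity prediction",
    "tldr": "Predicts whether a compound is likely to inhibit the growth of the malaria parasite Plasmodium falciparum."
}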

import json
import os

ROOT = os.path.abspath(os.path.dirname(__file__))
TLDR_DIR = os.path.join(ROOT, "..", "data", "tldr")

SYSTEM_PROMPT_FOR_QUERY_COMPREHENSION = """
You are a biomedical and drug discovery expert responding to a user query.
You need to comprehend and interpret the user query and divide it into bullet points.
Bullet points should be concise, unambiguous, and contain only one concept each.
Also, bullet points should be logically ordered, if possible.
Importantly, if the query is not related to biomedicine or drug discovery, you should raise a concern and ask the user to query only biomedical or drug discovery concepts.
""".strip()

SYSTEM_PROMPT_FOR_MODEL_RERANKING_AND_EXPLANATION = """
You are a biomedical and drug discovery expert.
You will be given a previously ranked list of computational model identifiers from the Ersilia Model Hub, along with a short description of each.
You will also be provided with a query from the user.
You should do the following:
1. Filter out models that are not relevant to the user's query.
2. Re-rank models only if you are very confident about the re-ranking: models are already pre-ranked and you should trust that ranking unless you are very confident about modifying it.
3. Group models if they clearly belong to different categories corresponding to distinct bullet points from the user's query. Within each group, take the ranking into account.
4. Provide a short explanation (maximum 200 characters) for why you think each model is relevant to the user. For this, you can leverage the description of the model. However, you should customize the explanation to respond directly to the user's query.
Structure your answer in Markdown format as follows:
Start with a short introduction addressed to the user.
Present models as sorted items, along with the explanation:
1. Model identifier 1: Title 1. Explanation 1
2. Model identifier 2: Title 2. Explanation 2
3. etc.
""".strip()

SYSTEM_PROMPT_FOR_CONCLUDING = """
You should wrap up a chat session with a user.
The user has previously asked a question and you have provided an answer by listing a set of tools from the Ersilia Model Hub that the user could use.
Simply thank the user for using the Ersilia Model Hub and encourage them to browse the Ersilia Model Hub for more (https://ersilia.io/model-hub) or go to GitHub Discussions () if they did not find what they wanted.
In your thank-you note, briefly make a reference to their query, summarizing it in a few words.
Finally, mention that if an error is observed while running the code, they should open a bug issue in the Ersilia Assistant GitHub repository (https://github.com/ersilia-os/ersilia-assistant/issues).
""".strip()

# Compose the system/user prompt pair for the query comprehension step
def query_comprehension_composer(user_query):
    prompt = {
        "system": SYSTEM_PROMPT_FOR_QUERY_COMPREHENSION,
        "user": user_query
    }
    return prompt

# Compose the re-ranking prompt, attaching each model's TLDR summary
def reranking_and_explanation_composer(model_ids, comprehended_query):
    tldrs = []
    for model_id in model_ids:
        with open(os.path.join(TLDR_DIR, "{0}.json".format(model_id))) as f:
            data = json.load(f)
            tldrs += [(model_id, data["title"], data["tldr"])]
    user_prompt = "These are the previously ranked models with their TLDRs:\n"
    for i, tldr in enumerate(tldrs):
        user_prompt += "- {0}. {1}: {2}. {3}\n".format(i+1, tldr[0], tldr[1], tldr[2])
    user_prompt += "The user originally queried the following:\n"
    user_prompt += comprehended_query
    prompt = {
        "system": SYSTEM_PROMPT_FOR_MODEL_RERANKING_AND_EXPLANATION,
        "user": user_prompt
    }
    return prompt

# Compose the prompt for the concluding wrap-up message
def conclusion_composer(model_ids, comprehended_query):
    user_prompt = "This is the query from the user:\n"
    user_prompt += comprehended_query + "\n"
    user_prompt += "These were the model identifiers returned by the chat assistant:\n"
    user_prompt += "{0}\n".format(", ".join(model_ids))
    prompt = {
        "system": SYSTEM_PROMPT_FOR_CONCLUDING,
        "user": user_prompt
    }
    return prompt
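
To make the intended flow explicit, the composers would be chained roughly like this (chat is a placeholder for whatever LLM client we end up using, and the model identifiers would come from the embedding-based search of point 2):

def chat(prompt):
    # Placeholder: send prompt["system"] and prompt["user"] to the LLM of
    # choice and return its text reply; the actual client is out of scope here.
    raise NotImplementedError

if __name__ == "__main__":
    user_query = "I have a library of 1000 natural product compounds and I want to predict their antimalarial activity. These compounds need to be soluble."
    comprehended_query = chat(query_comprehension_composer(user_query))
    model_ids = ["eos4zfy", "eos7kpb", "eos74bo"]  # placeholder for the embedding-based search results
    reranked = chat(reranking_and_explanation_composer(model_ids, comprehended_query))
    conclusion = chat(conclusion_composer(model_ids, comprehended_query))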