Summary

Hi @DhanshreeA,

Below I am writing my thoughts on LLM processing of publications and metadata. Feel free to take it from here, including breaking this into smaller tasks.
Background
In the branch model-incorporation-with-llms I have incorporated the following two files:

1. /.github/scripts/suggest_metadata_with_llm.py (⚠️ This file has not been tested at all! I hope the structure makes sense, but it will certainly need further development. For now, it is just a starting point.)
2. /.github/scripts/summarize_publication_with_llm.py

What are these scripts doing?

1. summarize_publication_with_llm.py takes a publication PDF as input and produces a summary of it in Markdown format. It uses OpenAI in the background. The publication PDF can be specified in several ways, for example by model_id (it will be fetched from S3) or locally with a file path.
2. suggest_metadata_with_llm.py takes the publication summary from point 1, together with raw metadata specified by the user, and creates a (much improved) metadata file in Markdown format. Here, raw metadata is defined as the current metadata of existing models or, in the future (see below), the information provided by the user at model incorporation time. The goal of this script is to use the LLM to improve the metadata provided by the user, which, in our experience, is typically suboptimal. The publication summary can be specified e.g. by model_id (it will be fetched from S3) or passed locally as a file. The same applies to the raw metadata. The script treats Markdown, YAML and JSON metadata formats indistinctly.
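To make the intended behavior of suggest_metadata_with_llm.py more concrete, here is a minimal sketch of the core logic. Everything below is an assumption about how the script could work, not the actual implementation: the helper names, the prompt wording and the gpt-4o model string are placeholders. The idea is to load the raw metadata regardless of whether it comes as Markdown, YAML or JSON, pair it with the publication summary, and ask OpenAI for an improved metadata file in Markdown format.

```python
import json
from pathlib import Path

import yaml  # PyYAML
from openai import OpenAI  # openai>=1.0


def load_raw_metadata(path: str) -> str:
    """Read raw metadata from Markdown, YAML or JSON and return it as plain text.

    The LLM receives the metadata as a string, so the three formats can be
    treated indistinctly: structured formats are normalized to YAML text.
    """
    text = Path(path).read_text()
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        return yaml.safe_dump(json.loads(text))
    if suffix in (".yml", ".yaml"):
        return yaml.safe_dump(yaml.safe_load(text))
    return text  # assume Markdown (or any plain text)


def suggest_metadata(summary_md: str, raw_metadata: str) -> str:
    """Ask the LLM for an improved metadata file in Markdown format."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "You are curating metadata for a model in the Ersilia Model Hub.\n\n"
        "Publication summary:\n" + summary_md + "\n\n"
        "Raw metadata provided by the user:\n" + raw_metadata + "\n\n"
        "Return an improved, complete metadata file in Markdown format."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```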
Some pending issues
[ ] I am not 100% sure whether these files should live in .github/scripts or in eos-template. It is quite possible that the prompts evolve over time, so having them in eos-template (and therefore replicated in every model) would give us easy traceability, since we would know the exact prompts used for each model. On the other hand, it is one more script to maintain that is not fully centralized. I would discard having these files in ersilia-assistant, and I would also rule out creating yet another repository. One possibility would be to create a scripts/ folder in ersilia where we put this kind of utility. But if we do that, we would need to be clear about the difference between .github/scripts and scripts/, and we should also consider the extra dependencies involved and the potential risk of disorganized growth of this subfolder. Let's decide.
[ ] There is a TODO method in the metadata script. This method fetches information by issue_number and is aimed at getting the metadata from the first comment of the issue, i.e. the issue header (see below; a possible sketch is included after this list).
[ ] Finally, I have not tested these files, especially suggest_metadata_with_llm.py. It is likely that they will fail. However, I hope they are a good starting point.
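Regarding the TODO above, a minimal sketch of what that method could look like is shown below. The repository name and the helper name are assumptions; the call itself is the standard GitHub REST API for reading an issue, whose body is the issue header we want.

```python
import os

import requests


def fetch_issue_header(issue_number: int, repo: str = "ersilia-os/ersilia") -> str:
    """Fetch the opening body (issue header) of a Model Request issue.

    Uses the standard GitHub REST API; a token is only needed for private
    repositories or to avoid rate limits.
    """
    url = f"https://api.github.com/repos/{repo}/issues/{issue_number}"
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()["body"]  # the issue header, as Markdown text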
What's the immediate application?
The immediate application is for the ersilia-assistant tool. Here, we can incorporate the LLM-generated model metadata, as well as the publication summaries, to enhance the RAG system.
In this case, the steps are quite clear:
[ ] Generate publication summaries for all current publications in the Ersilia Model Hub. These publications are expected to be found in the corresponding S3 bucket.
[ ] Generate LLM metadata for all models in the Ersilia Model Hub using the current metadata for these models, as well as publication summaries generated previously.
[ ] Index the information above in the RAG system (a rough end-to-end sketch is included after this list).
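As a rough illustration of these three steps, the sketch below loops over models with a publication PDF in S3 and calls the two scripts for each of them. All specifics here are assumptions: the bucket name, the key layout, and the CLI flags of the scripts are placeholders, and the RAG indexing step is left as a stub since it depends on the ersilia-assistant internals.

```python
import subprocess

import boto3

BUCKET = "ersilia-model-publications"  # placeholder bucket name
s3 = boto3.client("s3")


def list_model_ids() -> list[str]:
    """List model identifiers that have a publication PDF in S3.

    Assumes keys of the form publications/<model_id>.pdf.
    """
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix="publications/")
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return [k.split("/")[-1].removesuffix(".pdf") for k in keys if k.endswith(".pdf")]


def index_in_rag(model_id: str, files: list[str]) -> None:
    """Stub: push the generated documents to the ersilia-assistant RAG index."""
    ...


for model_id in list_model_ids():
    # Step 1: publication summary (assumed --model_id and --output flags).
    subprocess.run(
        ["python", ".github/scripts/summarize_publication_with_llm.py",
         "--model_id", model_id, "--output", f"{model_id}_summary.md"],
        check=True,
    )
    # Step 2: LLM-improved metadata from the current metadata plus the summary.
    subprocess.run(
        ["python", ".github/scripts/suggest_metadata_with_llm.py",
         "--model_id", model_id, "--output", f"{model_id}_metadata.md"],
        check=True,
    )
    # Step 3: index both documents in the RAG system (implementation TBD).
    index_in_rag(model_id, [f"{model_id}_summary.md", f"{model_id}_metadata.md"])
```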
What's the bigger picture?
In the long term, we want to achieve a better experience at model incorporation time. I suggest the following steps, although this is open for discussion:
1. User opens a Model Request issue. At this point, they are asked to provide as much information as possible, including input type and other fields that are not currently requested. Let's call this the issue header.
2. Once the issue header has been opened, (a) a model_id will be announced, and (b) a link will be provided to upload a publication PDF. This link should redirect to a file submission page that will store the publication PDF in the S3 bucket. This submission can be done by the contributor or by someone from the Ersilia team.
3. Upon submission of the publication, a publication summary will be automatically generated. Ideally, this is generated via CI/CD using the summarize_publication_with_llm.py script.
4. Once the issue header is "as good as it gets" and the publication summary is available, we can generate a new version of the metadata (LLM-based) using the suggest_metadata_with_llm.py script. This script will use the issue number to read the metadata from the issue header, and the model identifier generated in point 2 to fetch the publication summary from S3. The trigger for this action can be an '/annotate' comment by an Ersilia maintainer.
5. The point above will generate improved metadata that should be posted as a new comment in the issue (a minimal sketch of the posting step is included after these steps). This metadata can be edited based on discussion. Let's call this the metadata comment.
6. Once the metadata comment is revised, an '/approve' trigger will generate the ersilia-os/model_id repository and will notify the contributor in the issue, as is commonly done. This time, the metadata.yml will be generated from the metadata comment instead of the issue header.
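For point 5, posting the metadata comment is a standard GitHub API call. The sketch below is an assumption about how the '/annotate' automation could publish the LLM-suggested metadata back to the issue; the repository name, function name and suggested_metadata_md argument are placeholders, while the endpoint itself is the standard issue comments API.

```python
import os

import requests


def post_metadata_comment(issue_number: int, suggested_metadata_md: str,
                          repo: str = "ersilia-os/ersilia") -> str:
    """Post the LLM-suggested metadata as a new comment on the Model Request issue.

    Returns the URL of the created comment so it can be linked from workflow logs.
    """
    url = f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    }
    response = requests.post(url, headers=headers,
                             json={"body": suggested_metadata_md}, timeout=30)
    response.raise_for_status()
    return response.json()["html_url"]
```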
Objective(s)
Generate publication summaries with LLMs.
Generate model metadata with LLMs.
Improve the ersilia-assistant RAG system.
Improve model incorporation experience by facilitating model understanding (publications) and metadata annotation with LLMs.
Documentation
N/A