Improved model incorporation experience (with LLMs)

Hi @DhanshreeA, I wanted to suggest a few improvements to the pipeline for model incorporation.

Background

The current model incorporation pipeline (based on GitHub Issues) works nicely but, consistently, we get poor metadata from our contributors. We should have a better annotation of the models from the inception, and I think that LLMs can greatly help us here.

In particular, we already have two scripts to (a) summarize a publication PDF and (b) generate metadata from a raw metadata file that, in my opinion, can be exploited for this.

Therefore, I suggest an update of the pipeline as specified below.

Suggested pipeline

The user opens a model request. In that model request, the form can be more simplified than it is now (for example, no need to ask for tags yet). Actually, what we need is a suggested title, a drafted description, and the links to the publication and code. This is all.
As soon as a model request is posted, we could trigger an action that modifies the title of the issue and directly assigns it a model identifier. This will have two advantages: (a) it will make it easier for all of us to find issues associated with models and (b) it will already tell the Ersilia team which is the associated identifier to this thread, which will have good implications for the next point (3).
As soon as a model identifier is available (in the title of the issue) and a link to the publication is found. The Ersilia maintainers can download the publication and store it in the S3 bucket, correspondingly. This is a necessary manual step that will ensure that we have the publication in PDF form available already at this early stage.
Then, we can start discussion in the thread as normal but, at some point, the Ersilia team can trigger a /stage command. I am calling it /stage but we can call it /pre-approve or /annotate (I like /stage - unless it is reserved by GitHub?). The goal of this command will be to launch the following LLM-based workflows: (a) produce a summary of the publication and (b) produce an improved version of the metadata based on the summary of the publication and the raw metadata provided by the user on the first message of the issue. As a result of this workflow, we will have the summary of the publication stored in S3 and the improved metadata, in Markdown format, posted as a comment in the issue thread.
The contributor will be asked to revise the improved metadata, and edit the comment correspondingly, until we are all satisfied with the content. In this improved metadata comment, a slug is provided, an improved description, tags, etc.
Finally, we can /approve the incorporation of the model. In this case, the workflows will proceed normally, that is, a new model repository will be created corresponding to the model identifier, and a comment will be posted encouraging the contributor to work on it. There are a few things to be considered. First, note that the model identifier was already generated in point 1, so there is no need to generate a new one. Second, the metadata to generate the metadata.yml file should be obtained from the improved metadata comment, not from the first comment of the issue. Finally, we may want to add a comment somewhere notifying that this procedure is assisted with an LLM.

About the LLM

I would definitely use OpenAI GPT-4o (or more) for this. There is no need to use any custom LLM or any self-hosted solution.

Objective(s)

Improve the model incorporation workflow with LLM assistance.

Documentation

N/A

ersilia-os / ersilia

🐕 Batch: An LLM-improved model incorporation pipeline #1382