ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🐕 Batch: Publications and Metadata processed with LLMs #1354

Open miquelduranfrigola opened 1 week ago

miquelduranfrigola commented 1 week ago

Summary

Hi @DhanshreeA,

Below I am writing my thoughts around the LLM processing of publications and metadata. Feel free to take it from here, including breaking this into smaller tasks.

Background

In the branch model-incorporation-with-llms I have incorporated the following two files:

  1. /.github/scripts/suggest_metadata_with_llm.py (⚠️ This file has not been tested at all! I hope the structure makes sense but it will need further development for sure. For now, it is just a starting point)
  2. /.github/scripts/summarize_publication_with_llm.py

What are these scripts doing?

  1. summarize_publication_with_llm.py takes a publication PDF as input and produces a summary of it in Markdown format. It uses OpenAI in the background. The publication PDF can be specified in multiple ways, for example by model_id (in which case it is fetched from S3) or locally with a file path.
  2. suggest_metadata_with_llm.py takes the publication summary produced by the script above, together with raw metadata specified by the user, and creates a (much improved) metadata file in Markdown format. Here, raw metadata means the current metadata for existing models or, in the future (see below), the information provided by the user at model incorporation time. The goal of this script is to use the LLM to improve the metadata provided by the user, which, in our experience, is typically suboptimal. The publication summary can be specified by model_id (in which case it is fetched from S3) or passed locally as a file. The same applies to the raw metadata. The script accepts Markdown, YAML and JSON metadata interchangeably.
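
The input handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in the branch: the function names, the `PublicationSource` type and the S3 key template are all hypothetical, since this issue does not specify the bucket layout or the scripts' actual interfaces. The sketch assumes the metadata is handed to the LLM as text, so it reads Markdown, YAML and JSON files the same way, without parsing them.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# ASSUMPTION: hypothetical S3 key layout; the real one is not given in this issue.
S3_KEY_TEMPLATE = "publications/{model_id}.pdf"

# Formats the metadata script is said to accept.
METADATA_SUFFIXES = {".md", ".yml", ".yaml", ".json"}


@dataclass(frozen=True)
class PublicationSource:
    kind: str  # "s3" or "local"
    location: str  # S3 key or local file path


def resolve_publication(
    model_id: Optional[str] = None, file_path: Optional[str] = None
) -> PublicationSource:
    """Decide where the publication PDF comes from (hypothetical helper)."""
    # Exactly one of the two ways of specifying the PDF must be used.
    if (model_id is None) == (file_path is None):
        raise ValueError("Provide exactly one of model_id or file_path")
    if model_id is not None:
        return PublicationSource("s3", S3_KEY_TEMPLATE.format(model_id=model_id))
    return PublicationSource("local", file_path)


def read_raw_metadata(path: str) -> str:
    """Return the metadata file as plain text, whatever its format."""
    file = Path(path)
    if file.suffix.lower() not in METADATA_SUFFIXES:
        raise ValueError(f"Unsupported metadata format: {file.suffix}")
    return file.read_text(encoding="utf-8")
```

For example, `resolve_publication(model_id="eos0abc")` (a placeholder identifier) would point at the S3 copy, while `resolve_publication(file_path="paper.pdf")` would point at a local file. Keeping the resolution step separate from the download and the LLM call should make it easier to test without network access.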

Some pending issues

What's the immediate application?

The immediate application is for the ersilia-assistant tool. Here, we can incorporate the LLM-generated model metadata, as well as the publication summaries, to enhance the RAG system.

In this case, the steps are quite clear:

What's the bigger picture?

In the long term, we want to achieve a better experience at model incorporation time. I suggest the following steps, although this is open for discussion:

  1. User opens a Model Request issue. At this point, they are asked to provide as much information as possible, including fields such as input type that are not currently requested. Let's call this the issue header.
  2. Once the issue header has been submitted, (a) a model_id will be announced and (b) a link will be provided to upload a publication PDF. This link should redirect to a file submission page that will store the publication PDF in the S3 bucket. This submission can be done by the contributor or by someone from the Ersilia team.
  3. Upon submission of the publication, a publication summary will be automatically generated. Ideally, this is generated via CI/CD using the summarize_publication_with_llm.py script.
  4. Once the issue header is "as good as it gets" and the publication summary is available, we can generate a new, LLM-based version of the metadata using the suggest_metadata_with_llm.py script. This script will use the issue number to read the metadata from the issue header, and the model identifier generated in point 2 to fetch the publication summary from S3. The trigger for this action can be an '/annotate' comment by an Ersilia maintainer.
  5. The point above will generate improved metadata that should be posted as a new comment in the issue. This metadata can be edited based on discussion. Let's call this the metadata comment.
  6. Once the metadata comment has been revised, an '/approve' trigger will generate the ersilia-os/model_id repository and will notify the contributor in the issue, as is currently done. This time, the metadata.yml will be generated from the metadata comment instead of the issue header.
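
The comment triggers in the steps above could be handled along these lines. This is only a sketch under stated assumptions: the action names and the set of author associations treated as "Ersilia maintainer" are hypothetical, and the issue does not say how the triggers will actually be wired. The fields read from the event (`comment.body`, `comment.author_association`, `issue.number`) are those of GitHub's `issue_comment` webhook payload.

```python
from typing import Optional

# Slash commands named in the steps above, mapped to hypothetical action names.
ACTIONS = {"/annotate": "suggest_metadata", "/approve": "create_repository"}

# ASSUMPTION: which author associations count as an Ersilia maintainer.
MAINTAINER_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}


def parse_command(comment_body: str) -> Optional[str]:
    """Return the slash command if the first word of the comment is one."""
    for line in comment_body.splitlines():
        words = line.split()
        if words:
            # Only the first non-empty line is inspected.
            return words[0] if words[0] in ACTIONS else None
    return None


def next_action(event: dict) -> Optional[dict]:
    """Map an issue-comment event payload to an action, or None to ignore it."""
    comment = event["comment"]
    # Ignore commands from anyone who is not a maintainer.
    if comment.get("author_association") not in MAINTAINER_ASSOCIATIONS:
        return None
    command = parse_command(comment.get("body") or "")
    if command is None:
        return None
    return {"issue_number": event["issue"]["number"], "action": ACTIONS[command]}
```

In a CI/CD workflow, the returned action could decide whether to run suggest_metadata_with_llm.py for the given issue number or to start repository creation. Checking the author association first means that a '/approve' comment from a contributor is ignored.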

Objective(s)

  1. Generate publication summaries with LLMs.
  2. Generate model metadata with LLMs.
  3. Improve the ersilia-assistant RAG system.
  4. Improve model incorporation experience by facilitating model understanding (publications) and metadata annotation with LLMs.

Documentation

N/A

GemmaTuron commented 1 day ago

Hi @miquelduranfrigola

Please could you move this issue to the ersilia-assistant repository where it belongs?