Background

Data is one of the foundations of deep learning, and this is especially true for AI for Life Science (AI4S): without high-quality, large-scale data as "fuel," the AI "engine" cannot reach its potential. Screening and collecting public data and managing local data are the first prerequisites for data quality and volume. High-throughput omics data is an important data source for AI4S. Public databases alone hold nearly a hundred petabytes of raw data spread across multiple repositories, and the sheer volume, complexity, and heterogeneity of these data make managing them effectively a hard problem.

Users of biological data also have specific retrieval needs, such as filtering datasets by standardized fields, so a database management tool needs biology-aware search capabilities. Tools such as ffq and GEOparse can fetch the metadata of public datasets by accession number, but the metadata they return is not uniform, and there is currently no mature mechanism for real-time synchronization between local and public databases or for intelligent retrieval across them.

In this project, we are building an LLM-driven tool to support Zhejiang Lab and collaborating teams in training large biological models, conducting bioinformatics research, and developing highly interpretable biological algorithms. We will combine LLM technology with OSS technology to track public database updates in real time, download data automatically, standardize metadata, and manage and distribute local data intelligently. More importantly, we aim to use this work to help establish more complete industry standards for biological database management.
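
To make the "non-uniform metadata" problem concrete, here is a minimal sketch of key harmonization followed by a search on a standardized field. The alias table, the two sample records, and the table layout are all made up for illustration, and the standard-library `sqlite3` module is used only as an offline stand-in for a real local database.

```python
import sqlite3

# Hypothetical alias table: maps source-specific keys to one unified field name.
FIELD_ALIASES = {
    "disease": "disease",
    "disease state": "disease",
    "tissue": "tissue",
    "source_name": "tissue",
    "organism": "organism",
    "scientific_name": "organism",
}
UNIFIED_FIELDS = ("disease", "tissue", "organism")


def harmonize(record: dict) -> dict:
    """Rename known keys to unified field names; unknown keys are dropped."""
    unified = dict.fromkeys(UNIFIED_FIELDS)  # fields not found stay None
    for key, value in record.items():
        field = FIELD_ALIASES.get(key.strip().lower())
        if field is not None and unified[field] is None:
            unified[field] = str(value).strip()
    return unified


# Two made-up records that describe similar samples with different keys.
record_a = {"Disease State": "breast cancer", "tissue": "breast",
            "organism": "Homo sapiens"}
record_b = {"disease": "Breast Cancer", "source_name": "mammary gland",
            "scientific_name": "Homo sapiens", "lab_note": "batch 3"}
samples = [harmonize(record_a), harmonize(record_b)]

# Store the harmonized records and filter on a standardized field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (disease TEXT, tissue TEXT, organism TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (:disease, :tissue, :organism)", samples
)
rows = conn.execute(
    "SELECT tissue FROM sample WHERE lower(disease) = ? ORDER BY tissue",
    ("breast cancer",),
).fetchall()
```

Once the keys agree, both records are found by a single query on `disease`, even though neither source used that exact key and value convention. Note that the values themselves ("breast" versus "mammary gland") are still not standardized; that is a separate alignment step.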

Objective

Using an LLM agent together with domain expertise, align sample metadata fields (such as disease type and sampled tissue) from multiple databases to industry standards (such as Disease Ontology), and implement automatic optimization of the LLM prompts used for biological metadata fields, in support of database standardization and distribution.
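
As a rough illustration of what "aligning a field to a standard vocabulary" can mean, the sketch below maps free-text disease values to a term ID using exact and fuzzy string matching from the standard library. This is a simple baseline, not the LLM agent the project targets; the vocabulary is hypothetical and the `DEMO:` identifiers are placeholders, not real Disease Ontology IDs.

```python
import difflib

# Hypothetical mini-vocabulary. The IDs are placeholders, not real ontology IDs.
# The first label of each entry is treated as the canonical name.
VOCAB = {
    "DEMO:0001": ["breast cancer", "breast carcinoma"],
    "DEMO:0002": ["type 2 diabetes mellitus", "type ii diabetes", "t2d"],
    "DEMO:0003": ["lung cancer", "lung carcinoma"],
}

# Reverse index: every label or synonym points to its term ID.
LABEL_TO_ID = {
    label: term_id for term_id, labels in VOCAB.items() for label in labels
}


def align(raw_value: str, cutoff: float = 0.85):
    """Return (term_id, canonical_label), or None if no label is close enough."""
    # Normalize case and collapse repeated whitespace before matching.
    query = " ".join(raw_value.lower().split())
    if query in LABEL_TO_ID:
        match = query
    else:
        close = difflib.get_close_matches(
            query, list(LABEL_TO_ID), n=1, cutoff=cutoff
        )
        if not close:
            return None  # unresolved: hand over to the LLM agent or a curator
        match = close[0]
    term_id = LABEL_TO_ID[match]
    return term_id, VOCAB[term_id][0]
```

For example, `align("Breast  Cancer")` and `align("breast carcinom")` both resolve to the same placeholder term, while `align("healthy control")` returns `None`. Values that fall through are the natural candidates for an LLM-based step, since string similarity cannot handle abbreviations or paraphrases that are missing from the synonym list.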

Difficulty

Hard

Mentor

Linqing Feng (flq@live.com)

Output Requirements

Break the task down into steps, then design and implement an LLM-based agent that handles metadata in different formats.
Use the agent to clean the data and align it to industry standards, and design a local database to manage existing data assets.
Implement automatic optimization of LLM prompts for biological metadata fields.
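
One simple way to read "automatic optimization of LLM prompts" is to score several candidate prompt templates on a small hand-labeled set and keep the best one. The sketch below shows that selection loop under stated assumptions: the templates, the labeled pairs, and `fake_llm` are all hypothetical, and `fake_llm` is an offline stand-in so the example runs without any model or API.

```python
from typing import Callable

# Hypothetical candidate templates; {value} is the raw metadata string.
CANDIDATES = [
    "Normalize this disease name: {value}",
    "Return only the standard lowercase disease name for: {value}",
]

# Tiny hand-labeled set of (raw value, expected standardized value) pairs.
LABELED = [
    ("Breast Cancer (BRCA)", "breast cancer"),
    ("LUNG CANCER", "lung cancer"),
]


def accuracy(template: str, llm: Callable[[str], str]) -> float:
    """Fraction of labeled examples answered exactly right with this template."""
    hits = sum(
        llm(template.format(value=raw)).strip() == expected
        for raw, expected in LABELED
    )
    return hits / len(LABELED)


def best_prompt(candidates: list, llm: Callable[[str], str]) -> str:
    """Pick the template with the highest accuracy on the labeled set."""
    return max(candidates, key=lambda template: accuracy(template, llm))


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call; it only mimics two behaviors."""
    value = prompt.rsplit(": ", 1)[-1]
    if "lowercase" in prompt:
        # Pretend the more specific instruction yields a cleaned answer.
        return value.split(" (")[0].lower()
    return value  # pretend the vague instruction echoes the input


chosen = best_prompt(CANDIDATES, fake_llm)
```

In a real setting `fake_llm` would be replaced by a call to an actual model, the labeled set would be larger and held out from prompt design, and exact-match accuracy might give way to a metric based on ontology term IDs.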

Skill Requirements

Programming Language: Familiar with Python and related libraries (such as requests, json, bs4).
AI Agent Framework: Familiar with AI agent frameworks such as AutoGPT and LangChain Agent.
Database Development: Some experience with MySQL database development.
Prompt Engineering: Experience using prompt engineering to improve model performance in real applications.

DeadpanZiao / BioSampleManager

LLM-based cross-database standardization #2
