DeadpanZiao / BioSampleManager


LLM-based cross-database standardization #2

Closed DeadpanZiao closed 3 months ago

DeadpanZiao commented 4 months ago


Background

Data is one of the foundations of deep learning, and this is especially true for AI for Life Science: without high-quality, large-scale data as "fuel," the AI "engine" cannot reach its potential. Screening and collecting public data and managing local data are the first prerequisites for data quality and volume.

High-throughput omics data is an important data source for AI for Science (AI4S). Public databases alone hold nearly a hundred petabytes of raw data spread across multiple databases, and the sheer volume, complexity, and heterogeneity of these data make managing them effectively a real headache. In addition, users of biological data have specific retrieval needs, such as filtering datasets by standardized fields, so database management tools need biology-aware search capabilities. Tools such as ffq and GEOparse can fetch the metadata of public data by accession number, but that metadata is not uniform across sources, and there is currently no mature mechanism for real-time synchronization and intelligent search between local and public databases.

In this project, we are building an LLM-driven tool to support Zhejiang Lab and collaborating teams in training large biological models, bioinformatics research, and developing highly interpretable biological algorithms. We will combine LLM technology with OSS technology to provide real-time updates from public databases, automatic downloads, metadata standardization, and intelligent local data management and distribution. More importantly, we hope this work helps establish more complete industry standards for biological database management.
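To make the non-uniform metadata problem described above concrete, here is a minimal sketch of folding two differently shaped sample records into one schema. It is only an illustration under stated assumptions, not this project's implementation: the alias table `FIELD_ALIASES`, the helper names, and both example records are hypothetical. The first record imitates the "key: value" strings GEO uses for sample characteristics, the second a flat attribute dictionary.

```python
from typing import Dict, List

# Hypothetical alias table: source-specific field names -> unified field names.
FIELD_ALIASES = {
    "tissue": "tissue",
    "tissue type": "tissue",
    "organism part": "tissue",
    "disease": "disease",
    "disease state": "disease",
    "diagnosis": "disease",
}


def parse_characteristics(lines: List[str]) -> Dict[str, str]:
    """Split GEO-style 'key: value' strings into a dictionary."""
    parsed = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:  # skip strings that contain no ':' separator
            parsed[key.strip().lower()] = value.strip()
    return parsed


def unify(raw: Dict[str, str]) -> Dict[str, str]:
    """Keep only fields with a known alias, renamed to the unified schema."""
    unified = {}
    for key, value in raw.items():
        target = FIELD_ALIASES.get(key.strip().lower())
        if target is not None:
            # The first value seen for a unified field wins.
            unified.setdefault(target, value)
    return unified


# Hypothetical records from two sources describing the same kind of sample.
geo_style = parse_characteristics(
    ["tissue: liver", "disease state: hepatocellular carcinoma"]
)
biosample_style = {"Organism part": "Liver", "Diagnosis": "HCC"}

print(unify(geo_style))
print(unify(biosample_style))
```

After this step both records share the same field names, but their values ("hepatocellular carcinoma" versus "HCC") still differ, which is the part that calls for alignment to an ontology.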

Objective

Using an LLM agent together with domain expertise, align sample metadata fields (such as disease type and sampled tissue) from multiple databases to industry standards (such as the Disease Ontology), and implement features such as automatic optimization of LLM prompts for biological metadata fields, in order to support database standardization and distribution.
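As a rough illustration of what aligning a disease field to the Disease Ontology involves, the sketch below maps free-text labels to DOID identifiers using exact lookup, a small synonym table, and fuzzy string matching from the standard library. It is a baseline under stated assumptions, not the LLM agent this issue proposes: `DO_TERMS` is a three-term excerpt that should be checked against a current Disease Ontology release, `SYNONYMS` is hand-written for the example, and `map_to_do` and its `cutoff` default are hypothetical. Labels that return `None` are the cases an LLM agent or a human curator would need to resolve.

```python
import difflib
import re
from typing import Optional, Tuple

# Tiny illustrative excerpt (label -> DOID); a real run would load the full
# ontology and verify these identifiers against the current release.
DO_TERMS = {
    "breast cancer": "DOID:1612",
    "lung cancer": "DOID:1324",
    "type 2 diabetes mellitus": "DOID:9352",
}

# Hypothetical hand-written synonyms, not taken from the ontology itself.
SYNONYMS = {
    "t2dm": "type 2 diabetes mellitus",
    "type ii diabetes mellitus": "type 2 diabetes mellitus",
}


def normalize(text: str) -> str:
    """Lowercase, turn '_' and '-' into spaces, and collapse whitespace."""
    text = re.sub(r"[_\-]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def map_to_do(raw: str, cutoff: float = 0.85) -> Optional[Tuple[str, str]]:
    """Return (DOID, canonical label) for a raw label, or None if unresolved."""
    label = normalize(raw)
    label = SYNONYMS.get(label, label)
    if label in DO_TERMS:
        return DO_TERMS[label], label
    # Fall back to fuzzy matching to absorb small typos.
    close = difflib.get_close_matches(label, list(DO_TERMS), n=1, cutoff=cutoff)
    if close:
        return DO_TERMS[close[0]], close[0]
    return None  # unresolved: hand off to an LLM agent or a curator


for raw in ["Breast Cancer", "lung_cancer", "T2DM", "breast cancr", "healthy control"]:
    print(raw, "->", map_to_do(raw))
```

A high `cutoff` keeps the fuzzy step conservative, so ambiguous labels are left unresolved rather than mapped to the wrong term.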

Difficulty

Hard

Mentor

Linqing Feng (冯琳清, flq@live.com)

Output Requirements

Skill Requirements

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.