DeadpanZiao / BioSampleManager

Cloud-Edge Collaboration Platform Development #3

Closed: DeadpanZiao closed this issue 5 months ago

DeadpanZiao commented 6 months ago

Background

Data is one of the foundations of deep learning, and this is especially true for AI for Life Science: without high-quality, large-scale data as "fuel," the AI "engine" cannot reach its potential. Screening and collecting public data and managing local data are the first prerequisites for data quality and volume.

High-throughput omics data is an important data source for AI for Life Science. Public databases alone hold nearly a hundred petabytes of raw data, spread across multiple repositories. This data is very large, complex, and heterogeneous, which makes managing it effectively a real challenge. In addition, users of biological data have specific retrieval needs, such as filtering datasets by standardized fields, so database management tools need biology-aware search capabilities. Tools such as ffq and GEOparse can fetch the metadata of public datasets by accession number, but the metadata they return is not uniform, and there is currently no mature mechanism for real-time synchronization between local and public databases or for intelligent retrieval across them.

In this project, we are building an LLM-driven tool to serve Zhejiang Laboratory and its collaborating teams in work such as training large biological models, bioinformatics research, and developing highly interpretable biological algorithms. We will combine LLM and OSS technologies to provide real-time updates from public databases, automatic downloads, metadata standardization, and intelligent management and distribution of local data. More importantly, we aim through this work to establish more complete industry standards for biological database management.
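To make the metadata problem concrete, here is a minimal Python sketch of what "standardization" and "filtering by standardized fields" could look like. It is not the project's implementation: the alias table, the field names, and the functions `standardize` and `filter_records` are all hypothetical, and the field names are illustrative rather than taken from actual ffq or GEOparse output.

```python
# Hypothetical alias table: maps source-specific field names to one standard name.
# These names are illustrative assumptions, not a real schema.
FIELD_ALIASES = {
    "organism": "organism",
    "organism_ch1": "organism",
    "scientific_name": "organism",
    "library_strategy": "assay",
    "assay_type": "assay",
}


def standardize(record: dict) -> dict:
    """Rename known fields to their standard names and normalize the values."""
    out = {}
    for key, value in record.items():
        std_key = FIELD_ALIASES.get(key.strip().lower())
        if std_key is None:
            continue  # unknown fields are simply dropped in this sketch
        # Keep the first value seen for each standard field.
        out.setdefault(std_key, str(value).strip().lower())
    return out


def filter_records(records, **criteria):
    """Standardize every record, then keep those matching all criteria."""
    standardized = [standardize(r) for r in records]
    return [
        r for r in standardized
        if all(r.get(field) == wanted.lower() for field, wanted in criteria.items())
    ]
```

With this in place, records from different sources that spell the organism field differently can be queried the same way, for example `filter_records(records, organism="Homo sapiens")`. A real tool would need a curated vocabulary (and, per the proposal, possibly an LLM) rather than a hand-written alias table.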

Objective

Using OSS and SQL technologies, build APIs and cloud-edge management tools that support the management and distribution of biological database files. Features include, but are not limited to, automatic hot and cold storage management of database files and API calls that link the OSS and SQL databases for biological data.
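One way to picture the OSS-SQL link and the hot/cold management described above is a SQL table that indexes stored files and records their storage tier. The sketch below uses Python's built-in `sqlite3` and assumes that "OSS" means an object store addressed by object keys. The table layout, the function names, and the idle-time demotion rule are hypothetical, and the code only updates the index; actually moving objects between tiers would go through the storage provider's own API.

```python
import sqlite3


def create_index(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) file index linking object keys to accessions."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               object_key  TEXT PRIMARY KEY,
               accession   TEXT NOT NULL,
               tier        TEXT NOT NULL DEFAULT 'hot',
               last_access REAL NOT NULL
           )"""
    )


def register_file(conn, object_key: str, accession: str, now: float) -> None:
    """Record a newly stored object as 'hot'."""
    conn.execute(
        "INSERT OR REPLACE INTO files (object_key, accession, tier, last_access) "
        "VALUES (?, ?, 'hot', ?)",
        (object_key, accession, now),
    )


def demote_stale(conn, now: float, max_idle_seconds: float) -> list:
    """Mark hot files idle longer than max_idle_seconds as 'cold'.

    Returns the affected object keys so a caller could move them in the
    object store; that move is not performed here.
    """
    cutoff = now - max_idle_seconds
    rows = conn.execute(
        "SELECT object_key FROM files WHERE tier = 'hot' AND last_access < ? "
        "ORDER BY object_key",
        (cutoff,),
    ).fetchall()
    keys = [row[0] for row in rows]
    conn.executemany(
        "UPDATE files SET tier = 'cold' WHERE object_key = ?",
        [(key,) for key in keys],
    )
    return keys


def keys_for_accession(conn, accession: str) -> list:
    """Look up object keys and tiers for one dataset accession."""
    return conn.execute(
        "SELECT object_key, tier FROM files WHERE accession = ? ORDER BY object_key",
        (accession,),
    ).fetchall()
```

Keeping the tier in the SQL index means a query by accession can tell the caller up front whether a file is immediately available or must first be restored from cold storage.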

Difficulty

Hard

Mentor

Linqing Feng (flq@live.com)

Output Requirements

Skill Requirements

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.