apache / dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform for building high-performance, low-code workflows.
https://dolphinscheduler.apache.org/
Apache License 2.0

[Feature][Task Plugin] Support task plugins that integrate hosted large language model (LLM) services, bringing more LLM-related capabilities to DolphinScheduler #16497

Closed SEZ9 closed 1 month ago

SEZ9 commented 2 months ago

Search before asking

Description

DolphinScheduler's data scheduling tasks present two key opportunities for integrating with large language model (LLM) hosting services:

  1. Integrating services like Amazon Bedrock as a task plugin. Bedrock supports fine-tuning of LLMs, so after upstream data is orchestrated and processed, the workflow can directly invoke Bedrock for fine-tuning and model evaluation, for example in scenarios that fine-tune models such as LLaMA 3 or Claude 3 (see the sketch after this list).

  2. Leveraging LLMs' multimodal capabilities to handle unstructured data such as images and text, extracting and structuring the output for downstream processing.
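
Until a dedicated task plugin exists, the first opportunity can already be approximated with a DolphinScheduler Python (or Shell) task that calls Bedrock through boto3. The snippet below is only a minimal sketch of that idea: the job name, model name, IAM role ARN, S3 URIs, base model identifier, and hyperparameters are hypothetical placeholders, and the valid values depend on the chosen base model and AWS account setup.

```python
# Sketch: launch a Bedrock model customization (fine-tuning) job from a
# DolphinScheduler Python/Shell task and wait for it to finish, so that
# downstream tasks only run once the fine-tuned model is available.
# All names, ARNs, and S3 URIs below are placeholders.
import time
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

job = bedrock.create_model_customization_job(
    jobName="ds-finetune-demo",                       # hypothetical job name
    customModelName="ds-llama3-custom",               # hypothetical model name
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",  # placeholder
    baseModelIdentifier="meta.llama3-8b-instruct-v1:0",            # example base model
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},     # placeholder
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},           # placeholder
    hyperParameters={"epochCount": "1"},              # model-specific, illustrative
)

# Poll the customization job until it reaches a terminal state.
job_arn = job["jobArn"]
while True:
    status = bedrock.get_model_customization_job(jobIdentifier=job_arn)["status"]
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)

print("Fine-tuning finished with status:", status)
```

A proper task plugin would wrap this kind of call in the plugin's parameter model and task executor, expose the S3 paths and model choices as task parameters, and report the job status back to the workflow instead of polling in user code.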

Use case

  1. Fine-Tuning LLMs with Amazon Bedrock. Use case: Automate the fine-tuning of large language models (e.g., LLaMA 3, Claude 3) using Amazon Bedrock. Scenario: After processing and orchestrating upstream data, DolphinScheduler triggers a task that uses Bedrock's fine-tuning service. The task fine-tunes the LLM on specific datasets and performs model evaluation, all within a seamless workflow.
  2. Multimodal Data Processing. Use case: Process and structure unstructured multimodal data using LLMs. Scenario: DolphinScheduler integrates LLMs to handle unstructured data such as images, text, and videos. The LLM processes this data, extracting meaningful information and converting it into structured formats for downstream applications such as databases or analytical tools (see the sketch after this list).
  3. Automated Content Moderation. Use case: Implement content moderation workflows that use LLMs to analyze and filter content. Scenario: Content from various sources (text, images, videos) is scheduled for moderation tasks. DolphinScheduler orchestrates these tasks, and LLMs analyze the content, detect inappropriate material, and flag or remove it according to predefined rules.
  4. Real-Time Data Enrichment. Use case: Enhance real-time data streams with contextual information using LLMs. Scenario: DolphinScheduler orchestrates data streams from IoT devices, social media, or other sources. LLMs enrich this data with additional context, such as sentiment analysis or object recognition, before forwarding it to real-time analytics systems.
  5. Automated Document Processing. Use case: Streamline the processing of large volumes of documents using LLMs. Scenario: Documents such as contracts, reports, or emails are ingested by DolphinScheduler and passed to LLMs for processing. The LLMs extract key information, summarize content, and categorize documents, automating tasks like compliance checks or data entry.
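
As one concrete illustration of use case 2, the sketch below shows how a Python task could already call a multimodal model through the Bedrock Runtime Converse API and hand structured output to the next task in the workflow. The model ID, input image, and prompt are illustrative placeholders and not part of this proposal.

```python
# Sketch: send an image plus an extraction prompt to a multimodal model on
# Bedrock and print the structured text it returns. Placeholders throughout.
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("invoice.jpg", "rb") as f:   # placeholder input document image
    image_bytes = f.read()

response = runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",   # example model ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
            {"text": "Extract the vendor, date, and total amount as JSON."},
        ],
    }],
)

# Downstream tasks could persist this text (e.g., to S3 or a database)
# and continue the workflow with the structured result.
structured = response["output"]["message"]["content"][0]["text"]
print(structured)
```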

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has had no activity for 30 days. It will be closed within the next 7 days if no further activity occurs.

github-actions[bot] commented 1 month ago

This issue has been closed because it has not received a response for too long. You can reopen it if you encounter similar problems in the future.