Closed by Gautam-Rajeev 1 year ago
Hi @ChakshuGautam, I am interested in this project. Should I start working on it directly? Also, the current product setup link is not working.
It would be good to share how you are planning the solution here itself - you can also DM me to brainstorm and then write up your thoughts here.
@ChakshuGautam I thought about it, and here is the plan. First, we can create multiple variants of the input prompt using prompt-engineering techniques. We can then use these as prefixes for a fine-tuned GPT-3 to generate n samples, and use cosine similarity to pick the m best ones for the next stage. Next, we take the user's choice of model and train it on that data. Finally, we can build two Flask APIs, one for data generation and one for model training, which the React frontend would consume. I also DMed you on Discord with some queries.
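For illustration, here is a minimal sketch of the cosine-similarity selection step described above. It assumes TF-IDF vectors and compares each generated candidate against the original prompt; in practice, embeddings from the fine-tuned model could be dropped in instead. The function and variable names are illustrative, not something already in the repo.

```python
# Illustrative sketch: pick the m generated samples most similar to the
# original prompt. TF-IDF is used for simplicity; an embedding model could
# be substituted without changing the shape of the code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_top_m(reference_prompt: str, candidates: list[str], m: int) -> list[str]:
    """Return the m candidate texts most similar to the reference prompt."""
    vectorizer = TfidfVectorizer()
    # Fit on the reference plus all candidates so they share one vocabulary.
    matrix = vectorizer.fit_transform([reference_prompt] + candidates)
    # Row 0 is the reference; rows 1..n are the candidates.
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:m]]


if __name__ == "__main__":
    prompt = "Classify a customer complaint about delayed delivery."
    generated = [
        "Label this complaint about a late shipment.",
        "Write a poem about the ocean.",
        "Categorise a grievance regarding a delayed parcel.",
    ]
    print(select_top_m(prompt, generated, m=2))
```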
Carried over here https://github.com/Samagra-Development/ai-tools/issues/144
Features to be implemented
LLM-driven model training pipeline is a project that aims to create a pipeline for training models using Large Language Model (LLM) prompts as input. The pipeline will cover synthetic data generation from a user-supplied prompt and training of a user-selected model on that data.
The project will also include the development of a user interface (UI) to track the progress of the pipeline across the data creation and model training stages, along with relevant metrics such as the number of rows created, the number of epochs trained, and the accuracy achieved.
How it works
The LLM-driven model training pipeline involves the following steps (minimal sketches of steps 2-4 follow the list):

1. LLM Prompt Input: The user provides an LLM prompt as input, which serves as the basis for generating synthetic data.
2. Synthetic Data Generation: Based on the provided prompt, the pipeline generates the synthetic data required for training the model, for example through data augmentation or text generation with an LLM.
3. Model Training: The pipeline trains the specified model on the generated synthetic data. This can be done with various machine learning frameworks or libraries.
4. User Interface (UI): A user interface tracks the progress of the pipeline, with real-time updates on the data creation stage (number of rows created) and the model training stage (number of epochs trained and accuracy achieved).
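A minimal sketch of step 2 is below, assuming a locally runnable Hugging Face text-generation model ("gpt2") as a stand-in for the fine-tuned GPT-3 mentioned in the plan; the function name and prompt are illustrative.

```python
# Illustrative sketch of step 2: generate n synthetic rows from an LLM prompt.
# "gpt2" stands in for the fine-tuned GPT-3 model mentioned in the plan above.
from transformers import pipeline


def generate_synthetic_rows(prompt: str, n: int, max_new_tokens: int = 60) -> list[str]:
    """Generate n synthetic text rows conditioned on the user's prompt."""
    generator = pipeline("text-generation", model="gpt2")
    outputs = generator(
        prompt,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        do_sample=True,          # sampling gives varied rows rather than n copies
        pad_token_id=50256,      # GPT-2's EOS token id, silences the padding warning
    )
    # Strip the prompt prefix so only the generated continuation is kept.
    return [out["generated_text"][len(prompt):].strip() for out in outputs]


if __name__ == "__main__":
    for row in generate_synthetic_rows("Customer complaint: ", n=5):
        print(row)
```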
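For step 3, here is a minimal sketch that treats the synthetic data as (text, label) pairs and trains a scikit-learn classifier; scikit-learn is just one of the "various machine learning frameworks" left open above, and the function name is a placeholder.

```python
# Illustrative sketch of step 3: train a model on synthetic (text, label) pairs
# and report the accuracy metric surfaced in the UI.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def train_on_synthetic(rows: list[tuple[str, str]]) -> tuple[object, float]:
    """Train a simple text classifier and return it with its held-out accuracy."""
    texts, labels = map(list, zip(*rows))
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(x_train, y_train)
    accuracy = model.score(x_test, y_test)
    return model, accuracy
```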
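Finally, a minimal sketch of how the Flask side of step 4 could expose progress to the React UI mentioned in the plan; the route, field names, and in-memory status dict are placeholders, and a real pipeline would update this state from the data-generation and training workers.

```python
# Illustrative sketch of step 4: a Flask endpoint the React UI could poll
# for pipeline progress. Route and field names are placeholders.
from flask import Flask, jsonify

app = Flask(__name__)

# In a real pipeline this state would be updated by the data-generation
# and training workers (e.g. via a database or a shared task queue).
pipeline_status = {
    "stage": "data_generation",   # or "training", "finished"
    "rows_created": 0,
    "epochs_trained": 0,
    "accuracy": None,
}


@app.route("/status")
def status():
    """Return the current progress of the pipeline as JSON."""
    return jsonify(pipeline_status)


if __name__ == "__main__":
    app.run(port=5000)
```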