This repository serves as the main organizational tool for the survey paper "A Survey of Methods for Generating Quality and Diverse Synthetic Data with LLMs". We are collecting papers as GitHub issues with the tag Paper. To add a new paper, first check that it is not already present, then fill out the new paper issue template here. To close the issue, you (or someone else) can make a PR containing a report on the paper using the provided format here. You can find a roadmap for the project on this GitHub Projects board. Weekly meeting notes and recordings are housed here.
The aim of this project is to catalog the many current ad hoc methods for synthetic data generation via LLMs with a focus on understanding their impact on two metrics: dataset quality and dataset diversity. Ideally, this can be done under a single conceptual framework. An important sub-question we will need to discuss is how to appropriately define these metrics, in particular dataset diversity.
Overall, this will roughly consist of three stages:
Some important questions we will want to think about addressing:
Meetings are at 5:30 PM EST on Thursdays. Email alexdahoas@gmail.com or DM Alex Havrilla on Discord for access.