Dahoas / QDSyntheticData


Proposal: Integration of Visualization Tool for Datasets and Metric understanding #176

Open Mistobaan opened 4 months ago

Mistobaan commented 4 months ago

Overview

As we read more papers and their proposed metrics, it becomes harder to evaluate and compare the different strategies empirically. I believe a visualization tool for exploring the datasets through the lens of each metric would help us gain deeper insight and could foster new ideas.

As a proof of concept, I created a simple UMAP projection of GSM8K using nomic-text-v1.5 embeddings.

[image: UMAP projection of GSM8K embeddings]
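For anyone who wants to try something similar, here is a minimal sketch of the embed-then-project step. It is not the exact code behind the plot above. It assumes the `sentence-transformers`, `umap-learn` and `datasets` packages, the Hugging Face model id `nomic-ai/nomic-embed-text-v1.5` (my guess at what "nomic-text-v1.5" refers to), the `clustering: ` task prefix, and the `openai/gsm8k` dataset id with a `question` column. The function names are hypothetical.

```python
def embed_texts(texts, model_name="nomic-ai/nomic-embed-text-v1.5"):
    """Embed a list of strings. The model id is an assumption, not confirmed here."""
    from sentence_transformers import SentenceTransformer  # heavy import, kept lazy

    model = SentenceTransformer(model_name, trust_remote_code=True)
    # Nomic embedding models expect a task prefix; "clustering: " is assumed to fit.
    return model.encode([f"clustering: {t}" for t in texts])


def project_2d(embeddings, reducer=None):
    """Reduce embeddings to 2-D with any object exposing fit_transform (UMAP by default)."""
    if reducer is None:
        import umap  # provided by the umap-learn package

        reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    return reducer.fit_transform(embeddings)


def demo(n_samples=1000):
    """Hypothetical end-to-end run on GSM8K questions; downloads data and a model."""
    from datasets import load_dataset

    # Dataset id and column name are assumptions about the Hugging Face copy of GSM8K.
    questions = load_dataset("openai/gsm8k", "main", split="train")["question"]
    return project_2d(embed_texts(questions[:n_samples]))
```

Calling `demo()` returns an array of shape `(n_samples, 2)` that can be passed to any scatter plot. The reducer is injectable so that PCA or t-SNE from scikit-learn can be swapped in for comparison.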

The plan is to extend this work to the tiny-gsm dataset and overlay the original GSM8K on the synthetically generated data, to get a first look at the "data diversity" of the original dataset versus the synthetic one.
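One possible way to do the overlay, sketched under assumptions: fit a single projection on both embedding sets together so they share the same 2-D space, then split the coordinates back out and plot the original points on top. Fitting jointly is a choice made for this sketch (fitting on the synthetic set and transforming the original one is another option). The function names, colors and output path are hypothetical, and `umap-learn` and `matplotlib` are assumed to be installed.

```python
def overlay_projection(original_emb, synthetic_emb, reducer=None):
    """Project both sets with one shared reducer; returns (original_xy, synthetic_xy)."""
    if reducer is None:
        import umap  # provided by the umap-learn package

        reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    # Synthetic rows first, original rows last, so a single slice separates them.
    combined = list(synthetic_emb) + list(original_emb)
    coords = reducer.fit_transform(combined)
    n_syn = len(synthetic_emb)
    return coords[n_syn:], coords[:n_syn]


def plot_overlay(original_xy, synthetic_xy, path="overlay.png"):
    """Scatter the synthetic points as background and the original points on top."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 8))
    sx, sy = zip(*synthetic_xy)
    ox, oy = zip(*original_xy)
    ax.scatter(sx, sy, s=2, alpha=0.3, color="lightgray", label="synthetic")
    ax.scatter(ox, oy, s=4, alpha=0.8, color="tab:red", label="original GSM8K")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path
```

Regions where gray points appear without red ones would suggest the synthetic data covers areas the original does not, and the reverse would suggest gaps. This is only a visual heuristic, since UMAP does not preserve global distances.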

Dahoas commented 4 months ago

Nice! I agree that visualization will be helpful for building some level of intuition. Is this a feature you plan to implement yourself? If not, can you make a small PR sharing the code you used to produce the visual you shared?