Open Mistobaan opened 6 months ago
Nice! I agree visualization will be helpful for building some level of intuition. Is this a feature you plan to implement yourself? If not, can you make a small PR sharing the code you used to produce the shared visual?
Overview
As we read different papers and their proposed metrics, it becomes challenging to empirically evaluate and compare the different strategies. I believe that a visualization tool for exploring the datasets through the lens of each metric would help us gain deeper insight and foster novel ideas.
As a proof of concept, I created a simple UMAP projection of the GSM8K dataset using nomic-text-v1.5 embeddings.
The plan is to extend this work to the tiny-gsm dataset and map the original gsm8k on top of the synthetically generated one, to get a first look at "data diversity" between the original dataset and the synthetic one.
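To make the overlay idea concrete, here is a minimal sketch of one way it could work, not the code behind the proof of concept: fit the 2D projection on the synthetic embeddings only, then push both datasets through that same fitted projection so the points share one coordinate system. It assumes the embeddings are already computed and that the reducer exposes `fit` and `transform` the way umap-learn's `umap.UMAP` does. The names `overlay_projection` and `FirstTwoDims` are hypothetical; `FirstTwoDims` is only a stand-in so the sketch runs without extra dependencies.

```python
from typing import Dict, List, Sequence, Tuple

Embedding = Sequence[float]
Point = Tuple[float, ...]


def overlay_projection(
    reducer,
    synthetic_emb: Sequence[Embedding],
    original_emb: Sequence[Embedding],
) -> Dict[str, List[Point]]:
    """Project both datasets into the space learned from the synthetic set.

    `reducer` is assumed to be any object with fit(X) and transform(X),
    for example umap.UMAP from umap-learn.
    """
    # Learn the low-dimensional mapping from the synthetic data only.
    reducer.fit(synthetic_emb)
    # Reuse the same fitted mapping for both sets so they are comparable.
    return {
        "synthetic": [tuple(p) for p in reducer.transform(synthetic_emb)],
        "original": [tuple(p) for p in reducer.transform(original_emb)],
    }


class FirstTwoDims:
    """Hypothetical stand-in reducer: keeps the first two coordinates."""

    def fit(self, X):
        return self

    def transform(self, X):
        return [list(row[:2]) for row in X]


# Toy embeddings, invented purely for illustration.
points = overlay_projection(
    FirstTwoDims(),
    synthetic_emb=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    original_emb=[[0.7, 0.8, 0.9]],
)
```

With umap-learn installed, passing `umap.UMAP(n_components=2, random_state=42)` in place of `FirstTwoDims()` should give real UMAP coordinates. The two lists can then be drawn as two differently colored scatter layers, so regions covered by only one dataset stand out.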