Identify the Pareto-optimal tradeoff frontier for quality and diversity.
Identify the optimal QD hyperparameters for downstream fine-tuning performance.
Procedure:
1. Fix a large synthetically generated dataset S.
2. Fix a quality metric and a diversity metric.
3. Fix a sample budget N and subsample S to produce training datasets T_1, ..., T_k with varying quality and diversity.
4. Fine-tune a pre-trained model M separately on each of T_1, ..., T_k and record test performance.
5. Repeat for multiple sample budgets N_1, ..., N_l.
6. Identify the optimal QD parameters for fine-tuning at each sample budget, then try to fit a functional form that predicts the optimal parameters for new sample budgets (a QD scaling law?).
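The subsampling step above could be sketched as follows. Here `quality` is a placeholder scoring function and `alpha` is an assumed knob that interpolates between uniform sampling (high diversity) and greedy top-N-by-quality; both are illustrative assumptions, not a fixed recipe.

```python
import random

def subsample(S, N, quality, alpha, seed=0):
    """Draw an N-example training set from S, biased toward high quality.

    alpha=0 -> uniform subsample of S (diversity-heavy);
    alpha=1 -> greedy top-N by quality (quality-heavy).
    """
    rng = random.Random(seed)
    ranked = sorted(S, key=quality, reverse=True)
    n_top = int(alpha * N)                  # take the best n_top greedily...
    chosen = ranked[:n_top]
    chosen += rng.sample(ranked[n_top:], N - n_top)  # ...fill the rest uniformly
    return chosen

# Sweep alpha to get T_1, ..., T_k with varying quality/diversity.
S = list(range(1000))                       # stand-in corpus; quality = value
datasets = [subsample(S, N=100, quality=lambda x: x, alpha=a)
            for a in (0.0, 0.5, 1.0)]
```

Each resulting dataset would then be used to fine-tune M, with test performance recorded per (alpha, N) cell.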
For the diversity metric, try defining equivalence of two solutions via their order of arithmetic operations. The diversity of a dataset is then its number of unique solutions under this equivalence.
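A minimal sketch of that metric, assuming solutions are plain-text worked answers: canonicalize each solution to its ordered sequence of operators and count unique signatures. The regex-based canonicalization is a simplifying assumption; a real implementation might parse each step instead.

```python
import re

def op_signature(solution: str) -> tuple:
    """Reduce a solution to its ordered sequence of arithmetic operators.

    Two solutions are treated as equivalent iff they apply
    +, -, *, / in the same order.
    """
    return tuple(re.findall(r"[+\-*/]", solution))

def diversity(dataset) -> int:
    """Diversity = number of unique solutions up to operation order."""
    return len({op_signature(sol) for sol in dataset})

sols = [
    "3 + 4 = 7; 7 * 2 = 14",
    "5 + 1 = 6; 6 * 3 = 18",   # same op order as above -> equivalent
    "2 * 3 = 6; 6 + 1 = 7",    # different order -> distinct
]
# diversity(sols) -> 2
```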
Some more food for thought:
It might be nice to have distinct testing regimes: an in-distribution test and an OOD test. Hypothesis: higher quality will correlate with better in-distribution test performance, and higher diversity will correlate with better OOD performance.
Hi @Dahoas,
I was checking this issue. It involves a couple of fine-tuning/training tasks that I already have the code for. I am interested in taking this one up.