Proposal: a Validate action that lets me create human ratings on papers for a given column without seeing the model's rating, then use these ratings to check how well the model is doing and show the results in context.
You are given papers that have not previously been validated; we store your ratings as ground truth and use them in downstream performance adjustments (e.g., DSL).
Highlight poorly performing columns with some kind of coloring, and on hover show details about the performance metric (F1, R^2, etc.) and its score.
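A minimal sketch of how per-column scoring could work, assuming pandas/scikit-learn, the `{column} TRUTH` naming convention described below for human ratings, and an arbitrary placeholder threshold for flagging a column as bad:

```python
import pandas as pd
from sklearn.metrics import f1_score, r2_score

def score_column(df: pd.DataFrame, column: str, truth_suffix: str = " TRUTH") -> dict:
    """Compare a measured column against its human truth column.

    Only rows with a nonblank truth value are scored. The " TRUTH" suffix
    and the 0.5 flagging threshold are assumptions for illustration.
    """
    truth_col = column + truth_suffix
    validated = df[df[truth_col].notna()]            # only rows with ground truth
    y_true, y_pred = validated[truth_col], validated[column]

    if pd.api.types.is_numeric_dtype(y_true):
        metric, score = "R^2", r2_score(y_true, y_pred)                   # numeric features
    else:
        metric, score = "F1", f1_score(y_true, y_pred, average="macro")   # categorical features

    return {"column": column, "metric": metric, "score": score, "bad": score < 0.5}
```

The returned dict is what the hover tooltip would surface: the metric name, its score, and whether the column should be highlighted.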
Here are some more details on how the types of truth and validation might work...
There are two value types: *true* and *measurement*. True is from a researcher and is considered 100% valid. A measurement is from a feature provider, e.g., GPT, and is what is validated against true. In this way, validation effectively reports measurement error.

We add a truth column (e.g., `participant_source TRUTH`) for each validated column and fill in scores for nonblank validated items. New truth values overwrite old ones, but by default we keep them and assume they remain true even if the feature version or provider changes.
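For illustration, a small sketch of that truth-column layout and an overwrite helper (the paper IDs, values, and helper name are hypothetical):

```python
import pandas as pd

# Hypothetical layout: each measured column gets a sibling "<name> TRUTH" column
# holding researcher ratings; blanks mean the row has not been validated yet.
papers = pd.DataFrame({
    "paper_id": ["p1", "p2", "p3"],
    "participant_source": ["MTurk", "students", "Prolific"],   # provider (e.g., GPT) measurement
    "participant_source TRUTH": ["MTurk", None, "Prolific"],   # human ratings collected so far
})

def record_truth(df: pd.DataFrame, column: str, paper_id: str, value) -> None:
    """Store a new human rating, overwriting any earlier truth for that cell.

    Truth values for other rows are left in place, matching the default of
    keeping old truths even when the feature version or provider changes.
    """
    df.loc[df["paper_id"] == paper_id, column + " TRUTH"] = value

record_truth(papers, "participant_source", "p2", "students")
```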
Fundamentally, we only know whether a feature is good if we can compare our result with some other (presumably more trustworthy) result, e.g., a human ground-truth rating.

We need to build in a system for checking and reporting quality so that users can quickly know what to trust and think about how to improve it.
Caveat: we may be able to aggregate answers across sources in some cases to validate columns.
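One way that cross-source aggregation might look, assuming multiple provider columns per feature; the majority-vote rule, agreement threshold, and provider names are hypothetical:

```python
import pandas as pd

# Hypothetical: several sources rate the same feature; a consensus value can
# stand in for truth on rows where the sources agree strongly enough.
ratings = pd.DataFrame({
    "gpt":    ["yes", "no", "yes"],
    "claude": ["yes", "no", "no"],
    "gemini": ["yes", "no", "yes"],
})

def consensus(row: pd.Series, min_agreement: float = 2 / 3) -> object:
    """Return the majority label if enough sources agree, else leave it blank."""
    counts = row.value_counts()
    top_label, top_count = counts.index[0], counts.iloc[0]
    return top_label if top_count / len(row) >= min_agreement else pd.NA

ratings["consensus"] = ratings.apply(consensus, axis=1)
```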