Watts-Lab / atlas

The product of all our research cartography
https://atlas.seas.upenn.edu
GNU Affero General Public License v3.0

Set up feature Validation #80

Open markwhiting opened 3 months ago

markwhiting commented 3 months ago

Fundamentally, we only know a feature is good if we can compare our result with some other (presumably more trustworthy) result, e.g., a human ground-truth rating.

We need to build in a system for checking and reporting quality so that users can quickly know what to trust and how they might improve it.

Caveat: we may be able to aggregate answers across sources in some cases to validate columns.
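A minimal sketch of what that aggregation might look like: take each source's answer for a column and treat a strict-majority winner as a pseudo-ground-truth value. The provider names and the dict-based data shape are illustrative assumptions, not the atlas data model.

```python
from collections import Counter

def aggregate_sources(answers: dict[str, str]) -> str | None:
    """Aggregate one column's answers across feature providers,
    e.g., {"gpt-4": "yes", "claude": "yes", "gemini": "no"} (hypothetical names).

    Returns the strict-majority answer as a pseudo-ground-truth value,
    or None when no answer wins a majority.
    """
    if not answers:
        return None
    (value, count), = Counter(answers.values()).most_common(1)
    return value if count > len(answers) / 2 else None
```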

markwhiting commented 3 months ago

Proposal: a Validate action that lets me create human ratings on papers for a given column without seeing the model's rating, then uses these to check how well the model is doing and shows the results in context.

You are given papers that have not previously been validated; we store these ratings as ground truth and use them in downstream performance adjustments (e.g., DSL).
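A minimal sketch of that flow, assuming papers are dicts with a per-column `"truth"` mapping (a hypothetical shape, not the actual atlas schema):

```python
import random

def papers_to_validate(papers: list[dict], column: str, k: int = 10) -> list[dict]:
    """Pick up to k papers with no human TRUTH value for `column`,
    so the rater produces ratings blind to the model's output."""
    unvalidated = [p for p in papers if column not in p.get("truth", {})]
    random.shuffle(unvalidated)  # spread validation effort across the corpus
    return unvalidated[:k]

def record_truth(paper: dict, column: str, rating) -> None:
    """Store the blind human rating as ground truth; downstream estimators
    (e.g., DSL-style adjustment) can then correct for measurement error."""
    paper.setdefault("truth", {})[column] = rating
```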

Highlight poorly performing columns with some kind of coloring, and on hover show details about the performance metric ($F_1$, $R^2$, etc.) and score.
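One way the highlighting could work, sketched below; the thresholds, color names, and tooltip text are placeholders, not decided values:

```python
def column_highlight(metric_name: str, score: float) -> dict:
    """Map a column's validation score to a highlight color plus hover text.
    Threshold values here are illustrative placeholders."""
    if score >= 0.8:
        color = "green"
    elif score >= 0.5:
        color = "yellow"
    else:
        color = "red"
    return {
        "color": color,
        "tooltip": f"{metric_name} = {score:.2f} vs. human ground truth",
    }
```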

markwhiting commented 1 month ago

Here are some more details on how the types of truth and validation might work...

  1. Let's use only two types of data: true and measurement. True is from a researcher and is considered 100% valid. A measurement is from a feature provider, e.g., GPT, and is what is validated against true. In this way, validation effectively reports measurement error.
  2. For items with truth, we want to use the appropriate metric to check how good the measurement is: if the item is numerical, $R^2$; if the item is categorical, unbiased multiclass $F_1$; and if the item is verbal, a GPT comparison, e.g., "Do these things seem similar: yes or no?" (see the metric sketch after this list).
  3. Validation looks like: 1) download a view as CSV; 2) create a new validation column, e.g., `participant_source TRUTH`, for each validated column; 3) fill in scores for nonblank validated items (see the CSV sketch after this list). New truth values overwrite old ones, but by default we keep them and assume they remain true even if the feature version or provider changes.
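A sketch of the metric dispatch in point 2. Macro-averaged $F_1$ is used here as one reading of "unbiased multiclass $F_1$", and the GPT comparison is a stub; both are assumptions to be replaced by whatever we settle on.

```python
from sklearn.metrics import f1_score, r2_score

def validation_score(item_type: str, truth: list, measured: list) -> float:
    """Score a measurement column against its TRUTH column,
    picking the metric by item type."""
    if item_type == "numerical":
        return r2_score(truth, measured)  # R^2
    if item_type == "categorical":
        # Macro F1 as a stand-in for "unbiased multiclass F1".
        return f1_score(truth, measured, average="macro")
    if item_type == "verbal":
        # One LLM yes/no similarity judgment per pair; report the agreement rate.
        votes = [gpt_seems_similar(t, m) for t, m in zip(truth, measured)]
        return sum(votes) / len(votes)
    raise ValueError(f"unknown item type: {item_type}")

def gpt_seems_similar(a: str, b: str) -> bool:
    """Placeholder for the proposed GPT comparison
    ('Do these things seem similar: yes or no?')."""
    raise NotImplementedError("wire this to the feature provider's API")
```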
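And a sketch of the CSV workflow in point 3, using pandas. The filename is hypothetical, and it assumes the downloaded view already contains a `participant_source` measurement column; `validation_score` is the sketch above.

```python
import pandas as pd

view = pd.read_csv("atlas_view.csv")        # 1) download a view as CSV
view["participant_source TRUTH"] = pd.NA    # 2) one TRUTH column per validated column
# ...the researcher fills in truth values for the rows they validate...
validated = view.dropna(subset=["participant_source TRUTH"])  # 3) nonblank items only
score = validation_score(
    "categorical",
    validated["participant_source TRUTH"].tolist(),
    validated["participant_source"].tolist(),
)
```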