Watts-Lab / commonsense-platform

Commonsense platform
https://commonsense.seas.upenn.edu

PQ variability and sensitivity evaluation #69

Closed: markwhiting closed this issue 9 months ago

markwhiting commented 11 months ago

How repeatable is a given P or Q? That is to say, which of population or corpus plays a bigger role in predicting a new PQ point?

One way to trial this would be to sample a population on two different corpus selections, and then sample two different populations on the same corpus, and see how these differences compare.

markwhiting commented 11 months ago

We probably want to check each of these cells to ensure that we aren't making bad assumptions about how the experiment will generalize.

|                        | same population          | different population         |
| ---------------------- | ------------------------ | ---------------------------- |
| same design point      | checks for repeatability | design point reliability     |
| different design point | people's reliability     | actual integrative sampling  |
markwhiting commented 11 months ago
  1. Recruit $N$ people in $Q$ and give all of them design points in $P$ in entirely randomized statement order.
  2. Randomly split the population into 2 groups ($q, \neg q$) and the sampled statements into 2 design-point sets ($p, \neg p$) (perhaps try many randomizations to see the mean effect).
  3. Check $PQ(q,p) \sim PQ(q,\neg p)$: how much does the population predict the outcome on a different design point? We would like this $R^2$ to be low.
  4. Check $PQ(q,p) \sim PQ(\neg q,p)$: how much does the design point predict the outcome for a different population? We would like this $R^2$ to be high.
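A minimal sketch of checks 2–4, assuming PQ reduces to per-participant agreement rates when the population is held fixed (check 3) and to per-statement agreement rates when the design point is held fixed (check 4); the column names, toy data, and that reading of PQ are all assumptions, not platform code:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical long-format responses: one row per rating.
df = pd.DataFrame({
    "participant": rng.integers(0, 100, 8000),
    "design_point": rng.choice(["p", "not_p"], 8000),
    "statement": rng.integers(0, 40, 8000),
    "agree": rng.integers(0, 2, 8000),
})

def r2(x, y):
    """R^2 of a simple regression after aligning the two series on their index."""
    x, y = x.align(y, join="inner")
    return stats.linregress(x, y).rvalue ** 2

participants = df["participant"].unique()
r2_step3, r2_step4 = [], []
for _ in range(100):  # step 2: repeat the random split to see the mean effect
    q = rng.choice(participants, size=len(participants) // 2, replace=False)
    d = df.assign(grp=np.where(df["participant"].isin(q), "q", "not_q"))

    # Step 3: same population q, different design points, aligned per participant.
    # Want LOW R^2 -- the design point, not the person, should drive the outcome.
    by_person = d[d["grp"] == "q"].groupby(["design_point", "participant"])["agree"].mean()
    r2_step3.append(r2(by_person.loc["p"], by_person.loc["not_p"]))

    # Step 4: same design point p, different populations, aligned per statement.
    # Want HIGH R^2 -- the design point should generalize across populations.
    by_stmt = d[d["design_point"] == "p"].groupby(["grp", "statement"])["agree"].mean()
    r2_step4.append(r2(by_stmt.loc["q"], by_stmt.loc["not_q"]))

print(f"step 3 (want low):  mean R^2 = {np.mean(r2_step3):.2f}")
print(f"step 4 (want high): mean R^2 = {np.mean(r2_step4):.2f}")
```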
markwhiting commented 11 months ago

Might also want to test how different sets of statements drawn from a single design point in $P$ are or are not repeatable, with a similar design.

amirrr commented 11 months ago

What would the treatments for this case look like?

markwhiting commented 11 months ago

I think the treatment is rather simple:

  1. everyone (recruited from Turk) responds to 30 statements.
  2. the statements are sampled from 2 design points, e.g., A and B; from each of A and B we sample 20 statements as 2 groups of 10 (i.e., A1, A2, B1, B2).
  3. two treatments: T1 gets A1, A2, and B1; T2 gets A1, B1, and B2 (see the sketch after this list).
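A sketch of that assignment under the sizes above; the statement pools and ID scheme are hypothetical:

```python
import random

random.seed(1)

# Hypothetical statement pools for the two design points.
A = [f"A-{i}" for i in range(50)]
B = [f"B-{i}" for i in range(50)]

def split_sample(pool, k=10, groups=2):
    """Sample groups*k statements from one design point and split into groups."""
    drawn = random.sample(pool, k * groups)
    return [drawn[i * k:(i + 1) * k] for i in range(groups)]

A1, A2 = split_sample(A)
B1, B2 = split_sample(B)

treatments = {
    "T1": A1 + A2 + B1,  # two sets from A, one from B
    "T2": A1 + B1 + B2,  # one set from A, two from B
}

def assign(participant_id):
    """Alternate participants across treatments; fully randomize statement order."""
    name = "T1" if participant_id % 2 == 0 else "T2"
    statements = treatments[name][:]
    random.shuffle(statements)
    return name, statements

name, statements = assign(0)
print(name, len(statements))  # -> T1 30
```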

In analysis we do:

  1. between groups: T1 ~ T2 on the shared sets A1 and B1 (we want this to be the same).
  2. within subjects (or group): the two sets a participant saw from the same design point, i.e., A1 ~ A2 in T1 and B1 ~ B2 in T2 (we want this to be the same).
  3. within subjects (or group): Ax ~ Bx for whichever sets a participant saw (we want this to be different). A sketch of these comparisons is below.
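One way those three comparisons could look, with a hypothetical responses table and placeholder test statistics (participant-level agreement rates compared with t-tests); the real analysis would presumably use whatever outcome the platform's PQ analysis uses:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical responses (in real data, T1 would only contain A1/A2/B1 and
# T2 only A1/B1/B2; this toy draw ignores that for brevity).
df = pd.DataFrame({
    "participant": rng.integers(0, 200, 12000),
    "treatment": rng.choice(["T1", "T2"], 12000),
    "stmt_set": rng.choice(["A1", "A2", "B1", "B2"], 12000),
    "agree": rng.integers(0, 2, 12000),
})

# Participant-level agreement rate per statement set.
rate = df.groupby(["treatment", "stmt_set", "participant"])["agree"].mean()

def between(x, y):
    """Welch t-test for two independent participant groups (placeholder stat)."""
    return stats.ttest_ind(x, y, equal_var=False).pvalue

def within(x, y):
    """Paired t-test on participants who rated both sets (placeholder stat)."""
    x, y = x.align(y, join="inner")
    return stats.ttest_rel(x, y).pvalue

# 1. Between groups: T1 ~ T2 on the shared sets (want: same).
for s in ["A1", "B1"]:
    print(f"T1~T2 on {s}: p = {between(rate.loc['T1', s], rate.loc['T2', s]):.2f}")

# 2. Within group: the two sets drawn from one design point (want: same).
print(f"A1~A2 in T1: p = {within(rate.loc['T1', 'A1'], rate.loc['T1', 'A2']):.2f}")
print(f"B1~B2 in T2: p = {within(rate.loc['T2', 'B1'], rate.loc['T2', 'B2']):.2f}")

# 3. Within group: an A set vs a B set (want: different).
print(f"A1~B1 in T1: p = {within(rate.loc['T1', 'A1'], rate.loc['T1', 'B1']):.2f}")
```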

I think the design points should be as in the attached screenshot (Screenshot 2023-08-04 at 10 50 49).

I think we should try the new statements to see how they go?

markwhiting commented 10 months ago

A figure for this might have something like:

  1. facets on x for design point as sampled, e.g., A1, B1, A2, B2.
  2. facets on y for treatment, e.g., T1, T2.
  3. otherwise similar to the design point plots in https://observablehq.com/@wattslab/common-sense-platform-analysis#cell-2276
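A rough matplotlib/seaborn version of that layout, with hypothetical column names; each panel's content here is just a stand-in for the linked design-point plots:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)

# Hypothetical per-statement results keyed by sampled set and treatment.
df = pd.DataFrame({
    "stmt_set": rng.choice(["A1", "B1", "A2", "B2"], 2000),
    "treatment": rng.choice(["T1", "T2"], 2000),
    "agreement": rng.random(2000),
})

# Facets on x for the design point as sampled, facets on y for the treatment.
g = sns.FacetGrid(df, col="stmt_set", row="treatment",
                  col_order=["A1", "B1", "A2", "B2"], row_order=["T1", "T2"])
g.map(sns.histplot, "agreement")  # stand-in for the actual design-point plot
g.savefig("pq_facets.png")
```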

We could also do a summary comparison, e.g., AUC (area under the curve), where the curve in this case is the front of largest cliques. Perhaps then make a table grouped by statement sample and treatment (A1.. and T1...) with the AUC for each row:

| Treatment | Statements | AUC |
| --------- | ---------- | --- |
| 1         | A1         | .5  |
| 1         | B1         | .8  |
| ...       | ...        | ... |
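A sketch of how that table could be produced, assuming each (treatment, statement set) cell yields a front as sorted (x, y) points; the front() helper here is a random stand-in for the real largest-clique front:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

def front(treatment, stmt_set, n=20):
    """Hypothetical stand-in for the largest-clique front of one
    (treatment, statement set) cell: monotone (x, y) points on [0, 1]."""
    x = np.linspace(0, 1, n)
    y = np.sort(rng.random(n))[::-1]
    return x, y

def auc(x, y):
    """Trapezoidal area under the front."""
    return float(((y[:-1] + y[1:]) / 2 * np.diff(x)).sum())

rows = [{"Treatment": t, "Statements": s, "AUC": round(auc(*front(t, s)), 2)}
        for t in ["T1", "T2"] for s in ["A1", "A2", "B1", "B2"]]
print(pd.DataFrame(rows))
```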

Bonus challenge: work out a good, literature-supported way for this kind of problem to attach an uncertainty measure (confidence intervals or compatibility intervals) to the AUC.
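One literature-standard option is a nonparametric percentile bootstrap over participants. A sketch, assuming responses live in a long-format table and the front can be recomputed per replicate; recompute_front and toy_front are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

def auc(x, y):
    """Trapezoidal area under a front given as sorted (x, y) points."""
    return float(((y[:-1] + y[1:]) / 2 * np.diff(x)).sum())

def bootstrap_auc_ci(responses, recompute_front, n_boot=1000, alpha=0.05):
    """Percentile-bootstrap interval for AUC: resample participants with
    replacement, recompute the front each time, take empirical quantiles."""
    groups = dict(list(responses.groupby("participant")))
    ids = list(groups)
    aucs = []
    for _ in range(n_boot):
        draw = rng.choice(ids, size=len(ids), replace=True)
        resampled = pd.concat([groups[p] for p in draw])  # keeps duplicates
        aucs.append(auc(*recompute_front(resampled)))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Demo with fake data and a toy front (sorted per-participant agreement rates).
df = pd.DataFrame({"participant": np.repeat(np.arange(40), 30),
                   "agree": rng.integers(0, 2, 1200)})

def toy_front(d):
    rates = np.sort(d.groupby("participant")["agree"].mean().to_numpy())[::-1]
    return np.linspace(0, 1, len(rates)), rates

print(bootstrap_auc_ci(df, toy_front, n_boot=200))
```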

markwhiting commented 9 months ago

[results figure: untitled-18]

Done! Overall, we see a healthy amount of difference between design points and, for the most part, a reassuring similarity between populations. We also ran the same sample with GPT, which is less exciting at this point.