Closed — markwhiting closed this 9 months ago
We probably want to check each of these cells to ensure that we aren't making bad assumptions about how the experiment will generalize.
 | same population | different population
---|---|---
same design point | checks for repeatability | design point reliability
different design point | people's reliability | actual integrative sampling
Might also want to test how different sets of statements $q$ drawn from a single point in $Q$ are or are not repeatable with a similar design.
What would the treatments for this case look like?
I think the treatment is rather simple:
In analysis we do:
I think the design points should be:
I think we should try the new statements and see how they go.
A figure for this might have something like:
We could also compare treatments numerically, e.g., with an AUC (area under the curve), which in this case is the area under the front of largest cliques. Perhaps then make a table grouped by sample and treatment (A1.. and T1...), with the AUC for each row.
Treatment | Statements | AUC |
---|---|---|
1 | A1 | .5 |
1 | B1 | .8 |
... | ... | ... |
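A minimal sketch of how the AUC column could be computed, assuming each front of largest cliques is stored as a pair of (x, y) arrays with sorted x values (the data shapes and example values here are hypothetical, not from the actual experiment):

```python
def trapezoid_auc(xs, ys):
    """Trapezoid-rule area under a curve, given x values in ascending order."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])))

# hypothetical fronts, one per (treatment, statement set)
fronts = {
    ("1", "A1"): ([0.0, 0.5, 1.0], [0.0, 0.6, 0.9]),
    ("1", "B1"): ([0.0, 0.5, 1.0], [0.0, 0.8, 1.0]),
}

# one row per (treatment, statement set), matching the table above
rows = [(t, s, trapezoid_auc(xs, ys))
        for (t, s), (xs, ys) in sorted(fronts.items())]
for t, s, auc in rows:
    print(f"{t} | {s} | {auc:.2f}")
```

If the front is a step function rather than piecewise-linear, a left-Riemann sum would be the more faithful choice, but the grouping logic stays the same.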
Bonus challenge: work out a good, literature-supported way to attach an uncertainty measure (confidence intervals or compatibility intervals) to the AUC for this kind of problem.
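One standard, literature-supported option here is a percentile bootstrap (Efron-style resampling) over per-replicate AUCs. A sketch, assuming we have several AUC replicates per treatment (the `aucs` values below are made up for illustration):

```python
import random

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v),
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for a statistic of a list of values."""
    rng = random.Random(seed)
    boots = sorted(stat([rng.choice(values) for _ in values])
                   for _ in range(n_boot))
    lo = boots[int((alpha / 2) * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# hypothetical per-replicate AUCs for one treatment
aucs = [0.52, 0.61, 0.55, 0.58, 0.49, 0.63]
lo, hi = bootstrap_ci(aucs)
print(lo, hi)
```

BCa intervals would be the more careful variant if the AUC distribution turns out to be skewed; the percentile version is just the simplest place to start.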
Done! Overall, we see a nice amount of difference between design points and mostly a nice similarity between populations. We also ran the same sample with GPT, which is less exciting at this point.
How repeatable is a given $P$ or $Q$? That is to say, does the population or the corpus play a bigger role in predicting a new $(P, Q)$ point?
One way to trial this would be to sample one population on two different corpus selections, then sample two different populations on the same corpus, and compare the resulting differences.