Watts-Lab / commonsense-platform

Commonsense platform
https://commonsense.seas.upenn.edu

Integrative experiment registration #87

Closed markwhiting closed 2 months ago

markwhiting commented 9 months ago

A preregistration based on the As Predicted template.

Registration

Data collection. Have any data been collected for this study already?

Yes, we already collected the data.
No, no data have been collected for this study yet.
It's complicated. We have already collected some data but explain in Question 8 why readers may consider this a valid pre-registration nevertheless.
(Note: 'Yes' is not an accepted answer.)

Hypothesis. What's the main question being asked or hypothesis being tested in this study?

What types of claims are the most commonsensical, given a taxonomy of claims? Our hypotheses reflect those in our existing analysis: https://osf.io/9kxt2/

Dependent variable. Describe the key dependent variable(s) specifying how they will be measured.

Metrics defined in https://osf.io/9kxt2/:

  1. Individual and Statement commonsensicality
  2. PQ common sense

With the addition of:

  1. The definite integral of PQ common sense
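As a rough illustration of the added metric, a definite integral of a sampled curve can be approximated with the trapezoidal rule. The sketch below assumes the PQ common sense curve is available as paired points; the function name `area_under_curve` and the example values are hypothetical, and the actual curve construction is the one defined in https://osf.io/9kxt2/.

```python
def area_under_curve(xs, ys):
    """Trapezoidal-rule estimate of the definite integral of y over x.

    xs, ys: equal-length sequences of curve coordinates (assumed inputs;
    how the PQ common sense curve is built is defined in the OSF document).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points of equal length")
    pairs = sorted(zip(xs, ys))  # integrate left to right
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(pairs, pairs[1:])
    )


# Hypothetical curve values, for illustration only.
example_area = area_under_curve([0.0, 0.5, 1.0], [1.0, 0.6, 0.0])
```

A single scalar per design point makes the curve easier to use as a prediction target than the full curve would be.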

Conditions. How many and which conditions will participants be assigned to?

Conditions are design points in the space of possible statement types (not all of which will be sampled):

$2^6 \times 13 \times 7 = 5{,}824$ total design points.
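The count can be checked by enumerating the full factorial space. This is a minimal sketch: the registration gives only the level counts (six binary dimensions, one with 13 levels, one with 7), so every factor name below is a placeholder, not a name from the project.

```python
from itertools import product

# Placeholder factor names; only the level counts come from the registration.
levels = {f"binary_{i}": [0, 1] for i in range(1, 7)}  # six two-level dimensions
levels["factor_13"] = list(range(13))                  # one 13-level dimension
levels["factor_7"] = list(range(7))                    # one 7-level dimension

# Every combination of levels is one design point.
design_points = [
    dict(zip(levels, combo)) for combo in product(*levels.values())
]

print(len(design_points))  # 5824
```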

The stimulus for each design point will be a single set of 15 statements randomly sampled from an updated version of the corpus in https://osf.io/9kxt2/. If design points don't contain enough statements, new statements will be generated with a language model. A date stamped version of the corpus, design point samples, and acquisition pipeline is available at https://github.com/Watts-Lab/commonsense-statements.
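The sampling step might look like the following sketch. It assumes the statements for a design point arrive as a plain list; `sample_stimulus` and its `seed` parameter are hypothetical and are not taken from the commonsense-statements pipeline. The shortfall it reports is the number of statements that would need to be generated.

```python
import random

STATEMENTS_PER_POINT = 15  # set size stated in the registration


def sample_stimulus(statements, k=STATEMENTS_PER_POINT, seed=0):
    """Return (sample, shortfall) for one design point.

    If the design point holds at least k statements, draw k without
    replacement. Otherwise return all of them plus the number still
    missing, which would have to be generated separately.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    if len(statements) >= k:
        return rng.sample(statements, k), 0
    return list(statements), k - len(statements)


# Hypothetical corpus slice for one design point.
sample, shortfall = sample_stimulus([f"statement {i}" for i in range(40)])
```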

Analyses. Specify exactly which analyses you will conduct to examine the main question/hypothesis.

  1. We will compare core metrics of commonsensicality across grouping variables with the results in https://osf.io/9kxt2/. We will report 95% confidence intervals to evaluate the similarity between results.
  2. We will train a model to predict the area under the PQ common sense curve for each sampled design point and evaluate the model's out-of-sample predictive accuracy on future design points using $Q^2$. We will do this progressively as we sample more design points.
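The registration does not spell out a formula for $Q^2$. One common convention, assumed here, is one minus the ratio of the squared prediction error on held-out design points to their squared deviation from the training-set mean. The function name and arguments below are hypothetical.

```python
def q_squared(y_true, y_pred, train_mean):
    """Out-of-sample Q^2 under an assumed convention.

    y_true:     observed areas for held-out design points
    y_pred:     model predictions for the same design points
    train_mean: mean observed area in the training data (the baseline)
    """
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - train_mean) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("held-out values equal the baseline; Q^2 is undefined")
    return 1 - ss_res / ss_tot
```

Under this convention a value of 1 means perfect prediction, 0 means no better than predicting the training mean, and negative values mean worse than that baseline.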

Outliers and Exclusions. Describe exactly how outliers will be defined and handled, and your precise rule(s) for excluding observations.

We will exclude data of participants who provide incomplete responses or fail to meet attention checks in the survey tool.
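Expressed as a filter, the rule could look like the sketch below. The record layout (`responses`, `passed_attention_checks`) is an assumption made for illustration and is not the survey tool's actual export format.

```python
def keep_participant(record, n_items=15):
    """Return True if a participant's data should be retained.

    record: dict with hypothetical keys
      'responses'               - list of answers, None where unanswered
      'passed_attention_checks' - bool reported by the survey tool
    """
    responses = record["responses"]
    complete = len(responses) == n_items and all(r is not None for r in responses)
    return complete and bool(record["passed_attention_checks"])


# Made-up records, for illustration only.
records = [
    {"responses": [1] * 15, "passed_attention_checks": True},
    {"responses": [1] * 14 + [None], "passed_attention_checks": True},
    {"responses": [1] * 15, "passed_attention_checks": False},
]
retained = [r for r in records if keep_participant(r)]
```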

Sample Size. How many observations will be collected or what will determine sample size?

No need to justify decision, but be precise about exactly how the number will be determined.

We aim to sample at least 100 participants per design point. We intend to stop sampling when our $Q^2$ stabilizes for new design points, that is, when adding more training data no longer improves accuracy in an out-of-sample prediction of a new design point.
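The registration does not define "stabilizes" numerically. One possible operationalization, offered only as a sketch, is to stop once the best $Q^2$ over the most recent batches exceeds the value before them by less than a tolerance. The `tol` and `patience` values are hypothetical and would need to be chosen by the study team.

```python
def has_stabilized(q2_history, tol=0.01, patience=3):
    """Check whether out-of-sample Q^2 has stopped improving.

    q2_history: Q^2 values in the order design points were added
    tol:        smallest gain still counted as an improvement (assumed)
    patience:   number of most recent values to examine (assumed)
    """
    if len(q2_history) < patience + 1:
        return False  # not enough history to judge
    reference = q2_history[-(patience + 1)]  # value just before the window
    recent_best = max(q2_history[-patience:])
    return recent_best - reference < tol
```

For example, a history of `[0.5, 0.6, 0.605, 0.602, 0.604]` would count as stabilized with the defaults, while `[0.1, 0.3, 0.5, 0.6]` would not.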

Other. Anything else you would like to pre-register?

(e.g., secondary analyses, variables collected for exploratory purposes, unusual analyses planned?)

We intend to register predictions for each design point before sampling it.

Name. Give a title for this AsPredicted pre-registration.

Suggestion: use the name of the project, followed by study description.

World scale evaluation of common sense

Type of study.

Class project or assignment; Experiment; Survey; Observational/archival study; Other:

Data source

Prolific; MTurk; University lab; Field experiment / RCT; Other:

markwhiting commented 8 months ago

7 sources of claims: Category prompt, Situation prompt, ConceptNet, Atomic, News media, Campaign emails, Aphorisms.

Update this to talk about direct elicitation, in-the-wild use, and corpus (which we will probably deemphasize in the future), and add GPT as an additional construct here.

markwhiting commented 8 months ago

We aim to sample at least 100 participants per design point. We intend to stop sampling when our $Q^2$ stabilizes for new design points, that is, when adding more training data no longer improves accuracy in an out-of-sample prediction of a new design point.

We should state a goal but also add discussion of future batches, i.e., that we might find a better sample size and adjust accordingly.

markwhiting commented 5 months ago

Shift to do all points at start.

Collect more points on the back end.