Analysis of Reproduction Attempts

Peter-Kedron commented 1 year ago

@SarahBardin , can you also recreate both Table 7s using only the subset of n=102 respondents that reported attempting a reproduction. I am starting to write this section using the current table because the overarching picture won't change, but I will need the final real numbers eventually. Thanks.

Peter-Kedron commented 1 year ago

@SarahBardin and @josephholler, For discussion Thursday, I would like to make some kind of collective study outcome measure from Q11. I am thinking of two conservative measures and a liberal one. Most conservative = Identical-All Less conservative = Identical-All + Partial All Liberal = Identical-All + Partial All + Identical-Some + Partial-Some

My inclination is to use the less conservative and the liberal in the analysis. The most conservative is closer to computational reproduction. I think the less conservative captures close the idea of checking for support/error, particularly because we gave the similar result same conclusion explanation in the question.

Problem is I don't think we can add these because they are not exclusive of one another.

I think I would like to cross that with the access measures from Q12. So how many studies that reported having data, procedure, and environment info reported which of the outcome measures constructed from Q11. P(Q12|Q11).

Need to talk about coding of Q10 (why did you attempt) and Q12a (why did you submit for pub).

Peter-Kedron commented 1 year ago

@SarahBardin, I did some work on these questions today, so wait on doing and coding your end so you don't waste time.

Peter-Kedron commented 1 year ago

@SarahBardin and @josephholler, After some inspection from my end, here is what I am thinking for the heart of the analysis of the actual reproduction attempts.

1) Let's construct an outcome variable for the reproduction attempts that has the following groupings.

Conservative Reproduction (CONS) (11-1=ALL & 11-2=None) -> All results (R) & conclusions (C) are identically reproduced

(11-1=ALL & 11-2=Some) ->All R & C identically reproduced, but only some R&C are partially reproduced
(11-1=All & 11-2=All) -> All R & C identically and partially reproduced
(11-1=Some & 11-2=ALL) -> some R&C are identically reproduced, but all R&C are partially reproduced
(11-1=None &11-2=ALL). -> no R&C are identically reproduced, but all R&C are at least partially reproduced.

The 2nd to 4th parts of that equation are a little odd, but some respondents did give those answers (not too many). However, I think this roughly collectively captures the idea that "All results and conclusions in the study are at least partially reproduced"

Liberal Reproduction (LIB)
CONS+ (11-1=Some & 11-2=Some) -> Some R&C identically reproduced and some R&C partially reproduced

(11-1=Some & 11-2=None) -> Some R&C identically reproduced, but no R&C partially reproduced
(11-1=None & 11-2=Some) -> No R&C identically reproduced, but some R&C partially reproduced

This liberal definition would include CONS, but we can code it to include only the three new part and just add CONS. This coding would allow us to distinguish between the two. I think this roughly collectively captures the idea that "At least some results and conclusions from a study are partially reproduced"

No Reproduction (NONE) (11-1=None, 11-2-None) -> Nothing was even partially reproduced

We can haggle on the groupings if you like.

2) I am thinking we can look at those as part of a conditional probability calculation. For example, P(CONS| D,P,E). That is given data (D, from 12-1), procedure (P, from 12-2), and environment (E, from 12-3). Now D, P, and E all have responses for All, Some, None. What I did quickly was create the full set of 27 possible combinations and count the responses for each of the three outcomes CONS, LIB, and NONE. That way we can build a range of conditional probabilities from the sample to ask a bunch of questions. We can also subset by subfield and approach - although the subgroups get small.

So the second step it to code out the 27x6 table that presents the permutations as an intermediate table we can explore and slice into conditional probabilities. Let's do it for the whole sub-sample first.

3) We don't need to code 12a as there are not enough quality responses to make it worth anything.

4) We can read and devise a simple coding system for Q10, which tracks different motivations for the attempts.

Presentation wise in the paper, I am thinking P1) present the sub-sample descriptives. Slip in how many submitted for publication P2) present the motivation (Q-10 coding) P3) a) present a sentence or two of descriptives on access to research artifacts (D,P,E). b) present the outcome descriptives. c) Slice by subfield/approach if interesting P4-...) present the interesting conditionals (accompany with table or chart summarizing the data). Maybe subset also by those submitted for publication (e.g., were more non-reproductions submitted, or was reproduction a check of own work prior to submission)

SarahBardin commented 1 year ago

@Peter-Kedron and @josephholler , I've pushed a new analysis_10_coding.csv coding workbook to the public data folder. If you could save a copy with your initials and push to GitHub when done, I'll reconcile differences on Monday/Tuesday.

Here is the proposed coding scheme, which creates 4 categories instead of Peter's proposed 3 (plus a missing category) 1) "Verification/Peer-Review" -- anyone who did a reproduction to check another researcher's or published work to check the accuracy of the results 2) "Self-check/Promote transparency of own work" -- anyone who did a reproduction of their own work 3) "Interest/Curiosity" -- anyone who did a reproduction due to interest or curiosity on the method or topic 4) "Missing" -- missing values or otherwise uncodable reasons.

-Sarah

SarahBardin commented 1 year ago

@Peter-Kedron and @josephholler , having now completed the coding, I think we should actually use a different scheme. See revised coding schema below:

1) Verification/Peer-Review 2) Self-check/Promote transparency of own work 3) Replication 4) Teaching/Learning 5) Missing

I realized that several explanations were related to wanting to reuse a method or approach in a new area or with new data, and that many explanations were more about the reproduction serving as an opportunity to deepen knowledge/understanding of a method or data set or teaching good practices to students, rather than to validate the work.

Peter-Kedron commented 1 year ago

@SarahBardin Uploaded my coding. Sorry, I forgot to put it up yesterday.

josephholler commented 1 year ago

I coded Q10 in analysis_10_coding_jh.csv. I initially followed @SarahBardin 4-class + missing scheme enumerated above (in rp_intent column).

I then split out each code into a separate variable because of the complexity of some responses as follows: 1.0 Verification of any research internal validity 1.1 Verification of other's research internal validity 1.2 Verification of own research or own lab's research internal validity

Open science, reproducibility, transparency, or verification thereof
Replication / comparison of research results over time/space/scale/methodology (external validity) 3.2 Specifically replicating one's own research
teaching/learning/interest in other methods 4.1 reapplication of methods independent of verifying/validating results
. In the expanded scheme, I created an overall verification aim (1) with sub-aims 1.1, verify others and 1.2, verify own. I reserved category 2 for open science, transparency, and verification of reproducibility (but not research quality or results).

SarahBardin commented 1 year ago

Thanks, @Peter-Kedron and @josephholler ! I've cleaned up the folder structure a bit and have added a full_q10_coding.csv file to the analysis data file folder. I've retained our primary coding (rp_intent, SB_rp_intent, and PK_rp_intent) and included variable that identify if there were any discrepancies across our three coding (any_diff) and whether we had zero agreement across the 3 of us as to how to code a response (no_agreement). Luckily, it is very rare that we had no agreement on how to code a response, meaning in most cases 2 of the 3 of us were in agreement about how to code the responses, and for a majority of responses all 3 of us were in agreement (any_diff == 0).

I've also kept Joe's additional coding scheme and our notes in this version.

-Sarah

Peter-Kedron commented 1 year ago

@SarahBardin and @josephholler, if either of you go in and edit the overleaf results sections remember to pull check the Git Pull first. I just pulled and pushed from it at 4pm MST, so it should be stable now. Just a reminder because I know you two aren't in there as much, and I have messed it all up before.

josephholler commented 1 year ago

The main challenges in the 4-category coding system are:

someone is clearly checking/verifying results, but it is not clear if the results are their own (2) or others' (1)
there is clearly a repetition of the analysis by changing data resolution, context, classification system, or other methodological parameter(s), but is it an intentional replication (3), and to verify one's own work (2), or others' work (1)?

josephholler commented 1 year ago

I added a column consensus for proposed resolution of conflicts, and supplanted my notes with thoughts on conflicts in this commit: 7847f127a22e67d425e0ed8f906de1f52744cdf0

SarahBardin commented 1 year ago

@Peter-Kedron, I've pushed new Table 7 output to the results folder that reflects the final consensus coding. I also slacked you some clarifying questions. Let me know if you need me to make any adjustments to the current output.

SarahBardin commented 1 year ago

We determined on Thursday that the existing tables were sufficient and no further disaggregation of information was needed.

HEGSRR / OR-Reproducibility-in-Geography-Survey

Analysis of Reproduction Attempts #7