@lbeziaud Thanks for your submission; we'll assign an editor soon. Very sorry for the long delay.
@oliviaguest Would you like to edit this submission? Note that this is a failed replication and we might need to contact the original authors at some point if they want to give extra information.
Note that @LukasWallrich proposed himself as a reviewer (thanks).
Happy to! 😊
@lbeziaud Quick comment on the paper: you re-use the original figures, which is good and eases comparison. The problem is that we're not allowed to do that because of the copyright of the original paper (unless it is CC-BY). Could you try to contact the journal to ask for permission to re-use the figures (explaining that the original article will be cited properly)?
> The problem is that we're not allowed to do that because of the copyright of the original paper (unless it is CC-BY). Could you try to contact the journal to ask for permission to re-use the figures (explaining that the original article will be cited properly)?
Thank you for noticing. I did not think about this issue and will contact the journal. I can produce a version without the original figures if necessary.
> Note that this is a failed replication and we might need to contact the original authors at some point if they want to give extra information.
Regarding the failed replication, only a subset of the model (however central to the conclusions of the original paper) was not reproduced successfully (due to our misunderstanding of some instructions). We originally thought about submitting as [~Re]. We tried to contact the first two authors in May 2021 for clarification but received no answer.
It would be best with the original figures, but we can do without them if you don't obtain permission.
@rougier We contacted the editors and were redirected to Wiley which asks for $500 for reproduction rights. We removed the original figures from our paper.
That's unbelievable. If you want, you can leave an empty box in place of the original figure and write something like "The original figure cannot be reproduced here because Wiley asks $500 for reproduction rights".
I was not expecting such an answer from the publisher... What I did was use the space to increase the size of our figures, add page references to the original figures, and add a footnote explaining that we did not receive permission to include them (without mentioning the outrageous $500).
I found what seems to be a preprint at: https://cepa.stanford.edu/sites/default/files/wp15-04-v201712.pdf. I think you can reuse the figures from this one, which predates the publication. I'll check Stanford's policy on preprints but I think it should be ok.
Not clear, I'll ask them.
> In consideration for your agreement to the terms and conditions contained here, Stanford grants you a personal, non-exclusive, non-transferable license to access and use the Sites. User may download material from the Sites only for User's own personal, non-commercial use. User may not otherwise copy, reproduce, retransmit, distribute, publish, commercially exploit or otherwise transfer any material. The burden of determining that use of any information, software or any other content on the Site is permissible rests with User.
Thank you for the help investigating copyright issues! I added a link to this preprint in the footnote I mentioned. The figures are actually black-and-white versions of the ones published by Wiley.
I think you can include them.
> I think you can include them.
The submission has been updated to include the figures from the preprint as you suggested, with updated references.
@LukasWallrich and @lbeziaud do you have any suggestions for a potential 2nd reviewer? 😊
@oliviaguest Tristan Allard asked Julien Rossi for a recommendation: Thomas Soubiran might be interested
Unfortunately, I don't have suggestions for a second reviewer. However, I had a look at the manuscript and code - I enjoyed reading and reviewing it. In general, the code is largely clear and makes sense, while some bits of the paper (especially the results) definitely need to be clarified. Below, I list all my suggestions for improvements - if you don't like any of them, feel free to explain why that is the case. Also, obviously, please let me know if anything is unclear.
The README is very helpful. It might be good to include instructions for Windows (venv\Scripts\activate instead of source venv/bin/activate), but that's not essential. I could install the dependencies & appreciate that you added a licence.
I can generally follow the code and can re-create the figures.
More comments would be good - the ToDo comment should be checked and removed (if that difference matters and is unresolved, mention it in the paper)
Some variable names require guessing - e.g., expyield, sach & cach - I'd encourage you to use longer and clearer names, as recommended by most style guides, but if you don't like that, comments can also help. Also, when you abbreviate statistical concepts (mean, err, etc.) it'd be great to always separate them (ach_mean rather than mach - unless I misunderstand mach).
Could you add sub-titles? The key benefit of that format over .py seems to be navigability - yet currently it is very hard to see what each block does. At least, please indicate which section replicates which of your figures - that is the most likely thing a reader of your code would like to find out.
Add abstract - and clarify claim in the introduction: not able to replicate all. Do you think your results raise serious doubts about their conclusions?
Model description: "a percentage of the (best) students is admitted" - do they just admit the best ranked applicants, with the number of offers determined by the historical acceptance rate? If so, that could be clarified without making your text any longer.
Please clarify which authors you contacted - currently it reads like the authors of [3]. However, I expect that you also contacted the authors of [1] and think that this should be included.
4 SES - why is the third bullet point quoted relevant? Could you clarify "no success"?
How sensitive are your results to your assumption re college quality reliability? It does not seem implausible that this would be lower than own achievement reliability (though I would concur with your guess if we need to choose).
Quality update window: did 5 years not yield satisfying results (the first statement is - just about - compatible with updating a running average, so that would seem to be the assumption closer to the reporting)
Initial college quality: clarify that 130 is the standard deviation (currently it sounds like it is the variance)
Admission probability estimation: why does this need to refer to previous years? (obviously the original paper will say so, but a half-sentence here wouldn't hurt)
Could you include a table with all parameters as an appendix (also the ones clearly reported)? Given that the original paper did evidently not report the full specification, it would be helpful to have it here.
Purely a matter of taste - but might it be better to compare original results (a) with replication results (b) - that would also make the text flow more smoothly
Also, could you align the formatting more closely (ok to keep color, but particularly the scale labels should be identical, and the axes ideally be aligned so that visual comparison is easier ... or you could use the original b/w formatting so that the same legend applies to both. If you keep color, please add a legend to the figure, they should stand alone without having to read the text)
Figure 4: what is shown here? For enrolment at top-4 colleges, I would expect four lines. Probably this shows multiple runs - but could that be more prominent?
Figure 6:
- what is the logic of only basing this one on several runs? Also, why not show all the results? (Are the results means?)
- More importantly, their left-most arrow is special - you need to explain that. Did you have any students that did not enrol? (Based on the model description, those would only be students who got no offers?) What did you do with them? Maybe you should add them to this figure.
- Caption: colors seem to be switched (orange is using), and the caption reads as if the colors show whether the top 4 schools use affirmative action - but that is not right, so please clarify
Figure 7 - align figures for comparability (probably by splitting theirs into two rows like yours)
Figures 8 & 9: their version with stacked bars appears a lot more comprehensible to me - so I would suggest sticking to that. Alternatively, please find another way where the reader does not need to compare small variations in bar height all the way across the page
If you have code to recreate the full set of figures in the original paper, my tendency would be to include that into the Notebook - in case others want to build on your work (or just use a cc-licenced version of a figure)
I think your figure numbering changed at some point? At least I got confused - probably the first two comments below just require correct references?
Figure 3: I don't understand your conclusion - it seems that your model shows greater impact on minority students?
Figure 4: here the replication clearly differs from the original - at least more than in 3 - so I don't understand how the combination is satisfactory?
Can you clarify the size of the differences between your and their results - for instance in terms of the shift in academic achievement z-scores based on SES required to match "real-world" race-based affirmative action? Something like that would seem to be needed to see whether your results should change the interpretation of their paper.
You conclude that SES affirmative action was weaker than expected - two potential reasons (you have likely thought about them, but then please say so):
- What happens if you don't truncate the SES weights, and actually penalise high-SES students? I think your approach makes more sense, but could imagine that the original implementation did not do that, thus strengthening the effects they got?
- Footnote 3: might this be a difference between your results and the original results? If so, could that explain the difference you observe in Figures 8 and 9?
I would have expected you to conclude that open code is important? :)
Using 40 colleges is quite a small sample - and the trajectories differ widely. In terms of assessing robustness, might that be a parameter worth experimenting with?
In your figures 2-4, you have a notable divergence at the very start when the blue and orange lines move away from each other, while that is not evident in the original. Is that just a quirk of your random seed?
Section 2: completely -> complete
Section 4, bottom: he -> the
Initial college quality: relies -> relied
Also @oliviaguest you might need to make an editorial decision regarding how to label partial replications (or conceptual replications with minor discrepancies). I scanned #51 and they seem to report (in their initial submission) that they found quite a few discrepancies but generally concordant results as [Re] ... here that is proposed as [-Re] ... maybe there needs to be something in between?
Something like [~Re]
or [∂Re]
?
@tsoubiran would you have the time and be interested in reviewing this paper? 😊
@lbeziaud have I missed the part where you explain whether you have contacted the original authors — and what they said? It might be useful to state some of that here in the review. And let me know if you would indeed like to include them in this process. 😌
@oliviaguest We wrote to Sean Reardon on 25/05/21 and to Rachel Baker on 28/05/21, asking if they would be comfortable sharing their code. As stated in the paper, we received no answer. It would indeed be interesting to get their input if possible.
Maybe @oliviaguest (with the ReScience Editor in Chief hat) can contact them at the end of the review to try to get some feedback or comments.
Indeed, very good idea and I was planning on it. What do we (editors and authors, @rougier @khinsen @benoit-girard and @lbeziaud) think about maybe asking the original authors to even review it (as a 3rd reviewer, not as the 2nd)?
Thank you @LukasWallrich for the extended review. We updated the paper (commit and pdf diff with additions in blue) and the code (commit) on the basis of your comments. We copy-paste them below and provide a concise answer to each.
> Unfortunately, I don't have suggestions for a second reviewer. However, I had a look at the manuscript and code - I enjoyed reading and reviewing it. In general, the code is largely clear and makes sense, while some bits of the paper (especially the results) definitely need to be clarified. Below, I list all my suggestions for improvements - if you don't like any of them, feel free to explain why that is the case. Also, obviously, please let me know if anything is unclear.
Repo
> The README is very helpful. It might be good to include instructions for Windows (venv\Scripts\activate instead of source venv/bin/activate), but that's not essential. I could install the dependencies & appreciate that you added a licence.
We are sorry, but we do not have a Windows license to test and port the instructions.
Code
> I can generally follow the code and can re-create the figures.
> More comments would be good - the ToDo comment should be checked and removed (if that difference matters and is unresolved, mention it in the paper)
I added comments and removed the TODO.
> Some variable names require guessing - e.g., expyield, sach & cach - I'd encourage you to use longer and clearer names, as recommended by most style guides, but if you don't like that, comments can also help. Also, when you abbreviate statistical concepts (mean, err, etc.) it'd be great to always separate them (ach_mean rather than mach - unless I misunderstand mach)
I renamed the variables.
Notebook
> Could you add sub-titles? The key benefit of that format over .py seems to be navigability - yet currently it is very hard to see what each block does. At least, please indicate which section replicates which of your figures - that is the most likely thing a reader of your code would like to find out.
I restructured the notebook with sections. Most data and plot methods are now in utils.py, and results are saved.
Paper
> Add abstract - and clarify claim in the introduction: not able to replicate all. Do you think your results raise serious doubts about their conclusions?
Added an abstract and clarified the introduction (no doubts are raised about the conclusions of the original paper, but the results highlight the general need to publish experimental artifacts openly).
> Model description: "a percentage of the (best) students is admitted" - do they just admit the best ranked applicants, with the number of offers determined by the historical acceptance rate? If so, that could be clarified without making your text any longer.
Clarified, thank you.
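For concreteness, here is a minimal sketch of the admission rule as we understand it (hypothetical function and variable names, not the repository's exact code): colleges rank applicants by perceived achievement and extend offers from the top, with the number of offers equal to the number of seats divided by the historical yield.

```python
import numpy as np

def admit(perceived_achievement, n_seats, expected_yield):
    """Return a boolean mask of applicants who receive an offer."""
    # Extend more offers than seats, anticipating that only a fraction enroll.
    n_offers = min(int(n_seats / expected_yield), len(perceived_achievement))
    ranked = np.argsort(perceived_achievement)[::-1]  # best-ranked applicants first
    offers = np.zeros(len(perceived_achievement), dtype=bool)
    offers[ranked[:n_offers]] = True
    return offers
```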
> Please clarify which authors you contacted - currently it reads like the authors of [3]. However, I expect that you also contacted the authors of [1] and think that this should be included.
Clarified: we contacted the first two authors of [1] (who are the first and last authors of [3]).
Parameters
> 4 SES - why is the third bullet point quoted relevant? Could you clarify "no success"?
I removed the third bullet point. An experiment has been added (Appendix A, Fig 12) to illustrate "no success". This is not fully satisfactory, as the result is a much smaller impact than in the original model.
> How sensitive are your results to your assumption re college quality reliability? It does not seem implausible that this would be lower than own achievement reliability (though I would concur with your guess if we need to choose).
Added an experiment (Fig 13, Appendix A) highlighting the negligible impact of assuming a typo (max 0.9) or not (max 0.7).
> Quality update window: did 5 years not yield satisfying results (the first statement is - just about - compatible with updating a running average, so that would seem to be the assumption closer to the reporting)
Clarified. Using only the previous year (as per the model description) to update colleges' quality gives satisfying results. We assume the 5-year window is used to smooth some plots in the original paper.
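To make that reading concrete, a minimal sketch (hypothetical names, not the repository's code) where a college's quality is the mean achievement of its previous incoming class, with an optional smoothing window:

```python
import numpy as np

def update_quality(incoming_achievement_by_year, window=1):
    """Quality = mean achievement of the last `window` incoming classes.

    With window=1 this matches the model description (previous year only);
    a larger window would merely smooth the series, as the original plots
    appear to do.
    """
    recent = incoming_achievement_by_year[-window:]
    return float(np.mean([np.mean(year) for year in recent]))
```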
> Initial college quality: clarify that 130 is the standard deviation (currently it sounds like it is the variance)
Clarified. It is the standard deviation; we switched to the N(μ, σ²) notation.
> Admission probability estimation: why does this need to refer to previous years? (obviously the original paper will say so, but a half-sentence here wouldn't hurt)
Clarified. Admission probability is estimated by students using a logistic regression on past admission results, given the gap between the student's self-assessed achievement and the student's perception of the college's quality.
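A hedged sketch of that estimate (our reading of the model; the names are illustrative and the actual implementation lives in the repository):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_admission_estimate(past_gaps, past_outcomes):
    """past_gaps: per-application gap (self-assessed achievement minus
    perceived college quality) from previous years;
    past_outcomes: the corresponding 0/1 admission results."""
    X = np.asarray(past_gaps, dtype=float).reshape(-1, 1)
    return LogisticRegression().fit(X, past_outcomes)

# A student's estimated admission probability for a given gap:
# fit_admission_estimate(gaps, outcomes).predict_proba([[gap]])[0, 1]
```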
> Could you include a table with all parameters as an appendix (also the ones clearly reported)? Given that the original paper did evidently not report the full specification, it would be helpful to have it here.
Added 3 tables along the paper: student distribution, constants, and experimental variables. I omitted the other parameters, e.g., the formulas of the weights, because displaying them would overload the tables without helping to understand the model.
Figures
> Purely a matter of taste - but might it be better to compare original results (a) with replication results (b) - that would also make the text flow more smoothly
Done.
> Also, could you align the formatting more closely (ok to keep color, but particularly the scale labels should be identical, and the axes ideally be aligned so that visual comparison is easier ... or you could use the original b/w formatting so that the same legend applies to both). If you keep color, please add a legend to the figure; figures should stand alone without having to read the text.
Done.
> Figure 4: what is shown here? For enrolment at top-4 colleges, I would expect four lines. Probably this shows multiple runs - but could that be more prominent?
The original caption was clearer: there is only one run, but there is one line per college. All 40 colleges are represented, but only 4 (darker lines) use affirmative action policies.
> Figure 6:
> - what is the logic of only basing this one on several runs? Also, why not show all the results? (Are the results means?)
Clarified. This is the only figure which requires averaging multiple runs (to reduce the noise).
> - More importantly, their left-most arrow is special - you need to explain that. Did you have any students that did not enrol? (Based on the model description, those would only be students who got no offers?) What did you do with them? Maybe you should add them to this figure.
Clarified (as a sentence; our figure is unchanged).
> - Caption: colors seem to be switched (orange is using), and the caption reads as if the colors show whether the top 4 schools use affirmative action - but that is not right, so please clarify
Done (black and white).
> Figure 7 - align figures for comparability (probably by splitting theirs into two rows like yours)
Done (to the best of my abilities, without weird LaTeX spacing and without manual cropping).
> Figures 8 & 9: their version with stacked bars appears a lot more comprehensible to me - so I would suggest sticking to that. Alternatively, please find another way where the reader does not need to compare small variations in bar height all the way across the page
Done.
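For reference, the stacked-bar layout in question, as a generic matplotlib sketch (illustrative only: made-up numbers and labels, not the paper's plotting code or data):

```python
import matplotlib.pyplot as plt

policies = ["baseline", "race-based", "SES-based"]
shares = {  # made-up illustrative shares, summing to 1 per policy
    "group A": [0.70, 0.55, 0.62],
    "group B": [0.10, 0.20, 0.15],
    "group C": [0.20, 0.25, 0.23],
}

# Stack each group's bar on top of the previous groups' totals,
# so shares can be read without comparing heights across the page.
bottom = [0.0] * len(policies)
for group, values in shares.items():
    plt.bar(policies, values, bottom=bottom, label=group)
    bottom = [b + v for b, v in zip(bottom, values)]
plt.ylabel("share of enrolled students")
plt.legend()
plt.show()
```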
> If you have code to recreate the full set of figures in the original paper, my tendency would be to include that into the Notebook - in case others want to build on your work (or just use a cc-licenced version of a figure)
The original figures were extracted (manually) from the PDF. I do not have the code to create them.
Results
> I think your figure numbering changed at some point? At least I got confused - probably the first two comments below just require correct references?
Clarified reproduction vs original figure numbering.
> Figure 3: I don't understand your conclusion - it seems that your model shows greater impact on minority students?
Nice spot, fixed.
> Figure 4: here the replication clearly differs from the original - at least more than in 3 - so I don't understand how the combination is satisfactory?
Clarified. Actually, the range (min, max) is ok, but the average is lower.
> Can you clarify the size of the differences between your and their results - for instance in terms of the shift in academic achievement z-scores based on SES required to match "real-world" race-based affirmative action? Something like that would seem to be needed to see whether your results should change the interpretation of their paper.
Without a clear understanding of their process, a comparison is difficult...
> You conclude that SES affirmative action was weaker than expected - two potential reasons (you have likely thought about them, but then please say so):
> - What happens if you don't truncate the SES weights, and actually penalise high-SES students? I think your approach makes more sense, but could imagine that the original implementation did not do that, thus strengthening the effects they got?
To illustrate this impact, I have added Fig 13 (Appendix A). This results in a much higher positive impact on minorities than the original model.
> - Footnote 3: might this be a difference between your results and the original results? If so, could that explain the difference you observe in Figures 8 and 9?
This would result in incomparable categories, so I did not try. The differences observed in Figs. 8 and 9 are more likely due to an issue in understanding the policies in the original paper.
Conclusion
> I would have expected you to conclude that open code is important? :)
Absolutely, done :)
Other questions
> Using 40 colleges is quite a small sample - and the trajectories differ widely. In terms of assessing robustness, might that be a parameter worth experimenting with?
I am not sure that this would be worth it. First, changing the number of colleges is not straightforward: the numbers of students, seats, and applications are given for 40 colleges, so changing this parameter may require adjusting other "internal" parts of the model. Second, we believe that increasing the number of colleges would lead to similar trends, because the observations on 40 colleges appear to be quite robust: the figures show largely identical trends for each college.
> In your figures 2-4, you have a notable divergence at the very start when the blue and orange lines move away from each other, while that is not evident in the original. Is that just a quirk of your random seed?
The first 15 years are a "burn-in period" (from the original paper) which allows the model to stabilize. This includes the logistic regression on the admission probabilities. We could play with the initial coefficients of the LR to adjust this, but did not consider this variation an issue (since it's a bootstrapping phase).
Minor language notes
> Section 2: completely -> complete
> Section 4, bottom: he -> the
> Initial college quality: relies -> relied
Fixed, thank you.
Thank you for making these changes. Things are a lot clearer now. Just a few more specific points.
I appreciate the tidy-up. However, it might be good to include one example of how to run an experiment in the notebook (or README if you prefer). Currently, one needs to dig quite deep into utils.py to figure that out, which makes it harder to reuse your code.
Also, it might be good to rename the file given that utils is a standard python library (I initially searched Google for the utils make_all() function :) )
In 2 you say that affirmative action policies are introduced by some colleges, mostly the top four. What does that mean? If some experiments involve different/more colleges, then that might need to be stated more clearly.
[Btw – I find it a bit strange that in any run, only one policy is adopted? An obvious question would be how policies interact – i.e. when different colleges focus on different policies in the same year, which is what will happen in reality. I am not suggesting that you now implement that here, but it might be worth pointing towards that in the need for further research.]
I understand that it is difficult to interpret the effect size / importance of the divergence – but I still think that should be addressed. Can you say what one would need to check to figure that out?
The difference in Figure 7 seems more important than other divergences. In your model, two of the top colleges do not appear to decline in quality at all. That contradicts a strong claim in the original paper: “As figure C5 makes clear, the top four ranked colleges, which begin using affirmative action in year 15, all experience a gradual decline in their quality ranking over time, as they continue to implement very aggressive affirmative action and recruitment strategies. It is unlikely that top colleges in the real world would so willingly sacrifice their quality ranking.” If you agree with my interpretation, that should be stated explicitly.
In the conclusion, you say that “more effort are required regarding this replication” – what does that mean?
Sorry for the delay and thank you again @LukasWallrich for your feedback. I updated the paper (commit and pdf diff) and the code (commit).
> Thank you for making these changes. Things are a lot clearer now. Just a few more specific points.
Code
> I appreciate the tidy-up. However, it might be good to include one example of how to run an experiment in the notebook (or README if you prefer). Currently, one needs to dig quite deep into utils.py to figure that out, which makes it harder to reuse your code.
The README has been updated to include additional info, with some code cleanup to (hopefully) make running experiments easier.
> Also, it might be good to rename the file given that utils is a standard python library (I initially searched Google for the utils make_all() function :) )
I could not find utils in the standard library (tested with python -I -m utils) but renamed the file to "utils_figs.py" :)
Paper
> In 2 you say that affirmative action policies are introduced by some colleges, mostly the top four. What does that mean? If some experiments involve different/more colleges, then that might need to be stated more clearly.
Removed "mostly the top four". The original paper present additional experiments with 10, 20 and 40 active colleges which I added in appendix.
> [Btw – I find it a bit strange that in any run, only one policy is adopted? An obvious question would be how policies interact – i.e. when different colleges focus on different policies in the same year, which is what will happen in reality. I am not suggesting that you now implement that here, but it might be worth pointing towards that in the need for further research.]
This is actually close to a question we are asking ourselves using this model, in the context of algorithmic fairness.
> I understand that it is difficult to interpret the effect size / importance of the divergence – but I still think that should be addressed. Can you say what one would need to check to figure that out?
The obvious approach is to try (again) to access the original code, the alternative being to identify the actual source(s) of our misunderstanding. We are not domain experts, but to interpret the impact I would implement your aforementioned suggestion of mixing policies: having some colleges obey our interpretation and others the original would highlight the actual divergence in terms of impact on students, racial and SES groups, and colleges. That being said, I think the divergence results from technical misunderstandings, which might not warrant a more in-depth interpretation (but rather a clarification).
> The difference in Figure 7 seems more important than other divergences. In your model, two of the top colleges do not appear to decline in quality at all. That contradicts a strong claim in the original paper: “As figure C5 makes clear, the top four ranked colleges, which begin using affirmative action in year 15, all experience a gradual decline in their quality ranking over time, as they continue to implement very aggressive affirmative action and recruitment strategies. It is unlikely that top colleges in the real world would so willingly sacrifice their quality ranking.” If you agree with my interpretation, that should be stated explicitly.
This is indeed an important distinction between our results. Figures in previous submissions were the result of only one run. I updated all of them to an average of 10 runs (as in the original paper), and the observation is still valid. I assume this is the result of the lower impact of our policies, which allows colleges not to admit students with achievement as low as in the original work. I added a reference to this strong claim in our results section.
> In the conclusion, you say that “more effort are required regarding this replication” – what does that mean?
Reworded.
Details
- Page 4 – something is going wrong in the sentence “as in the first quite would”
- On p. 6 “probability admission” should be “admission probability”?
- At the end “for fairness applications” does not work.
- Be consistent whether you speak of “schools” or “colleges”
Fixed, thank you.
> - Your Figure 6 only has three black arrows. How do they represent 4 colleges?
Fixed. The rightmost arrow was outside the boundary.
> - In Figure 7, the order of the rows does not match – that confused me quite a lot when I tried to compare the results.
Fixed.
> - In Figure 10, the legends don’t match – and the distributions are strikingly different. Do you have the wrong chart from the original paper there?
Fixed. It was indeed the wrong chart 🙃. Thank you.
From your previous comments:
> More importantly, their left-most arrow is special - you need to explain that. Did you have any students that did not enrol? (Based on the model description, those would only be students who got no offers?) What did you do with them? Maybe you should add them to this figure.
"The left-most arrow captures students who do not enroll in college in our simulation." Figures have been fixed to show this arrow, noting however that we miss some context regarding this arrow (are unenrolled students admitted students, students having applied, etc.).
Thanks for making these updates - I think this all looks good now and I appreciate the clarifications in the paper. Tiny point: on top of p.6 (in the PDF diff) there is a broken Appendix reference (??). @oliviaguest please let me know if you need anything else from me.
@LukasWallrich no, this is fantastic. Unless the authors need anything, you're free. And we're extremely thankful. 🌝
> Tiny point: on top of p.6 (in the PDF diff) there is a broken Appendix reference (??).
Thank you for catching that. The broken reference is only present in the diff, and not in the article :) I'm still discovering latexdiff.
@LukasWallrich Thank you again for your great feedback. I think the paper and code gained a lot from it!
Hello, I switched to python 3.9 and updated the dependencies due to a security issue with joblib < 1.2.0. The code hasn't changed.
Please tell me if anything needs to be done on our side now :)
Any progress?
@lbeziaud @LukasWallrich I am currently wondering how to solve finding a reviewer. It has been so long, and I have tried so many people. I will try to tag some Python people who have signed up to review as a last attempt, perhaps.
oh, I think perhaps @cosimameyer might be appropriate as a reviewer, if the time and capacity allow? 😊
@oliviaguest Thanks for thinking of me - I’d love to support but have only limited capacity at the moment. To make sure that I’m able to deliver the review within a helpful scope, I’d love to learn more about the requirements and the typical time frame for the review 😊
@cosimameyer what's a suitable ETA for you? Thanks for replying. ☺️
@oliviaguest Thanks for asking! Depending on the request, I think the end of April/the beginning of May would work for me. I tried to understand what the review includes but couldn’t find detailed information - do you happen to have it at hand? 😊
@cosimameyer does this help? ☺️ https://rescience.github.io/edit/
@oliviaguest it does, thanks so much! And thanks for your patience, I'm happy to help with the review ☺️
Oh, wonderful! Thank you, @cosimameyer! As you can guess/see we had so much trouble finding people. I am so pleased.
Hi everyone,
Thanks so much for your patience and for inviting me to review your paper. I enjoyed reading it and going through your code 😊
I think this journal (and the contributions) are extremely valuable (but yet often undervalued), so thanks to everyone involved in the process of running it!
I'm listing my feedback and suggestions below. I reviewed the paper and the code listed here, as the latest commit dates indicate that changes were made after the first review. If you don't agree with any of my points, feel free to explain why, and please also let me know if anything is unclear.
The guide in the README was easily accessible and I could execute the code successfully.
However, I ran into two challenges: 1) the storage capacity, and 2) the execution time of the script. I would add in the README that you need a minimum of ca. 40 GB of storage capacity to run the script, because the models are cached, and also add your system specifications. This might help future users to get a reference point. For instance, it took me around 1.75 hours to convert and execute figures.ipynb (I am working on a MacBook Pro with a 2 GHz Quad-Core Intel Core i5 and 16 GB RAM).
- The code is complete and executable. The description is sufficient to run the code successfully. At times, I would have wished for a bit more documentation. For instance, the docstring in plot_d123 only says """Plot figures D1, D2, D3.""" but doesn't provide more info - but this may be also a matter of taste.
- I am not sure why but I think Figure 3 (see first screenshot) differs significantly from the figure provided as a replication in your paper (see second screenshot).
[screenshot: Figure 3 as generated using the code provided (including code to understand which chunk generated the figure)]
[screenshot: Figure 3 in the paper]
The paper is easily readable and informative. I listed some suggestions for more clarification below.
- You mention that you used sklearn but also relied on statsmodel (during development) - how do the results differ? Why did you lean towards sklearn in the end?
It is mentioned in the comments above but since the paper cannot fully replicate the original study, I'm definitely supporting reaching out to the authors (once more and now also backed by the paper and the reviews) and getting their stance on the "unspecified and inconsistent parameters". I think with what you had at hand, it's still impressive to see how much of the original study you were able to replicate!
Just a general thought which doesn't have to be addressed anywhere: Reardon et al. (2018) used Stata in their original paper, and it would be interesting to know whether the programming language used (and essentially the way the algorithms are written) could also have had an effect on the differing results.
@cosimameyer amazing; thank you!
Thank you for your detailed feedback! I've updated the code and paper. Below is a quick answer to every point.
Repository
> The guide in the README was easily accessible and I could execute the code successfully. However, I ran into two challenges: 1) the storage capacity, and 2) the execution time of the script. I would add in the README that you need a minimum of ca. 40 GB of storage capacity to run the script because the models are cached and also add your system specifications. This might help future users to get a reference point. For instance, it took me around 1.75 hours to convert and execute figures.ipynb (I am working on a MacBook Pro with a 2 GHz Quad-Core Intel Core i5 and 16 GB RAM).
Sorry for that!
I improved the caching with manual storage as parquet in a human-readable path instead of using joblib.
The cache now takes ~ 10 GB.
I have also added a script pre_run.py which can create all required data with parallel processing.
I have added an indication of the disk and time requirements in the README.
On my laptop (AMD Ryzen 7 PRO 6850U) generating the data with 15 threads takes 30 GB of memory and 20 minutes.
Plotting the figures still requires some processing and takes 20 minutes (once the data is generated).
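The caching scheme can be sketched like this (hypothetical helper, not the repository's exact code): each run is stored as a parquet file under a human-readable, parameter-derived path, so repeated calls load from disk instead of recomputing.

```python
from pathlib import Path
import pandas as pd

CACHE_DIR = Path("cache")

def cached_run(simulate, **params):
    """Load the run for `params` from its parquet file, or compute and store it."""
    name = "__".join(f"{k}={v}" for k, v in sorted(params.items()))
    path = CACHE_DIR / f"{name}.parquet"
    if path.exists():
        return pd.read_parquet(path)
    result = simulate(**params)  # expensive simulation returning a DataFrame
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    result.to_parquet(path)
    return result
```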
Code
> - The code is complete and executable. The description is sufficient to run the code successfully. At times, I would have wished for a bit more documentation. For instance, the docstring in plot_d123 only says """Plot figures D1, D2, D3.""" but doesn't provide more info - but this may be also a matter of taste.
The docstring for plot_d123 is now a bit more explicit.
Notebook
> - I found it at times a bit confusing which figures relate to the respective figure in the paper. But as a general recommendation, I would name/enumerate the figures consistently and separate the notebook into "Figures in the main paper" and "Figures in the appendix" (or something similar).
I fixed the numbering to indicate whether figures are in the main paper or in the appendix. Instead of Figure [X] everywhere (which was a mistake), figures are now numbered as they are in our paper, with the appendix letter prefixed when appropriate. I kept the same order since this way figures are grouped by type, but I can split it as main/appendix if you prefer?
> - I am not sure why but I think Figure 3 (see first screenshot) differs significantly from the figure provided as a replication in your paper (see second screenshot).
> [screenshot: Figure 3 as generated using the code provided (including code to understand which chunk generated the figure)]
> [screenshot: Figure 3 in the paper]
This is related to your comment about fixing the numbering. Figure 3 in the paper is actually the second figure in the notebook (named C2 in the notebook and original paper), whereas the third figure in the notebook (named C3) is Figure 15 in Appendix B. Your first screenshot (notebook) is Figure 15 (C3) and not Figure 3 (C2). This was indeed confusing, sorry. I hope fixing the numbering in the notebook also fixes the confusion?
Paper
> The paper is easily readable and informative. I listed some suggestions for more clarification below.
Introduction
> - You mention in the beginning of your paper on page 1, that "while others were used inconsistently along the paper–such as affirmative actions." Could you specify in another short sentence (with an example) how they were used inconsistently?

I added an example.
Model of college admission
> - I would add an additional sentence making clear that the data are fully simulated - both in the original paper and in your replication. This may help a reader less familiar with agent-based simulation models to understand where the data come from.

I added a sentence.
> - If possible, it is good to add a comparison of the distribution of the replication and the original data to better understand whether they are comparable.
Without access to the original data I am not certain how to compare the distributions. Am I misunderstanding your comment?
Method
> - You mention that you used sklearn but also relied on statsmodel (during development) - how do the results differ? Why did you lean towards sklearn in the end?
I added a footnote.
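For reference, the kind of cross-check behind that footnote, as a self-contained sketch (our illustration, not the paper's code). One known difference worth keeping in mind: scikit-learn's LogisticRegression applies L2 regularization by default, whereas statsmodels' Logit fits the unregularized maximum likelihood.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (rng.random(500) < 1 / (1 + np.exp(-(0.5 + 2.0 * x)))).astype(int)

# penalty=None (scikit-learn >= 1.2; "none" on older versions) disables
# the default L2 penalty so both fits target the same likelihood.
sk = LogisticRegression(penalty=None).fit(x.reshape(-1, 1), y)
ml = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(sk.intercept_, sk.coef_)  # both should be close to (0.5, 2.0)
print(ml.params)
```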
Results
> - Figure 2: While I agree that there's success in the replication, it is interesting to see that there seem to be reverse trends in the beginning (≤ 3 years)
I added the comment and a potential explanation.
> - Figure 6: You mention that "[i]t is unspecified however whether it captures students who do not enroll while having applied, being admitted or without condition. We use the latter." (p. 6) - what happens if you use the former? Do the results change?
Looking only at applications gives similar results to looking at all students. Looking only at admissions gives values bigger than in the original figure. I am not certain it is an interesting experiment, but I got the following data, if you think I should add it to the paper? It uses the same data as Figure 6 (left side, but the right side is identical), with the original having both arrows at ach ≈ 800, %res ≈ 0.6.
| filter | ach | %race | %res |
|---|---|---|---|
| any | 990.010698 | 0.353296 | 0.598398 |
| app | 979.329194 | 0.358160 | 0.591767 |
| adm | 1177.594869 | 0.167540 | 0.397116 |
> - Figure 7: It is not easy to compare Figure 7 with the original because of the different structure. Would it be possible to arrange it similarly to the original paper?
I edited the original figures to combine them (they were split across two pages) and played with the size to align them better (with a focus on aligning y=1).
> - On a general note: To be better able to compare both the replication and the original results, it might be worth the effort to add similar scales (and ticks) on the axes and to place the figures on a similar level. I know this can be tricky but it'll definitely help the reader to easily spot differences (and similarities).
I've tried to do that with the "simple" customization available through matplotlib and seaborn, but this is really tricky. Since I only have the data for some of the original figures (8, 9, 10, 11), I cannot reproduce the others with the libraries I used; that would have been a lot simpler than reproducing the look of Stata plots…
Additional Figures
> - I like that you add additional figures but couldn't find references to them in the text (except Figure 14 which is mentioned in Section 4). I think it might be good to spend words describing the figures (and what their interpretation means) and link to them in the text (as done with Figure 14)
I've added references to the appendix figures across the Results section. Figures 12 and 13 are already referenced in the Difficulties section (SES-based affirmative action).
General remarks
> It is mentioned in the comments above but since the paper cannot fully replicate the original study, I'm definitely supporting reaching out to the authors (once more and now also backed by the paper and the reviews) and getting their stance on the "unspecified and inconsistent parameters". I think with what you had at hand, it's still impressive to see how much of the original study you were able to replicate!
>
> Just a general thought which doesn't have to be addressed anywhere: Reardon et al. (2018) used Stata in their original paper, and it would be interesting to know whether the programming language used (and essentially the way the algorithms are written) could also have had an effect on the differing results.
That is an interesting question and something I thought about for the logistic regression part. The Stata documentation gives few details compared to the parameters offered by scikit-learn. Since the impact of the logit parameters was limited (mainly in the first years, used as a bootstrap only), I focused on speed and have not investigated the replication side further.
Small remarks
- "[...], which leads us to believe" (instead of "lead" on page 6 in the paragraph where you describe Figure 4)
Thank you for catching that!
Thanks so much for your responses - I really appreciate them! I will reply to your questions in more depth next week when I’m back from travelling 🙂
Thanks so much for your patience; getting back to this took longer than expected. Thanks again for carefully answering my questions - I highly appreciate it.
Please find my replies to open questions below - I'm happy to recommend the replication for publication :)
> I fixed the numbering to indicate whether figures are in the main paper or in the appendix. Instead of Figure [X] everywhere (which was a mistake), figures are now numbered as they are in our paper, with the appendix letter prefixed when appropriate. I kept the same order since this way figures are grouped by type, but I can split it as main/appendix if you prefer?
I'd personally prefer a split into "main" and "appendix" (it logically makes more sense to me) but will leave the decision to you/the editor - I also see reasons why the current order makes sense.
> This is related to your comment about fixing the numbering. Figure 3 in the paper is actually the second figure in the notebook (named C2 in the notebook and original paper), whereas the third figure in the notebook (named C3) is Figure 15 in Appendix B. Your first screenshot (notebook) is Figure 15 (C3) and not Figure 3 (C2). This was indeed confusing, sorry. I hope fixing the numbering in the notebook also fixes the confusion?
It does, thanks!
> Looking only at applications gives similar results to looking at all students. Looking only at admissions gives values bigger than in the original figure. I am not certain it is an interesting experiment, but I got the following data, if you think I should add it to the paper? It uses the same data as Figure 6 (left side, but the right side is identical), with the original having both arrows at ach ≈ 800, %res ≈ 0.6.
Thanks for taking the effort - I think it would be a great fit for the appendix!
Thank you @cosimameyer for the answer :) I modified the notebook to split between main and appendices. I also added the code used to produce the table from my previous reply, and an appendix (briefly) presenting those results. (Besides that, I added an acknowledgments section.)
@rougier @oliviaguest Should I modify the title to [~Re] or [∂Re] as you discussed?
@lbeziaud @oliviaguest Sorry for the delay in answering. Since I did not follow the submission from start to end, could you remind me of the status of the replication (full, partial, or negative)?
@rougier I would go with partial rather than negative. Most of the model is successfully replicated, but we are not successful regarding the actual affirmative action part (which is the subject of the paper). However, the trends are similar and the conclusions are not invalidated.
So I guess we can go with just [Re]. @oliviaguest, is this ready to be published? I can handle this part if you want.
Thank you. I really appreciate that @rougier. I don't know which label fits. What do we all think?
Original article: Reardon, Sean F., et al. "What Levels of Racial Diversity Can Be Achieved with Socioeconomic‐Based Affirmative Action? Evidence from a Simulation Model." Journal of Policy Analysis and Management 37.3 (2018): 630-657. https://doi.org/10.1002/pam.22056
PDF URL: https://github.com/lbeziaud/re-reardon2018/blob/master/article.pdf
Metadata URL: https://github.com/lbeziaud/re-reardon2018/blob/master/metadata.yaml
Code URL: https://github.com/lbeziaud/mosaic
Scientific domain: Education Policy
Programming language: Python
Suggested editor: