TSSlade / unesco_equity

Workspace for Crouch, Slade, et al paper for UNESCO re Equity in Int'l Assessments

Excel output disagreeing with Stata summary #18

Closed ColeCampton closed 4 years ago

ColeCampton commented 4 years ago

I noticed that some outputs from the new data sets had missing mean ORF values for certain subpopulations. When I investigated, it seems that the ORF means Stata reports don't agree with those in the output spreadsheet. For example:

    use "$primr_src", clear
    summarize eq_orf if treat_phase==1 & grade==1 & cohort==1

reports a mean of 5.35. However, the mean listed in the output spreadsheet is 6.778, which is presented as 6.8 in Table 2 of CS.

I have yet to identify the cause of this problem. @TSSlade

TSSlade commented 4 years ago

The command you've pasted there is not invoking the survey weights. Did you accidentally elide that from this issue, or is it possible that you're running the mean commands on naïve/unweighted data?
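For reference, a minimal sketch of the naive call next to two weighted ones. It assumes the sampling weight is stored in a variable called `wt_final`, which is a hypothetical name (substitute whatever the dataset actually uses), and it ignores any strata or PSU settings the real survey design may require:

```stata
use "$primr_src", clear

* Naive/unweighted mean, as in the command pasted above
summarize eq_orf if treat_phase==1 & grade==1 & cohort==1

* Weighted mean via analytic weights
* (wt_final is a hypothetical weight variable name)
summarize eq_orf [aweight = wt_final] if treat_phase==1 & grade==1 & cohort==1

* Weighted mean via the survey prefix; this svyset is an assumption and
* declares only the weight, not the full sampling design
svyset [pweight = wt_final]
svy, subpop(if treat_phase==1 & grade==1 & cohort==1): mean eq_orf
```

Given the same weights, the two weighted calls should agree on the point estimate; only the standard errors depend on the declared design.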

ColeCampton commented 4 years ago

Ah okay, I did forget the weights. Let me reassess.

ColeCampton commented 4 years ago

global DRC_src = "....\unesco_equity\data\d.PUF_DRC_Baseline_Endline Grade 2-4-6 French Sample A\PUF_3.DRC2010_2014-Baseline_Endline_grade2-4-6_EGRA-EGMA_French_SampleA.dta"

use "$DRC_src", clear

summarize orf if treat_phase ==6 & grade ==2

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             orf |          0

summarize orf if treat_phase ==6 & grade ==4

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
             orf |      1,725    7.878551    13.91502          0        141

The output from running my DRC preprocessing do-file reports 32.465 for the first and empty for the second. Admittedly, this is likely an issue with how I am using 00_apply_analysis.do.

TSSlade commented 4 years ago

What's the count of observations in each of those subpops? (Esp. the first one)

ColeCampton commented 4 years ago

1795 and 1745 respectively

TSSlade commented 4 years ago

Which disagrees with the output of the summary command above... That is seriously weird. Maybe we need a screenshare to investigate.

ColeCampton commented 4 years ago

Yes. However, `count if treat_phase==6 & grade==4 & !missing(orf)` returns 1725, which does agree. It seems there are many missing orf values. The confusing part was that the missing values were not where I expected them in the Excel sheet. I would be happy to screenshare tomorrow and see what we can find.
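A quick way to see where the missing values sit, sketched with only the variables named in this thread (the helper variable `orf_missing` is introduced here for illustration and is not part of the dataset):

```stata
use "$DRC_src", clear

* Total observations vs. observations with missing orf, per subpopulation
count if treat_phase==6 & grade==2
count if treat_phase==6 & grade==2 & missing(orf)
count if treat_phase==6 & grade==4
count if treat_phase==6 & grade==4 & missing(orf)

* Cross-tabulate missingness by grade within the phase
* (orf_missing is a helper flag created just for this check)
generate byte orf_missing = missing(orf)
tabulate grade orf_missing if treat_phase==6
```

A subpopulation whose orf values are all missing will show every observation in the `orf_missing == 1` column, which is the case where `summarize` reports 0 observations.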

ColeCampton commented 4 years ago

The error ended up being that numerical indexing into the subpopulation summary statistic matrix produced unexpected results when some subpopulations had missing orf data. The fix was to index by feature name and to populate variables as empty when the orf data is missing.
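The general pattern can be sketched as follows. This is not the code from 00_apply_analysis.do: the matrix `stats`, its row and column labels, and its values are all made up for illustration.

```stata
* Hypothetical 2x2 matrix of subpopulation statistics; the values are
* invented for illustration and do not come from the DRC data
matrix stats = (10, 100 \ 20, 200)
matrix rownames stats = grade2 grade4
matrix colnames stats = mean n

* Positional lookup: points at a different subpopulation as soon as a
* row is absent from the matrix
scalar by_pos = stats[2, 1]

* Name-based lookup: rownumb() and colnumb() resolve labels to positions
scalar by_name = stats[rownumb(stats, "grade4"), colnumb(stats, "mean")]

* Guard for a subpopulation that is not in the matrix at all:
* rownumb() returns missing when the name is not found
local r = rownumb(stats, "grade6")
if missing(`r') {
    scalar g6_mean = .
}
else {
    scalar g6_mean = stats[`r', colnumb(stats, "mean")]
}
```

Looking rows up by label means a dropped or empty subpopulation yields an explicit missing value instead of shifting every later row by one position.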