Closed vkhodygo closed 2 years ago
@dkopasker
Could you please clarify how to calculate prevalence in this case? Also, I'd like to know how to format the output file, the example you provided is a bit confusing.
Prevalence is the proportion of the sample with a specific characteristic. If the characteristic is defined by a dummy variable, as is the case here, then the mean of the dummy variable gives the prevalence.
Regarding the format, I provided a csv example that I made in Stata. This was provided with the expectation that you would try to reproduce it using more efficient code. A separate file of column descriptions was provided to assist in this. Please let me know if there is anything specific that remains confusing. Equally, if you think a different format would be useful, I am open to suggestions.
@dkopasker
dummy variable, as is the case here
You say that
(dummy=1 if dhm<=24)
does it mean automatically
(dummy=0 if dhm>24)
I.e., you transform a continuous variable into a boolean one (sort of). I'm familiar with this kind of transformations, I just want to be sure I understand everything correctly.
Almost. The dummy could be missing if dhm is missing. Stata treats missing as infinity, so dummy=0 if dhm>24 & dhm!=. would be needed. This may not be the case with other software.
Fair enough, I keep forgetting about missing values all the time. So, the approach is to interpret all missing values as 0/null/false etc.?
One more thing, you mentioned employed as a separate group, but for them prevalence of employment makes no sense. What do I miss here?
The approach is to treat any variable determined by a missing variable as also being missing. Zero and false are not usually missing values. Null usually is.
You are correct that prevalence of employment will not vary in the employed group. Likewise, there are poverty groups where the prevalence of poverty will not vary. However, it was more efficient to run a loop covering all outcomes for all groups. This also provides a check that things are working as they should.
Fair enough.
I went through the data columns now to load only what is needed. It looks like out_poverty
is missing for some reason.
Where is out_poverty missing?
Both grp_emp
(just noticed) and out_poverty
columns are missing in one of the CSV files I converted from the .dta data.
> colnames(df)
[1] "V1" "run" "time" "id_household"
[5] "id_benefitunit" "id_female" "id_male" "id_person"
[9] "id_father" "id_mother" "id_original" "id_partner"
[13] "hh_dwt" "hh_size" "atriskofpoverty" "dhh_owned"
[17] "dhhtp_c4" "disposableincomemonthly" "equivaliseddisposableincomeyearl" "n_children_0"
[21] "n_children_1" "n_children_10" "n_children_11" "n_children_12"
[25] "n_children_13" "n_children_14" "n_children_15" "n_children_16"
[29] "n_children_17" "n_children_2" "n_children_3" "n_children_4"
[33] "n_children_5" "n_children_6" "n_children_7" "n_children_8"
[37] "n_children_9" "occupancy" "region" "bu_size"
[41] "ydses_c5" "adultchildflag" "dag" "dcpagdf"
[45] "dcpen" "dcpex" "dcpst" "dcpyy"
[49] "ded" "deh_c3" "dehf_c3" "dehm_c3"
[53] "dehsp_c3" "der" "dgn" "dhe"
[57] "dhesp" "dhm" "dlltsd" "inversemillsratiomaxfemale"
[61] "inversemillsratiomaxmale" "inversemillsratiominfemale" "inversemillsratiominmale" "laboursupplyweekly"
[65] "les_c4" "les_c7_covid" "lesdf_c4" "lessp_c4"
[69] "potentialearnings" "sindex" "sindexnormalised" "weight"
[73] "yearlyequivalisedconsumption" "ynbcpdf_dv" "yplgrs_dv" "ypnbihs_dv"
[77] "ypncp" "ypnoab" "yptciihs_dv" "scaling_factor"
I see. These outcomes must be constructed based on the rules defined at the top of this page. I have used dummy= for three outcomes to show they are dummy variables. Dummy is not the variable name, these would be out_ghqcase, out_emp, and out_poverty. There is an additional suffix to the variable name to indicate if the outcome relates to the baseline or reform arm of the scenario.
The other outcomes, those that are not dummy variables, have the variable names out_ghq, out_emphrs, and out_income. Again with the suffix attached.
Rules to construct groups were also provided, but not the variable names for the output. All group variable have the prefix grp and should be self-explanatory.
I get it, I'll fix the names tomorrow. Meanwhile, you can take a look at the first version here: 28c8ef3749dc955653e29640ebecf5e6de5bc28f. The path to the data file within the project directory should be data/S1/baseline.csv.gz
, however, this part of the code can be edited when needed.
It's a bit messy, no tests, no different scenarios at the moment, but it can produce most of the results and it's performant enough (takes about two minutes to get it done).
That seems a good start. More comments in the code would be helpful as I am less familiar wiith R.
More comments in the code would be helpful due as I am less familiar with R.
Noted, that's of great help for everyone since my R skills are rusty as well.
@dkopasker I finally have a version that calculates the metrics for the whole population, modifying it for other cases should not be that difficult. However, I never asked you about the definition of rank.
Rank numbers the observations for each outcome from highest to lowest. Equal observations are assigned the average rank. For example, rank_ghq_base orders observations on out_ghq_base from 1 to N. Ranking must be done after collapsing to one observation per run per year. N is the number of runs. The rank will be used to define percentiles relevant to a 95% confidence interval.
Good to know, I'll add this to the code.
See 62f28ba552488b834b4212f8c486c9c7e2864290 for the latest version, no ranks and no effects at the moment, but all groups are now included. This takes about 90 seconds to get done on my machine with the S1
data sample, should be fine to try it on yours.
A 115GB file of output from 1,000 runs of 1 arm of the COVID scenario is now stored on the T drive. Compressing the file was predicted to take 11 hours, but copying took only 20 minutes. We can see how long this file takes to run on our laptops, but getting R installed on the "machine up the stairs" is likely to be beneficial.
The latest commit provides rank support now. It is possible for you to run this script on your machine, 16GB of RAM is enough for 50 runs and 1 scenario. I think, even 8GB should do the trick.
The code itself is messy, it'd be great to make it more compact and clear. No proper docs at the moment, but there are comments everywhere.
We have 6 outcomes, some of which must be constructed from the LABSim output:
dhm
)dhm <= 24
)les_c4 == "EmployedOrSelfEmployed"
)laboursupplyweekly
to numerical)equivaliseddisposableincomeyearl
)atriskofpoverty == 1 || atriskofpoverty == null
)We construct 13 groups, and also include results for the whole population:
dgn == "Male"
)dgn == "Female"
)dag >= 25 && dag < 45
)dag >= 45 && dag < 65
)n_children_1-17 != 0
)n_children_1-17 == 0
or missing)les_c4 == "EmployedOrSelfEmployed"
)les_c4 == "NotEmployed"
)grp_emp == 1 && out_poverty == 1
)grp_emp == 0 && out_poverty == 1
)deh_c3 == "Low"
)deh_c3 == "Medium"
)deh_c3 == "High"
)We want:
Table structure: