MRC-CSO-SPHSU / effect_estimates

GNU General Public License v3.0
1 stars 1 forks source link

Main tasks #1

Closed vkhodygo closed 2 years ago

vkhodygo commented 2 years ago

We have 6 outcomes, some of which must be constructed from the LABSim output:

  1. mean of continuous GHQ12 score (variable dhm)
  2. prevalence (mean of a dummy) of GHQ12 caseness (dummy=1 if dhm <= 24)
  3. prevalence (mean of a dummy) of employment (dummy=1 if les_c4 == "EmployedOrSelfEmployed")
  4. mean hours worked (convert laboursupplyweekly to numerical)
  5. mean household income (variable equivaliseddisposableincomeyearl)
  6. prevalence (mean of a dummy) of poverty (dummy =1 if atriskofpoverty == 1 || atriskofpoverty == null)

We construct 13 groups, and also include results for the whole population:

  1. male (dgn == "Male")
  2. female (dgn == "Female")
  3. age 25-44 (dag >= 25 && dag < 45)
  4. age 45-64 (dag >= 45 && dag < 65)
  5. household with children (one of n_children_1-17 != 0)
  6. household without children (all of n_children_1-17 == 0 or missing)
  7. employed (les_c4 == "EmployedOrSelfEmployed")
  8. unemployed (les_c4 == "NotEmployed")
  9. in-work poverty (grp_emp == 1 && out_poverty == 1)
  10. out-of-work poverty (grp_emp == 0 && out_poverty == 1)
  11. Low education (deh_c3 == "Low")
  12. medium education (deh_c3 == "Medium")
  13. high education (deh_c3 == "High")

We want:

  1. aggregate mean outcomes for the baseline and reform, in each year, and year-run pairs.
  2. Effect estimates (difference in aggregate outcomes between the baseline and reform), again for each year, and each year-run pair.
  3. Ranking of outcomes and effect estimates, again for each year, and each year-run pair.

Table structure:

scenario run time grp_all grp_male grp_female grp_age25 grp_age45 grp_hchild grp_nchild grp_emp grp_unemp grp_povin grp_povout grp_edlow grp_edmed grp_edhi out_ghq_base out_ghq_ref eff_ghq rank_ghq_base rank_ghq_ref rank_eff_ghq out_ghqcase_base out_ghqcase_ref eff_ghqcase rank_ghqcase_base rank_ghqcase_ref rank_eff_ghqcase out_emp_base out_emp_ref eff_emp rank_emp_base rank_emp_ref rank_eff_emp out_emphrs_base out_emphrs_ref eff_emphrs rank_emphrs_base rank_emphrs_ref rank_eff_emphrs out_income_base out_income_ref eff_income rank_income_base rank_income_ref rank_eff_income out_poverty_base out_poverty_ref eff_emp rank_poverty_base rank_poverty_ref rank_eff_poverty
scenario number or description the run number, missing for overall results combining all runs this is the year dummy variable (1 if the results relate to the whole population, zero otherwise) dummy variable (1 if the results relate to the male population, zero otherwise) dummy variable (1 if the results relate to the female population, zero otherwise) dummy variable (1 if the results relate to the age 25-44 population, zero otherwise) dummy variable (1 if the results relate to the age 45-64 population, zero otherwise) dummy variable (1 if the results relate to households with children in population, zero otherwise) dummy variable (1 if the results relate to households without children in population, zero otherwise) dummy variable (1 if the results relate to the employed population, zero otherwise) dummy variable (1 if the results relate to the unemployed population, zero otherwise) dummy variable (1 if the results relate to the in-work poverty population, zero otherwise) dummy variable (1 if the results relate to the out-of-work poverty population, zero otherwise) dummy variable (1 if the results relate to the low education population, zero otherwise) dummy variable (1 if the results relate to the medium education population, zero otherwise) dummy variable (1 if the results relate to the high education population, zero otherwise) mean of continuous GHQ12 score for baseline mean of continuous GHQ12 score for the reform effect of reform on continuous GHQ12 score for this run and year (out_ghq_base minus out_ghq_reform) rank of out_ghq_base for group and year rank of out_ghq_ref for group and year rank of eff_ghq for group and year mean of dummy GHQ12 caseness for baseline mean of dummy GHQ12 caseness for the reform effect of reform on dummy GHQ12 caseness for this run and year (out_ghqcase_base minus out_ghqcase_reform) rank of out_ghqcase_base for group and year rank of out_ghqcase_ref for group and year rank of eff_ghqcase for group and year mean of employment dummy for baseline mean of employment dummy for the reform effect of reform on employment dummy for this run and year (out_emp_base minus out_emp_reform) rank of out_emp_base for group and year rank of out_emp_ref for group and year rank of eff_emp for group and year mean of employment hours for baseline mean of employment hours for the reform effect of reform on employment hours for this run and year (out_emp_base minus out_emp_reform) rank of out_emphrs_base for group and year rank of out_emphrs_ref for group and year rank of eff_emphrs for group and year mean of income for baseline mean of income for the reform effect of reform on income for this run and year (out_income_base minus out_income_reform) rank of out_income_base for group and year rank of out_income_ref for group and year rank of eff_income for group and year mean of poverty dummy for baseline mean of poverty dummy for the reform effect of reform on poverty dummy for this run and year (out_poverty_base minus out_poverty_reform) rank of out_poverty_base for group and year rank of out_poverty_ref for group and year rank of eff_poverty for group and year
vkhodygo commented 2 years ago

@dkopasker

Could you please clarify how to calculate prevalence in this case? Also, I'd like to know how to format the output file, the example you provided is a bit confusing.

dkopasker commented 2 years ago

Prevalence is the proportion of the sample with a specific characteristic. If the characteristic is defined by a dummy variable, as is the case here, then the mean of the dummy variable gives the prevalence.

Regarding the format, I provided a csv example that I made in Stata. This was provided with the expectation that you would try to reproduce it using more efficient code. A separate file of column descriptions was provided to assist in this. Please let me know if there is anything specific that remains confusing. Equally, if you think a different format would be useful, I am open to suggestions.

vkhodygo commented 2 years ago

@dkopasker

dummy variable, as is the case here

You say that

(dummy=1 if dhm<=24)

does it mean automatically

(dummy=0 if dhm>24)

I.e., you transform a continuous variable into a boolean one (sort of). I'm familiar with this kind of transformations, I just want to be sure I understand everything correctly.

dkopasker commented 2 years ago

Almost. The dummy could be missing if dhm is missing. Stata treats missing as infinity, so dummy=0 if dhm>24 & dhm!=. would be needed. This may not be the case with other software.

vkhodygo commented 2 years ago

Fair enough, I keep forgetting about missing values all the time. So, the approach is to interpret all missing values as 0/null/false etc.?

One more thing, you mentioned employed as a separate group, but for them prevalence of employment makes no sense. What do I miss here?

dkopasker commented 2 years ago

The approach is to treat any variable determined by a missing variable as also being missing. Zero and false are not usually missing values. Null usually is.

You are correct that prevalence of employment will not vary in the employed group. Likewise, there are poverty groups where the prevalence of poverty will not vary. However, it was more efficient to run a loop covering all outcomes for all groups. This also provides a check that things are working as they should.

vkhodygo commented 2 years ago

Fair enough.

I went through the data columns now to load only what is needed. It looks like out_poverty is missing for some reason.

dkopasker commented 2 years ago

Where is out_poverty missing?

vkhodygo commented 2 years ago

Both grp_emp (just noticed) and out_poverty columns are missing in one of the CSV files I converted from the .dta data.

> colnames(df)
 [1] "V1"                               "run"                              "time"                             "id_household"                    
 [5] "id_benefitunit"                   "id_female"                        "id_male"                          "id_person"                       
 [9] "id_father"                        "id_mother"                        "id_original"                      "id_partner"                      
[13] "hh_dwt"                           "hh_size"                          "atriskofpoverty"                  "dhh_owned"                       
[17] "dhhtp_c4"                         "disposableincomemonthly"          "equivaliseddisposableincomeyearl" "n_children_0"                    
[21] "n_children_1"                     "n_children_10"                    "n_children_11"                    "n_children_12"                   
[25] "n_children_13"                    "n_children_14"                    "n_children_15"                    "n_children_16"                   
[29] "n_children_17"                    "n_children_2"                     "n_children_3"                     "n_children_4"                    
[33] "n_children_5"                     "n_children_6"                     "n_children_7"                     "n_children_8"                    
[37] "n_children_9"                     "occupancy"                        "region"                           "bu_size"                         
[41] "ydses_c5"                         "adultchildflag"                   "dag"                              "dcpagdf"                         
[45] "dcpen"                            "dcpex"                            "dcpst"                            "dcpyy"                           
[49] "ded"                              "deh_c3"                           "dehf_c3"                          "dehm_c3"                         
[53] "dehsp_c3"                         "der"                              "dgn"                              "dhe"                             
[57] "dhesp"                            "dhm"                              "dlltsd"                           "inversemillsratiomaxfemale"      
[61] "inversemillsratiomaxmale"         "inversemillsratiominfemale"       "inversemillsratiominmale"         "laboursupplyweekly"              
[65] "les_c4"                           "les_c7_covid"                     "lesdf_c4"                         "lessp_c4"                        
[69] "potentialearnings"                "sindex"                           "sindexnormalised"                 "weight"                          
[73] "yearlyequivalisedconsumption"     "ynbcpdf_dv"                       "yplgrs_dv"                        "ypnbihs_dv"                      
[77] "ypncp"                            "ypnoab"                           "yptciihs_dv"                      "scaling_factor"     
dkopasker commented 2 years ago

I see. These outcomes must be constructed based on the rules defined at the top of this page. I have used dummy= for three outcomes to show they are dummy variables. Dummy is not the variable name, these would be out_ghqcase, out_emp, and out_poverty. There is an additional suffix to the variable name to indicate if the outcome relates to the baseline or reform arm of the scenario.

The other outcomes, those that are not dummy variables, have the variable names out_ghq, out_emphrs, and out_income. Again with the suffix attached.

Rules to construct groups were also provided, but not the variable names for the output. All group variable have the prefix grp and should be self-explanatory.

vkhodygo commented 2 years ago

I get it, I'll fix the names tomorrow. Meanwhile, you can take a look at the first version here: 28c8ef3749dc955653e29640ebecf5e6de5bc28f. The path to the data file within the project directory should be data/S1/baseline.csv.gz, however, this part of the code can be edited when needed.

It's a bit messy, no tests, no different scenarios at the moment, but it can produce most of the results and it's performant enough (takes about two minutes to get it done).

dkopasker commented 2 years ago

That seems a good start. More comments in the code would be helpful as I am less familiar wiith R.

vkhodygo commented 2 years ago

More comments in the code would be helpful due as I am less familiar with R.

Noted, that's of great help for everyone since my R skills are rusty as well.

vkhodygo commented 2 years ago

@dkopasker I finally have a version that calculates the metrics for the whole population, modifying it for other cases should not be that difficult. However, I never asked you about the definition of rank.

dkopasker commented 2 years ago

Rank numbers the observations for each outcome from highest to lowest. Equal observations are assigned the average rank. For example, rank_ghq_base orders observations on out_ghq_base from 1 to N. Ranking must be done after collapsing to one observation per run per year. N is the number of runs. The rank will be used to define percentiles relevant to a 95% confidence interval.

vkhodygo commented 2 years ago

Good to know, I'll add this to the code.

See 62f28ba552488b834b4212f8c486c9c7e2864290 for the latest version, no ranks and no effects at the moment, but all groups are now included. This takes about 90 seconds to get done on my machine with the S1 data sample, should be fine to try it on yours.

dkopasker commented 2 years ago

A 115GB file of output from 1,000 runs of 1 arm of the COVID scenario is now stored on the T drive. Compressing the file was predicted to take 11 hours, but copying took only 20 minutes. We can see how long this file takes to run on our laptops, but getting R installed on the "machine up the stairs" is likely to be beneficial.

vkhodygo commented 2 years ago

The latest commit provides rank support now. It is possible for you to run this script on your machine, 16GB of RAM is enough for 50 runs and 1 scenario. I think, even 8GB should do the trick.

The code itself is messy, it'd be great to make it more compact and clear. No proper docs at the moment, but there are comments everywhere.