R4EPI / epidict

Epidemiology data dictionaries and random data generators
https://r4epi.github.io/epidict/
GNU General Public License v3.0

epidict

Lifecycle: experimental

The goal of {epidict} is to provide standardised data dictionaries for use in epidemiological data analysis templates. It currently supports standardised dictionaries from MSF OCA. This is a product of the R4EPIs project; learn more at https://r4epis.netlify.com

Installation

You can install {epidict} from CRAN:

install.packages("epidict")
If there is a bugfix or feature that is not yet on CRAN, you can install it via the {drat} package. You can also install the in-development version from GitHub using the {remotes} package (but there’s no guarantee that it will be stable):

``` r
# install.packages("remotes")
remotes::install_github("R4EPI/epidict")
```

Accessing dictionaries

There are four MSF outbreak dictionaries available in {epidict}, based on DHIS2 exports: "AJS", "Cholera", "Measles", and "Meningitis".

You can read more about the outbreak dictionaries at https://r4epis.netlify.com/outbreaks

A dictionary can be obtained via the msf_dict() function, which returns a table with one row per recorded variable (data_element_shortname) and, for categorical variables, the possible options:

``` r
library("epidict")
msf_dict("AJS")
#> # A tibble: 68 × 8
#>    data_element_uid data_elem…¹ data_…² data_…³ data_…⁴ data_…⁵ used_…⁶ options
#>  1 NA egen_044_e… event_… Is the… TEXT HgdrO8…
#>  2 NA egen_001_p… case_n… Anonym… TEXT
#>  3 NA egen_004_d… date_o… Date p… DATE
#>  4 NA egen_022_d… detect… How pa… TEXT BlfHX5…
#>  5 NA egen_005_p… patien… Patien… TEXT YNeOOp…
#>  6 NA egen_029_m… msf_in… How ex… TEXT PN5NWt…
#>  7 NA egen_008_a… age_ye… Age of… INTEGE…
#>  8 NA egen_009_a… age_mo… Age of… INTEGE…
#>  9 NA egen_010_a… age_da… Age of… INTEGE…
#> 10 NA egen_011_s… sex Sex of… TEXT orgc5Y…
#> # … with 58 more rows, and abbreviated variable names ¹data_element_name,
#> #   ²data_element_shortname, ³data_element_description,
#> #   ⁴data_element_valuetype, ⁵data_element_formname, ⁶used_optionset_uid
#> # ℹ Use `print(n = ...)` to see more rows

msf_dict("Cholera")
#> # A tibble: 45 × 8
#>    data_element_uid data_elem…¹ data_…² data_…³ data_…⁴ data_…⁵ used_…⁶ options
#>  1 AafTlSwliVQ egen_001_p… case_n… Anonym… TEXT Case n…
#>  2 OTGOtWBz39J egen_004_d… date_o… Date p… DATE Date o…
#>  3 wnmMr2V3T3u egen_006_p… patien… Locati… ORGANI… Patien…
#>  4 sbgqjeVwtb8 egen_008_a… age_ye… Age of… INTEGE… Age in…
#>  5 eXYhovYyl61 egen_009_a… age_mo… Age of… INTEGE… Age in…
#>  6 UrYJSk2Wp46 egen_010_a… age_da… Age of… INTEGE… Age in…
#>  7 D1Ky5K7pFN6 egen_011_s… sex Sex of… TEXT Sex orgc5Y…
#>  8 dTm5R53YYXC egen_012_p… pregna… Pregna… TEXT Pregna… IEjzG2…
#>  9 FF7d81Zy0yQ egen_013_p… trimes… If pre… TEXT Trimes… QjGHFN…
#> 10 vLAmA6Pmjip egen_014_p… foetus… If pre… TEXT Foetus… SR8Jtf…
#> # … with 35 more rows, and abbreviated variable names ¹data_element_name,
#> #   ²data_element_shortname, ³data_element_description,
#> #   ⁴data_element_valuetype, ⁵data_element_formname, ⁶used_optionset_uid
#> # ℹ Use `print(n = ...)` to see more rows

msf_dict("Measles")
#> # A tibble: 52 × 8
#>    data_element_uid data_elem…¹ data_…² data_…³ data_…⁴ data_…⁵ used_…⁶ options
#>  1 DE_EGEN_001 egen_001_p… case_n… Anonym… TEXT Case n…
#>  2 DE_EGEN_004 egen_004_d… date_o… Date p… DATE Date o…
#>  3 DE_EGEN_005 egen_005_p… patien… Patien… TEXT Patien… YNeOOp…
#>  4 DE_EGEN_006 egen_006_p… patien… Locati… ORGANI… Patien…
#>  5 DE_EGEN_008 egen_008_a… age_ye… Age of… INTEGE… Age in…
#>  6 DE_EGEN_009 egen_009_a… age_mo… Age of… INTEGE… Age in…
#>  7 DE_EGEN_010 egen_010_a… age_da… Age of… INTEGE… Age in…
#>  8 DE_EGEN_011 egen_011_s… sex Sex of… TEXT Sex orgc5Y…
#>  9 DE_EGEN_012 egen_012_p… pregna… Pregna… TEXT Pregna… IEjzG2…
#> 10 DE_EGEN_013 egen_013_p… trimes… If pre… TEXT Trimes… QjGHFN…
#> # … with 42 more rows, and abbreviated variable names ¹data_element_name,
#> #   ²data_element_shortname, ³data_element_description,
#> #   ⁴data_element_valuetype, ⁵data_element_formname, ⁶used_optionset_uid
#> # ℹ Use `print(n = ...)` to see more rows

msf_dict("Meningitis")
#> # A tibble: 53 × 8
#>    data_element_uid data_elem…¹ data_…² data_…³ data_…⁴ data_…⁵ used_…⁶ options
#>  1 AafTlSwliVQ egen_001_p… case_n… Anonym… TEXT Case n…
#>  2 OTGOtWBz39J egen_004_d… date_o… Date p… DATE Date o…
#>  3 udXAcFEE1dl egen_005_p… patien… Patien… TEXT Patien… YNeOOp…
#>  4 wnmMr2V3T3u egen_006_p… patien… Locati… ORGANI… Patien…
#>  5 sbgqjeVwtb8 egen_008_a… age_ye… Age of… INTEGE… Age in…
#>  6 eXYhovYyl61 egen_009_a… age_mo… Age of… INTEGE… Age in…
#>  7 UrYJSk2Wp46 egen_010_a… age_da… Age of… INTEGE… Age in…
#>  8 D1Ky5K7pFN6 egen_011_s… sex Sex of… TEXT Sex orgc5Y…
#>  9 ADfNqpCL5kf egen_015_e… exit_s… Final … TEXT Exit s… hO9TET…
#> 10 JZ8yqTow79G egen_016_d… date_o… Date p… DATE Exit d…
#> # … with 43 more rows, and abbreviated variable names ¹data_element_name,
#> #   ²data_element_shortname, ³data_element_description,
#> #   ⁴data_element_valuetype, ⁵data_element_formname, ⁶used_optionset_uid
#> # ℹ Use `print(n = ...)` to see more rows
```
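
To make the "variables in rows, options if categorical" layout concrete, here is a minimal base R sketch of looking up the options recorded for one variable. It uses a small toy table rather than a real {epidict} dictionary: the column names follow the long-format output of msf_dict() shown in this README, but the values and the options_for() helper are assumptions made for illustration.

``` r
# Toy stand-in for a long-format dictionary (hypothetical values; only the
# column names are taken from the msf_dict() output in this README).
dict_long <- data.frame(
  data_element_shortname = c("case_number", "sex", "sex", "sex"),
  data_element_valuetype = c("TEXT", "TEXT", "TEXT", "TEXT"),
  option_code            = c(NA, "M", "F", "U"),
  option_name            = c(NA, "Male", "Female", "Unknown"),
  stringsAsFactors       = FALSE
)

# Hypothetical helper (not part of {epidict}): return the code/label pairs
# recorded for one variable. Free-text variables have no options, so they
# yield zero rows.
options_for <- function(dict, variable) {
  rows <- dict$data_element_shortname == variable & !is.na(dict$option_code)
  dict[rows, c("option_code", "option_name")]
}

sex_options <- options_for(dict_long, "sex")
print(sex_options)
```

With a real dictionary, the same subsetting should apply to the table returned by msf_dict() in long format, provided the column names match those printed in this README.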

In addition, there are four MSF survey dictionaries available: "Mortality", "Nutrition", "Vaccination_long", and "Vaccination_short".

You can read more about the survey dictionaries at https://r4epis.netlify.com/surveys

These are accessible via msf_dict_survey(), where the variable names are stored in the name column. You can also read in your own Kobo (ODK) dictionaries by specifying template = FALSE and then setting name = <path to your .xlsx>.

``` r
msf_dict_survey("Mortality")
#> # A tibble: 174 × 15
#>    type name short…¹ label…² label…³ hint_…⁴ hint_…⁵ default relev…⁶ appea…⁷
#>  1 start start start Start …
#>  2 end end end End Ti…
#>  3 today today today Date o…
#>  4 device… devi… device… Phone …
#>  5 date date Date o… Date Date today()
#>  6 integer team… Team n… Team n… Numéro… numbers
#>  7 village vill… Villag… Villag… Nom du…
#>  8 text vill… Other … Specif… Autre,… ${vill…
#>  9 integer clus… Cluste… Cluste… Numéro… numbers
#> 10 integer hous… Househ… Househ… Numéro… numbers
#> # … with 164 more rows, 5 more variables: constraint, repeat_count,
#> #   calculation, value_type, options, and abbreviated
#> #   variable names ¹short_name, ²label_english, ³label_french, ⁴hint_english,
#> #   ⁵hint_french, ⁶relevant, ⁷appearance
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

msf_dict_survey("Nutrition")
#> # A tibble: 27 × 15
#>    type name short…¹ label…² label…³ hint_…⁴ hint_…⁵ repea…⁶ relev…⁷ calcu…⁸
#>  1 start start Start … NA
#>  2 end end End Ti… NA
#>  3 today today Date o… NA
#>  4 device… devi… Phone … NA
#>  5 date date Date Date NA
#>  6 integer team… Team n… Team n… NA
#>  7 village vill… Villag… Villag… Nom du… NA
#>  8 text vill… Other … Specif… Précis… ${vill… NA
#>  9 geopoi… vill… Villag… Villag… Locali… NA
#> 10 integer clus… Cluste… Cluste… Numéro… NA
#> # … with 17 more rows, 5 more variables: constraint, appearance,
#> #   default, value_type, options, and abbreviated variable
#> #   names ¹short_name, ²label_english, ³label_french, ⁴hint_english,
#> #   ⁵hint_french, ⁶repeat_count, ⁷relevant, ⁸calculation
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

msf_dict_survey("Vaccination_long")
#> # A tibble: 106 × 15
#>    type name short…¹ label…² label…³ hint_…⁴ hint_…⁵ default relev…⁶ appea…⁷
#>  1 start start Start … Start …
#>  2 end end End Ti… End Ti…
#>  3 today today Date o… Date o…
#>  4 device… devi… Phone … Phone …
#>  5 date date Date Date Date today()
#>  6 integer team… Team n… Team n… Numéro…
#>  7 village vill… Villag… Villag… Nom du…
#>  8 text vill… Other … Specif… Veuill… ${vill…
#>  9 integer clus… Cluste… Cluste… Numéro… numbers
#> 10 integer hous… Househ… Househ… Numéro…
#> # … with 96 more rows, 5 more variables: repeat_count, constraint,
#> #   calculation, value_type, options, and abbreviated
#> #   variable names ¹short_name, ²label_english, ³label_french, ⁴hint_english,
#> #   ⁵hint_french, ⁶relevant, ⁷appearance
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

msf_dict_survey("Vaccination_short")
#> # A tibble: 38 × 16
#>    type name short…¹ label…² label…³ hint_…⁴ hint_…⁵ default relev…⁶ appea…⁷
#>  1 start start Start … Start …
#>  2 end end End Ti… End Ti…
#>  3 today today Date o… Date o…
#>  4 device… devi… Phone … Phone …
#>  5 date date Date Date Date .today…
#>  6 integer team… Team n… Team n… Numéro…
#>  7 village vill… Villag… Villag… Nom du…
#>  8 text vill… Other … Specif… Veuill… ${vill…
#>  9 integer clus… Cluste… Cluste… Numéro… numbers
#> 10 integer hous… Househ… Househ… Numéro…
#> # … with 28 more rows, 6 more variables: repeat_count, constraint,
#> #   calculation, hxl, value_type, options, and
#> #   abbreviated variable names ¹short_name, ²label_english, ³label_french,
#> #   ⁴hint_english, ⁵hint_french, ⁶relevant, ⁷appearance
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
```

Generating data

The {epidict} package provides gen_data() for generating random data. It takes three main arguments: the dictionary, the dictionary column that holds the variable names (varnames), and the number of rows needed in the output (numcases). The examples in this README also pass org = "MSF".

``` r
gen_data("Measles", varnames = "data_element_shortname", numcases = 100, org = "MSF")
#> # A tibble: 100 × 52
#>    case_number date_of_c…¹ patie…² patie…³ age_y…⁴ age_m…⁵ age_d…⁶ sex pregn…⁷
#>  1 A1 2018-04-05 IP Villag… 38 NA NA M NA
#>  2 A2 2018-01-07 OP Villag… 77 NA NA M NA
#>  3 A3 2018-03-18 IP Villag… 13 NA NA F NA
#>  4 A4 2018-02-16 IP Villag… 28 NA NA F Y
#>  5 A5 2018-03-25 IP Villag… NA NA 9 U NA
#>  6 A6 2018-03-16 IP Villag… 24 NA NA F Y
#>  7 A7 2018-04-09 OP Villag… 86 NA NA M NA
#>  8 A8 2018-04-08 IP Villag… 8 NA NA U NA
#>  9 A9 2018-03-23 OP Villag… 60 NA NA M NA
#> 10 A10 2018-04-23 OP Villag… 21 NA NA U NA
#> # … with 90 more rows, 43 more variables: trimester,
#> #   foetus_alive_at_admission, exit_status, date_of_exit,
#> #   time_to_death, pregnancy_outcome_at_exit,
#> #   baby_born_with_complications, previously_vaccinated,
#> #   previous_vaccine_doses_received, detected_by,
#> #   msf_involvement, residential_status,
#> #   residential_status_brief, date_of_last_vaccination, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

gen_data("Vaccination_long", varnames = "name", numcases = 100, org = "MSF")
#> # A tibble: 100 × 120
#>    start end today deviceid date team_…¹ villa…² villa…³ clust…⁴ house…⁵
#>  1 NA NA NA NA 2018-02-18 NA villag… NA 1 1
#>  2 NA NA NA NA 2018-02-13 NA villag… NA 7 2
#>  3 NA NA NA NA 2018-02-21 NA villag… NA 6 1
#>  4 NA NA NA NA 2018-02-24 NA other NA 11 1
#>  5 NA NA NA NA 2018-03-21 NA villag… NA 3 1
#>  6 NA NA NA NA 2018-04-06 NA villag… NA 7 1
#>  7 NA NA NA NA 2018-02-09 NA villag… NA 3 2
#>  8 NA NA NA NA 2018-04-29 NA villag… NA 4 3
#>  9 NA NA NA NA 2018-04-15 NA villag… NA 8 1
#> 10 NA NA NA NA 2018-03-26 NA villag… NA 8 3
#> # … with 90 more rows, 110 more variables: households_building,
#> #   random_hh, consent, no_consent_reason,
#> #   no_consent_other, caretaker_relation, caretaker_other,
#> #   number_children, child_number, sex, date_birth,
#> #   age_years, age_months, any_vaccine, vaccine_card,
#> #   hf_records, health_facility, date_records_checked,
#> #   injection_upper_arm, scar_present, poliodrop_woc, …
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
```
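
As a rough illustration of the idea behind dictionary-driven data generation, the base R sketch below draws random option codes for each categorical variable in a toy dictionary. This is not how gen_data() is implemented; the dictionary values, the admission_type variable, and the sample_from_dict() helper are all assumptions made for the example.

``` r
set.seed(1)  # make the toy example reproducible

# Hypothetical long-format dictionary: one row per variable/option pair.
dict_long <- data.frame(
  data_element_shortname = c("sex", "sex", "sex",
                             "admission_type", "admission_type"),
  option_code            = c("M", "F", "U", "IP", "OP"),
  stringsAsFactors       = FALSE
)

# Hypothetical helper, NOT the gen_data() implementation: draw `numcases`
# random codes (with replacement) for every variable in the dictionary.
sample_from_dict <- function(dict, numcases) {
  codes_by_var <- split(dict$option_code, dict$data_element_shortname)
  columns <- lapply(codes_by_var, sample, size = numcases, replace = TRUE)
  as.data.frame(columns, stringsAsFactors = FALSE)
}

fake <- sample_from_dict(dict_long, numcases = 5)
str(fake)
```

The real function also produces dates, ages, and identifiers, as the output above shows; the sketch covers only the categorical columns.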

Cleaning data with the dictionaries

You can use the dictionaries to clean the data via the {matchmaker} package:

``` r
library("matchmaker")
library("dplyr")

dat <- gen_data(
  dictionary = "Cholera",
  varnames = "data_element_shortname",
  numcases = 20,
  org = "MSF"
)
print(dat)
#> # A tibble: 20 × 45
#>    case_number date_of_c…¹ patie…² age_y…³ age_m…⁴ age_d…⁵ sex pregn…⁶ trime…⁷
#>  1 A1 2018-03-08 Villag… 8 NA NA M NA
#>  2 A2 2018-04-09 Villag… 9 NA NA U NA
#>  3 A3 2018-02-21 Villag… 8 NA NA M NA
#>  4 A4 2018-04-24 Villag… 3 NA NA U NA
#>  5 A5 2018-04-07 Villag… 52 NA NA M NA
#>  6 A6 2018-01-30 Villag… 24 NA NA M NA
#>  7 A7 2018-02-21 Villag… 57 NA NA U NA
#>  8 A8 2018-01-10 Villag… 12 NA NA F NA
#>  9 A9 2018-01-11 Villag… NA NA 3 U NA
#> 10 A10 2018-04-02 Villag… 7 NA NA F Y 2
#> 11 A11 2018-04-10 Villag… 12 NA NA F W
#> 12 A12 2018-02-19 Villag… 23 NA NA M NA
#> 13 A13 2018-02-12 Villag… NA 8 NA F N
#> 14 A14 2018-01-31 Villag… 66 NA NA U NA
#> 15 A15 2018-01-11 Villag… 7 NA NA U NA
#> 16 A16 2018-04-22 Villag… 25 NA NA M NA
#> 17 A17 2018-01-10 Villag… 27 NA NA F NA
#> 18 A18 2018-01-18 Villag… 16 NA NA F N
#> 19 A19 2018-03-05 Villag… 9 NA NA F NA
#> 20 A20 2018-02-13 Villag… 8 NA NA M NA
#> # … with 36 more variables: foetus_alive_at_admission, exit_status,
#> #   date_of_exit, time_to_death, pregnancy_outcome_at_exit,
#> #   previously_vaccinated, previous_vaccine_doses_received,
#> #   readmission, msf_involvement,
#> #   cholera_treatment_facility_type, residential_status_brief,
#> #   date_of_last_vaccination, prescribed_zinc_supplement,
#> #   prescribed_antibiotics, ors_consumed_litres, …
#> # ℹ Use `colnames()` to see all variable names

# We want the expanded dictionary, so we will select `compact = FALSE`
dict <- msf_dict(
  disease = "Cholera",
  long = TRUE,
  compact = FALSE,
  tibble = TRUE
)
print(dict)
#> # A tibble: 182 × 11
#>    data_elemen…¹ data_…² data_…³ data_…⁴ data_…⁵ data_…⁶ used_…⁷ optio…⁸ optio…⁹
#>  1 AafTlSwliVQ egen_0… case_n… Anonym… TEXT Case n…
#>  2 OTGOtWBz39J egen_0… date_o… Date p… DATE Date o…
#>  3 wnmMr2V3T3u egen_0… patien… Locati… ORGANI… Patien…
#>  4 sbgqjeVwtb8 egen_0… age_ye… Age of… INTEGE… Age in…
#>  5 eXYhovYyl61 egen_0… age_mo… Age of… INTEGE… Age in…
#>  6 UrYJSk2Wp46 egen_0… age_da… Age of… INTEGE… Age in…
#>  7 D1Ky5K7pFN6 egen_0… sex Sex of… TEXT Sex orgc5Y… M Male
#>  8 D1Ky5K7pFN6 egen_0… sex Sex of… TEXT Sex orgc5Y… F Female
#>  9 D1Ky5K7pFN6 egen_0… sex Sex of… TEXT Sex orgc5Y… U Unknow…
#> 10 dTm5R53YYXC egen_0… pregna… Pregna… TEXT Pregna… IEjzG2… N Not cu…
#> # … with 172 more rows, 2 more variables: option_uid,
#> #   option_order_in_set, and abbreviated variable names
#> #   ¹data_element_uid, ²data_element_name, ³data_element_shortname,
#> #   ⁴data_element_description, ⁵data_element_valuetype, ⁶data_element_formname,
#> #   ⁷used_optionset_uid, ⁸option_code, ⁹option_name
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

# Now we can use matchmaker to filter the data
dat_clean <- matchmaker::match_df(dat, dict,
  from = "option_code",
  to = "option_name",
  by = "data_element_shortname",
  order = "option_order_in_set"
)
print(dat_clean)
#> # A tibble: 20 × 45
#>    case_number date_of_c…¹ patie…² age_y…³ age_m…⁴ age_d…⁵ sex pregn…⁶ trime…⁷
#>  1 A1 2018-03-08 Villag… 8 NA NA Male Not ap…
#>  2 A2 2018-04-09 Villag… 9 NA NA Unkn… Not ap…
#>  3 A3 2018-02-21 Villag… 8 NA NA Male Not ap…
#>  4 A4 2018-04-24 Villag… 3 NA NA Unkn… Not ap…
#>  5 A5 2018-04-07 Villag… 52 NA NA Male Not ap…
#>  6 A6 2018-01-30 Villag… 24 NA NA Male Not ap…
#>  7 A7 2018-02-21 Villag… 57 NA NA Unkn… Not ap…
#>  8 A8 2018-01-10 Villag… 12 NA NA Fema… Not ap…
#>  9 A9 2018-01-11 Villag… NA NA 3 Unkn… Not ap…
#> 10 A10 2018-04-02 Villag… 7 NA NA Fema… Yes, c… 2nd tr…
#> 11 A11 2018-04-10 Villag… 12 NA NA Fema… Was pr…
#> 12 A12 2018-02-19 Villag… 23 NA NA Male Not ap…
#> 13 A13 2018-02-12 Villag… NA 8 NA Fema… Not cu…
#> 14 A14 2018-01-31 Villag… 66 NA NA Unkn… Not ap…
#> 15 A15 2018-01-11 Villag… 7 NA NA Unkn… Not ap…
#> 16 A16 2018-04-22 Villag… 25 NA NA Male Not ap…
#> 17 A17 2018-01-10 Villag… 27 NA NA Fema… Not ap…
#> 18 A18 2018-01-18 Villag… 16 NA NA Fema… Not cu…
#> 19 A19 2018-03-05 Villag… 9 NA NA Fema… Not ap…
#> 20 A20 2018-02-13 Villag… 8 NA NA Male Not ap…
#> # … with 36 more variables: foetus_alive_at_admission, exit_status,
#> #   date_of_exit, time_to_death, pregnancy_outcome_at_exit,
#> #   previously_vaccinated, previous_vaccine_doses_received,
#> #   readmission, msf_involvement,
#> #   cholera_treatment_facility_type, residential_status_brief,
#> #   date_of_last_vaccination, prescribed_zinc_supplement,
#> #   prescribed_antibiotics, ors_consumed_litres, …
#> # ℹ Use `colnames()` to see all variable names
```
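
For readers who want to see what this kind of dictionary-based recoding amounts to, here is a minimal base R sketch that maps option codes to option labels with match(). It is an illustration on toy data, not a description of how matchmaker::match_df() works internally; the data values and the recode_with_dict() helper are assumptions, and only the dictionary column names are taken from the output above.

``` r
# Toy data and long-format dictionary (hypothetical values).
dat <- data.frame(
  case_number      = c("A1", "A2", "A3"),
  sex              = c("M", "U", "F"),
  stringsAsFactors = FALSE
)
dict <- data.frame(
  data_element_shortname = c("sex", "sex", "sex"),
  option_code            = c("M", "F", "U"),
  option_name            = c("Male", "Female", "Unknown"),
  option_order_in_set    = c(1, 2, 3),
  stringsAsFactors       = FALSE
)

# Hypothetical helper: for every column of `dat` that the dictionary
# describes, replace codes with labels and store the result as a factor
# whose levels follow option_order_in_set.
recode_with_dict <- function(dat, dict) {
  shared <- intersect(names(dat), unique(dict$data_element_shortname))
  for (variable in shared) {
    opts   <- dict[dict$data_element_shortname == variable, ]
    opts   <- opts[order(opts$option_order_in_set), ]
    labels <- opts$option_name[match(dat[[variable]], opts$option_code)]
    dat[[variable]] <- factor(labels, levels = opts$option_name)
  }
  dat
}

dat_clean <- recode_with_dict(dat, dict)
print(dat_clean)
```

Columns that the dictionary does not describe (here, case_number) pass through unchanged, and codes missing from the dictionary become NA in this sketch.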