StirlingCodingClub / studyGroup

Gather together a group to skill-share, co-work, and create community
http://StirlingCodingClub.github.io/studyGroup/

Question: User defined function for multiple plotting #14

Open mattnuttall00 opened 5 years ago

mattnuttall00 commented 5 years ago

Hi SCC team,

I have been trying to figure out how to write a function that will streamline the creation of multiple histograms. The idea, of course, is to save time, but I can't quite figure out how to do it, so I am in fact wasting time going round in circles. SCC to the rescue! Hopefully this will be a useful post for other folk who would like to dabble in UDFs (user-defined functions).

I have a fairly large data set of socioeconomic variables from Cambodia. There are 42 variables and 1621 observations (each observation is one Commune, which is just an administrative area, with that Commune's value for each variable). There is also a variable called 'Province', which is a larger administrative area. The structure is below:

> str(dat_master)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1621 obs. of  42 variables:
 $ CommCode       : Factor w/ 1621 levels "10101","10102",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Province       : Factor w/ 24 levels "Banteay Meanchey",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Commune        : Factor w/ 1438 levels "achar Leak","Aekakpheap",..: 706 651 77 88 152 513 537 677 760 971 ...
 $ tot_pop        : int  92143 18830 22007 6966 16619 7930 7528 14256 10374 11666 ...
 $ family         : int  16963 3688 3682 1476 3325 1381 1485 2692 2305 2314 ...
 $ male_18_60     : int  41801 8627 9610 3083 7398 3176 3925 6776 5147 5372 ...
 $ fem_18_60      : int  41156 8228 9117 2993 7148 3183 3774 6454 4862 5133 ...
 $ pop_over61     : int  2456 974 1552 472 1273 470 453 893 725 837 ...
 $ numPrimLivFarm : num  1072 3048 2829 1419 3203 ...
 $ Fish_man       : num  2 4 0 2 0 20 73 3 11 0 ...
 $ ntfp_fam       : num  3 0 0 0 0 0 0 0 5 1 ...
 $ land_confl     : int  221 23 10 1 14 2 9 16 13 8 ...
 $ Pax_migt_in    : int  4786 160 54 18 129 0 60 72 25 7 ...
 $ Pax_migt_out   : int  482 275 19 61 73 3 13 30 15 19 ...
 $ F6_24_sch      : num  0.714 0.613 0.61 0.667 0.676 ...
 $ M6_24_sch      : num  0.668 0.569 0.627 0.693 0.674 ...
 $ F18_60_ill     : num  0.1795 0.1884 0.0453 0.0583 0.1256 ...
 $ M18_60_ill     : num  0.1604 0.1619 0.0478 0.0285 0.12 ...
 $ propPrimLivFarm: num  0.195 0.8 0.822 0.969 0.969 ...
 $ fam_prod       : num  0.000522 0.003486 0 0.007574 0.001103 ...
 $ Cloth_craft    : num  0 0 0 0 0 ...
 $ Trader         : num  0.05869 0.01786 0.00557 0 0.00154 ...
 $ serv_prov      : num  0.23967 0.00349 0.00662 0.0021 0.00331 ...
 $ T18_60_uncjob  : num  0.7513 0.6358 0.622 0.0279 0.0968 ...
 $ Les1_R_Land    : num  0.0142 0.2015 0.3101 0.1355 0.1154 ...
 $ No_R_Land      : num  0.0928 0.3063 0.1712 0.1435 0.2359 ...
 $ Les1_F_Land    : num  0.0226 0.00599 0.03018 0.00433 0.02071 ...
 $ No_F_Land      : num  0.03421 0.00599 0.00403 0.00433 0.01476 ...
 $ cow_fam        : num  0.021 0.1472 0.2089 0.0648 0.148 ...
 $ pig_fam        : num  0.0252 0.1806 0.2483 0.1361 0.1685 ...
 $ garbage        : num  0.046 0 0 0 0 ...
 $ KM_Market      : num  4.7 3.83 3.89 4.68 6.75 ...
 $ KM_Comm        : num  5.95 5.93 4.05 1.84 3.87 ...
 $ YR_Pp_well     : num  157.45 10.47 2.21 1 22.21 ...
 $ wat_safe       : num  0.6467 0.3384 0.0607 0.1022 0.1192 ...
 $ wat_pipe       : num  0.169 0.251 0.119 0.108 0.179 ...
 $ crim_case      : num  0.000193 0.000333 0 0.000184 0.000548 ...
 $ KM_Heal_cent   : num  2.74 5.8 3.92 2.23 4.64 ...
 $ inf_mort       : num  1.93e-04 0.00 4.96e-05 0.00 5.47e-05 ...
 $ U5_mort        : num  2.18e-04 0.00 0.00 0.00 3.63e-05 ...
 $ Prop_Indigenous: num  0 0 0 0 0 0 0 0 0 0 ...
 $ dist_sch       : num  1.5 5 3 1.5 3.5 3 1 2.15 2 2 ...

I have only recently started working on this data set, so I am still doing some data exploration. That means a lot of plotting, for example histograms, to identify outliers, errors, etc. The first thing I wanted to do was plot histograms of each variable at the 'Province' level, so that I could spot any unusual records. I know I can do it this way:

## Subset data by Province ####
battambang <- dat_master %>% 
  filter(., Province == "Battambang") 
banteay_meanchey <- dat_master %>% 
  filter(., Province == "Banteay Meanchey")
kampong_speu <- dat_master %>% 
  filter(., Province == "Kampong Speu")
kep <- dat_master %>% 
  filter(., Province == "Kep")
etc
etc

## histograms for total population by province
tp1 <- qplot(battambang$tot_pop, geom = "histogram")
tp2 <- qplot(banteay_meanchey$tot_pop, geom = "histogram")
tp3 <- qplot(kampong_speu$tot_pop, geom = "histogram")
tp4 <- qplot(kep$tot_pop, geom = "histogram")
tp5 <- qplot(otdar_meanchey$tot_pop, geom = "histogram")
tp6 <- qplot(preah_vihear$tot_pop, geom = "histogram")
tp7 <- qplot(siem_reap$tot_pop, geom = "histogram")
tp8 <- qplot(kampong_thom$tot_pop, geom = "histogram")
tp9 <- qplot(koh_kong$tot_pop, geom = "histogram")
tp10 <- qplot(pailin$tot_pop, geom = "histogram")
etc
etc

plot_grid(tp1,tp2,tp3,tp4,tp5,tp6,tp7,tp8,tp9,tp10,tp11,tp12,tp13,tp14,tp15,tp16,tp17,tp18,tp19,tp20,tp21,tp22,tp23,tp24)

But to do this for 42 variables seems silly. I am sure a function can be written, so that I could simply call

plot_function(variable = tot_pop)

and histograms for that variable for each province would be printed.

I have tried a few ways, but I am not getting very far, and seeing as I've never attempted a UDF before, I thought it best to get some guidance! Below are some of the ways I've tried:

histplot <- function(x) {
  p1 <- qplot(battambang$x, geom = "histogram")
  p2 <- qplot(banteay_meanchey$x, geom = "histogram")
  p3 <- qplot(kampong_speu$x, geom = "histogram")

 q <- plot_grid(p1,p2,p3)
 print(q)
}
histplot(x = tot_pop)

and

histplot <- function(prov,varx){

 t <- dat_master %>% 
       filter(Province==prov) %>% 
       select(varx)
 plotx <- qplot(t$varx, geom="histogram")
print(plotx)
}

histplot(prov = "Battambang", varx = "tot_pop")

and

histplot1 <- function(x){
  nm <- levels(dat_master$Province)
  for (i in seq_along(nm)){
    print(ggplot(x, aes_string(x = nm[i])) + geom_histogram())
  }
}
histplot1(x=battambang)

The method above uses the new data frames created when subsetting by Province. E.g. in histplot1(x=battambang), battambang is a data frame.

Unfortunately none of those ways are working. Any ideas would be most appreciated!
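
A note on why the first two attempts probably fail: `$` does not evaluate its right-hand side, so `battambang$x` and `t$varx` look for columns literally named `x` and `varx` rather than the column whose name was passed in. In the third attempt, `aes_string(x = nm[i])` appears to pass Province names where a column name is expected. Below is a minimal base R sketch of the `$` versus `[[` difference. The `toy` data frame, its values, and the two helper functions are made up for illustration and are not part of dat_master.

```r
# Toy data frame standing in for one Province subset (values are made up)
toy <- data.frame(tot_pop = c(100, 250, 400), family = c(20, 55, 90))

# Hypothetical helper: `$` looks for a column literally called "varx",
# which does not exist, so this returns NULL
get_with_dollar <- function(dat, varx) {
  dat$varx
}

# Hypothetical helper: `[[` evaluates varx first, so the column whose
# name is stored in varx is returned
get_with_brackets <- function(dat, varx) {
  dat[[varx]]
}

get_with_dollar(toy, "tot_pop")    # NULL
get_with_brackets(toy, "tot_pop")  # the tot_pop column
```

So inside a function, `dat[[varx]]` with the column name given as a string is usually the safer way to pick out a column.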

anna-deasey commented 5 years ago

@mattnuttall00 duuuuuuuuuude you need ggplot2 and facet_wrap or facet_grid

ggplot(data = dat_master, aes(tot_pop)) + geom_histogram() + facet_wrap(Province ~ . )

mattnuttall00 commented 5 years ago

Oh. my. days.

I did try facet_grid right at the start and that didn't do what I needed, but I did not try facet_wrap.

That worked perfectly, just a note though: I got an error when I did facet_wrap(Province ~ .), but facet_wrap(~Province) did exactly what I need.
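
For anyone who still wants the one-line call from the original question, here is a rough sketch of how `facet_wrap(~Province)` could be wrapped in a function. It assumes a reasonably recent ggplot2 (3.0.0 or later, for the `.data` pronoun) and takes the variable name as a string. The `toy` data frame and its values are made up to keep the example self-contained, and the extra `dat` argument and `bins = 30` are illustrative choices rather than anything from the thread.

```r
library(ggplot2)

# Sketch of a helper: histograms of one variable, one panel per Province.
# `variable` is the column name as a string, e.g. "tot_pop".
plot_function <- function(dat, variable) {
  ggplot(dat, aes(x = .data[[variable]])) +
    geom_histogram(bins = 30) +
    facet_wrap(~Province) +
    labs(x = variable)
}

# Toy data standing in for dat_master (values are made up)
set.seed(1)
toy <- data.frame(
  Province = rep(c("Battambang", "Kep"), each = 50),
  tot_pop  = c(rnorm(50, 10000, 2000), rnorm(50, 5000, 1000))
)

p <- plot_function(toy, "tot_pop")
print(p)
```

With the real data, this would be called as `plot_function(dat_master, "tot_pop")`, and looping over a vector of column names with `print(plot_function(dat_master, v))` would cover the other variables.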

Thank you Anna!

Now please excuse me while I go and repeatedly bang my head against a wall

anna-deasey commented 5 years ago

boom!! facet_wrap the ass out of that dataset!

mattnuttall00 commented 5 years ago

Not sure whether I should be closing this issue or not. I guess it's worth leaving coding questions open so that other people can easily search for and find them in the future?

bradduthie commented 5 years ago

Probably best to leave open, at least until the number of issues gets to be out of control.