Hierarchy populating columns

mlperickson commented 6 years ago

I'd like to learn how to automate creating one set of fields from several sets of fields. For example, we have data for trace elements (arsenic in the example) that could be present in any one of 5 groups of fields. The 5 possible fields denote filtered/unfiltered/unknown whether filtered, with either known or unknown units. I would like to create one set of Arsenic fields that are populated from the 'best to worst' from the original fields, with the 'new combined result' consisting of 4 fields: original p-code, sample date, remark code, and value.

See the attached small data file with an example of the original fields and the group of 'combined' fields that I would like. Thank you!

As example.xlsx

lindsayplatt commented 6 years ago

@mlperickson having a little trouble following what you mean here. I think you want to look into tidyr::gather, but I'm not quite sure yet. Can you post some screen shots of the "before"/"after" data you are looking for. It can just be screenshots of the excel file you are working with.

mlperickson commented 6 years ago

tidyr::gather may work for my need -- but I'm not quite sure. Perhaps we can walk through Wednesday? I rearranged my example spreadsheet and attached. In the first example spreadsheet, the "before" would be columns A-G and L-Z. The "after" would be columns A-K. I have a specific 'best to worst' in mind for the sets of columns in L-Z, so I would like the new set of columns H-K to populate in a certain order, and I don't want a 'better' value to be overwritten with a 'worse' value -- but if the 'better' columns are blank, then I DO want the 'worse' value. In my example, the 'best to worst' order would be the groups of columns as follows: Best is Arsenic, filtered, micrograms per liter; then Arsenic, unfiltered, micrograms per liter, then Arsenic, unknown, micrograms per liter; then Arsenic, unfiltered, unknown units; then Arsenic, unknown, unknown units is worst. Hopefully my query makes more sense now. Thanks!! As example before and after.xlsx

lindsayplatt commented 6 years ago

We will walk through this today, but I accomplished this by a combination of tidyr + dplyr steps. My first step, however, was to remove the merged column headers - those are not nice to play with in R, so watch out when you have those in your data. You can see my adjustments to the way the columns are named in the sheet called Before.edited.

As.example.before.and.after_lcarredit.xlsx

# Load sheet with fixed columns (don't use merged columns)
# I added the pcode as the prefix to each merged column
arsenic_data <- readxl::read_excel('As.example.before.and.after_lcarredit.xlsx', sheet = 3)
names(arsenic_data)
[1] "Data Source"      "CycleText"        "PrincipalAquifer" "NetworkType"      "SuCode"          
 [6] "State"            "STAID"            "PN0015_Date"      "PN0015_Remark"    "PN0015_Value"    
[11] "PN0017_Date"      "PN0017_Remark"    "PN0017_Value"     "PN0018_Date"      "PN0018_Remark"   
[16] "PN0018_Value"     "PN0019_Date"      "PN0019_Remark"    "PN0019_Value"     "PN0020_Date"     
[21] "PN0020_Remark"    "PN0020_Value"

# Loading dplyr full library so I can get `%>%`
# Left tidyr:: and dplyr:: so you can see which function is from which package
library(dplyr)

# Create one column for each pcode group of columns
arsenic_data_united <- arsenic_data %>% 
  tidyr::unite("PN0015", dplyr::contains("PN0015"), sep="__") %>% 
  tidyr::unite("PN0017", dplyr::contains("PN0017"), sep="__") %>% 
  tidyr::unite("PN0018", dplyr::contains("PN0018"), sep="__") %>% 
  tidyr::unite("PN0019", dplyr::contains("PN0019"), sep="__") %>% 
  tidyr::unite("PN0020", dplyr::contains("PN0020"), sep="__")

# Gather values + pcodes
arsenic_data_gathered <- arsenic_data_united %>% 
  tidyr::gather(pcode, value, -c(1:7))

# Separate back out in date, remark, and value
arsenic_data_separated <- arsenic_data_gathered %>% 
  tidyr::separate(value, into = c("Date", "Remark", "Value"), 
                  sep = "__", convert = TRUE)

# Now add criteria for determining which is the best available for that site
arsenic_data_best <- arsenic_data_separated %>% 
  filter(!is.na(Value)) %>% # no need to look at anything that is missing a value
  group_by_at(1:7) %>% # group by the site info so
  # adding a new column of pcodes based on a hierarchy defined in ifelse statements
  mutate( # ifelse+any on a grouped df will look at all rows in that group at once
    best_pcode = ifelse(any(pcode == "PN0015"), yes = "PN0015",
                        no = ifelse(any(pcode == "PN0017"), yes = "PN0017",
                                    no = ifelse(any(pcode == "PN0018"), yes = "PN0018",
                                                no = ifelse(any(pcode == "PN0019"), yes = "PN0019",
                                                            no = ifelse(any(pcode == "PN0020"), yes = "PN0020",
                                                                        no = NA)))))) 

# Now, filter out the data where the pcode is not the "best" pcode available
arsenic_data_combed <- arsenic_data_best %>% 
  ungroup() %>% 
  filter(pcode == best_pcode) %>% 
  select(-best_pcode) # this is now a duplicate of the reg pcode column

head(as.data.frame(arsenic_data_combed))
  Data Source   CycleText                             PrincipalAquifer NetworkType  SuCode State
1        GWAM Cycle Unkwn Sand and gravel aquifers (glaciated regions)       Unkwn sdgvpau    SD
2        GWAM Cycle Unkwn Sand and gravel aquifers (glaciated regions)       Unkwn sdgvpau    IN
3        GWAM Cycle Unkwn Sand and gravel aquifers (glaciated regions)       Unkwn sdgvpau    MN
4        GWAM Cycle Unkwn Sand and gravel aquifers (glaciated regions)       Unkwn sdgvpau    MN
5        GWAM Cycle Unkwn Sand and gravel aquifers (glaciated regions)       Unkwn sdgvpau    SD
6        GWAM Cycle Unkwn Sand and gravel aquifers (glaciated regions)       Unkwn sdgvpau    IN
              STAID  pcode       Date Remark Value
1   sd126N76W16DDCD PN0015 2007-08-07   <NA> 38.00
2         in01688NC PN0015 2011-06-30   <NA>  6.00
3          mn639311 PN0015 2010-05-25   <NA>  2.18
4          mn639314 PN0015 2007-06-25   <NA>  1.36
5 sd126N76W16DDCD 1 PN0015 2007-08-07      <  1.01
6         in02202RS PN0015 2011-05-12      <  1.00

lindsayplatt commented 6 years ago

Covered in the 7/25 session.

USGS-R / hypeRusgs

Hierarchy populating columns #6