StirlingCodingClub / studyGroup

Gather together a group to skill-share, co-work, and create community
http://StirlingCodingClub.github.io/studyGroup/
Other
2 stars 1 forks source link

Question: how to use a loop to compare and extract data from dataframe #11

Open mattnuttall00 opened 5 years ago

mattnuttall00 commented 5 years ago

Hi all, I am going to attempt to post the first coding question! Apologies in advance if I don't quite nail the "how to post a good question", but please feel free to use this as an opportunity to offer suggestions for future users about how to post a good question. i.e. you can tell me what I could have done to make the question better - you won't hurt my feelings :)

I have a dataframe with ~10,000 observations of 35 variables. I have attached a much smaller subset as an example (attachment at the bottom). The data are outputs from genetic analyses. The structure of the subsetted data looks like this:

> str(dat)
'data.frame':   33 obs. of  35 variables:
 $ Sample_File : Factor w/ 2 levels "A01_1.fsa","B01_2.fsa": 1 1 1 1 1 1 1 1 1 1 ...
 $ Sample_Name : int  1 1 1 1 1 1 1 1 1 1 ...
 $ PCR         : Factor w/ 3 levels "PCR1","PCR2",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ Run_Name    : Factor w/ 3 levels "Kez_PM_Plex1PCR1_A_001_2018-08-02",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ Panel       : Factor w/ 1 level "Plex1-final": 1 1 1 1 1 1 1 1 1 1 ...
 $ Marker      : Factor w/ 7 levels "Ggu234","Ma2",..: 4 4 4 1 1 1 2 2 2 3 ...
 $ Dye         : Factor w/ 2 levels "B","G": 2 2 2 2 2 2 2 2 2 1 ...
 $ Allele_1    : int  215 215 215 73 71 71 139 139 139 148 ...
 $ Allele_2    : int  215 215 215 102 73 73 143 143 143 156 ...
 $ Size_1      : num  215.2 215.5 214.9 73.8 71.2 ...
 $ Size_2      : num  215 215 215 103 74 ...
 $ Height_1    : int  30617 29533 29633 13345 18997 17313 28923 26329 26827 29673 ...
 $ Height_2    : int  30617 29533 29633 1321 30163 30419 23950 26032 23219 28703 ...
 $ Peak_Area.1 : int  131010 141687 148406 49024 79300 66837 114264 123179 126678 121743 ...
 $ Peak_Area.2 : int  131010 141687 148406 4336 190909 175520 74183 101655 97147 97453 ...
 $ Data_Point.1: int  2735 2799 2711 1885 1903 1851 2271 2316 2248 2318 ...
 $ Data_Point.2: int  2735 2799 2711 2051 1919 1866 2296 2343 2274 2366 ...
 $ ADO         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ AE          : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ OS          : int  0 0 0 0 -4 0 0 -4 0 0 ...
 $ SHP         : int  0 0 0 0 -4 0 0 -4 0 0 ...
 $ OBA         : int  0 0 0 0 -4 0 0 -4 0 0 ...
 $ SPA         : int  0 0 0 0 -1 0 0 -4 0 0 ...
 $ SP          : int  0 0 0 0 -4 0 0 -4 0 0 ...
 $ BIN         : int  0 0 0 0 -1 0 0 -1 0 0 ...
 $ PHR         : int  -2 -2 -2 1 -1 0 0 -4 0 0 ...
 $ LPH         : int  0 0 0 0 -4 0 0 -4 0 0 ...
 $ SPU         : int  0 0 0 0 -4 0 0 -4 0 0 ...
 $ AN          : int  0 0 0 0 -1 0 0 -4 0 0 ...
 $ BD          : int  0 0 0 0 -1 0 0 -4 0 0 ...
 $ XTLK        : int  0 0 0 0 -4 0 0 -4 0 0 ...
 $ GQ          : num  0.771 0.771 0.771 0.487 1 0.771 0.771 1 0.771 0.771 ...
 $ UD1         : Factor w/ 1 level "Plex1": 1 1 1 1 1 1 1 1 1 1 ...
 $ UD2         : Factor w/ 2 levels "","PCR3": 1 1 2 1 1 2 1 1 2 1 ...
 $ UD3         : Factor w/ 3 levels "","PCR2","reseq": 1 2 3 1 2 3 1 2 3 1 ...

What I need to do is to first look at each unique 'Sample_Name', and then within that 'Sample_Name' look at each 'Marker'. And then within each 'Marker' (which are normally 3 rows in length), I want R to look row by row down the column 'Allele_1' and if the values are the same I want it to put one row of data into a new dataframe with some of the same column headers as the original data. I then need R to do the same thing, but for column 'Allele_2'.

So for example, if a portion of the original dataframe looked like this (I've excluded unnecessary cols):

Sample_Name Marker Allele_1 Allele_2
          1      a      215      215
          1      a      215      215
          1      a      215      215
          1      b      102       73
          1      b       73       73
          1      b       73       71
          2      a      143      139
          2      a      143      139
          2      a      143      139
          2      b      139      148
          2      b      139      148
          2      b      139       NA

I would want R to create the following dataframe:

Sample_Name Marker Allele_1 Allele_2
         1      a      215      215
         2      a      143      139
         2      b      139       NA

I am guessing this needs to be done using a loop (but feel free to suggest an easier way!). I am very much a beginner with loops, but I have had a crack anyway. Surprise surprise, it hasn't worked, and I'm assuming I am wildly wrong, but what I have tried is below. (note: I have only tried to do it for Allele_1)

sample.name <- unique(dat$Sample_Name) # Identify unique sample names
marker.name <- unique(dat$Marker) # Identify unique maker names
consensus <- data.frame(Sample_Name=integer(),  # create new dataframe
                        Marker=character(), 
                        Allele_1=integer(), 
                        Allele_2=integer(), 
                        Consensus=integer())

for (s in 1:length(sample.name)) {     # for each sample name
  sub.name <- dat[dat$Sample_Name==sample.name[s],]  # subset first by unique sample name 
  for (m in 1:length(sub.name)) {      # then go through the subsetted data row by row
    sub.marker <- sub.name[sub.name$Marker==marker.name[m],] # and subset again by unique marker name
      for (i in nrow(sub.marker)) {                          # go row by row in subsetted marker data
        if (Allele_1[i]==Allele_1[i+1]){                # if value in row i is the same as other rows
          consensus <- dat[,c(2,6,8,9)]               # then put that row (for selected columns) into new df
        }
      } 
  }
}

When I run the above code I get the error:

Error in if (dat$Allele_1[i] == dat$Allele_1[i + 1]) { : 
  argument is of length zero

So I guess I am not getting that bit of code right? But potentially a lot of it isn't right...! Any guidance would most appreciated!
All PCRs Plex1 missing 002 PCR2.xlsx

John: In your example, what these rows? Are they supposed to be removed?

1      b      102       73
1      b       73       73
2      b      139      148
bradduthie commented 5 years ago

Hi @mattnuttall00,

One nice feature of unique is that it can be applied across multiple dimensions of a data frame. I have added the shortened example you gave into a CSV file, and read it into R below.

dat     <- read.csv(file = "Matt_loop_question_short.csv");

So I'm starting off with the below.

   Sample_Name Marker Allele_1 Allele_2
1            1      a      215      215
2            1      a      215      215
3            1      a      215      215
4            1      b      102       73
5            1      b       73       73
6            1      b       73       71
7            2      a      143      139
8            2      a      143      139
9            2      a      143      139
10           2      b      139      148
11           2      b      139      148
12           2      b      139       NA

So, if you want all of the unique columns of Sample_Name, Marker, and Allele_1 -- then all unique columns of Sample_Name, Marker, and Allele_2, the code below is a quick way to do it.

allele1 <- unique(dat[,1:3]);
allele2 <- unique(dat[,c(1:2, 4)]);

The above gives the following for allele1.

   Sample_Name Marker Allele_1
1            1      a      215
4            1      b      102
5            1      b       73
7            2      a      143
10           2      b      139

And it gives the below for allele2.

   Sample_Name Marker Allele_2
1            1      a      215
4            1      b       73
6            1      b       71
7            2      a      139
10           2      b      148
12           2      b       NA

So using your data set, I ran the following code.

sample.name <- unique(dat$Sample_Name);
marker.name <- as.character( unique(dat$Marker) );
allele.name <- unique(dat$Allele_1);
sample.col  <- which( colnames(dat) == "Sample_Name");
marker.col  <- which( colnames(dat) == "Marker");
allele.col  <- which( colnames(dat) == "Allele_1");
unique(dat[,c(sample.col, marker.col, allele.col)]);

I got this output.

  Sample_Name   Marker Allele_1
1            1   Mar21.      215
4            1   Ggu234       73
5            1   Ggu234       71
7            1      Ma2      139
10           1   Mar08.      148
12           1   Mar08.       NA
13           1   Mar64.      183
16           2   Ggu234       73
19           2      Ma2      133
22           2   Mar08.      156
23           2   Mar08.      148
25           2   Mar64.      183
28           2     Mel1      111
30           2     Mel1       NA
31           2 PM_24311      125
33           2 PM_24311      127

The triple loop way of doing it that you used is perfectly reasonable too (it might take a bit more time than the above just because loops are slow in R -- underneath everything, unique is looping things in C, which is faster). I had some comments about the error message you received, but then realised that I copied down your error message incorrectly (I'm confused as to why the error message states if (dat$Allele_1[i] == dat$Allele_1[i + 1]), but in your code it appears to be written if (Allele_1[i]==Allele_1[i+1]) -- is this an error?). In any case, I'm happy to show a looping example to do the equivalent of a multi-column call of unique, if that would be useful?

mattnuttall00 commented 5 years ago

Hi @bradduthie , thanks for your comprehensive response! It was really useful, and good to know about unique, and what it can do.

So I think I may not have explained the final step very well. Other people I've spoken to also didn't get what I meant either so totally my fault. But the final step I need to do is: rather than pull out all unique values, once I have got R to look at each 'Marker' within each 'Sample_Name', I need it to look at the three values in column 'Allele_1' (each Allele will have 3 rows within each Marker), and only if they are the same number in all 3 rows, should R pull it out and put the information into the new dataframe (and it only needs to pull out 1 row, so that I know the sample name, marker, and the value). That's why my output dataframe in my original question has only 3 rows - because Sample_Number 1, Marker b does not have matching values in Allele_1, nor does it have matching values in Allele_2, so it gets excluded.

That's what my

if (dat$Allele_1[i]==dat$Allele_1[i+1]){                # if value in row i is the same as other rows
          consensus <- dat[,c(2,6,8,9)]

was trying to do. I'm trying to tell R to see if the three values in the column 'Allele_1' (once separated by Sample_Name and Marker) are the same, and if they are, pull out a row. Although I know that code is wrong!

Does that make sense? Apologies if I am explaining it really poorly. Do let me know if I'm not being clear and I will try again!

bradduthie commented 5 years ago

@mattnuttall00 Ah! Now I see what you mean. I think I have a way that should work -- basically, use the unique as I did, but then loop through each row of that unique output (e.g., my results for allele1 above), then (using another loop, perhaps -- could be a quicker way) count the number of rows for which the unique values are identical -- if at the end it's >2, then retain these rows.

Sorry, I'll try to give some more meaningful code later, but the idea is this:

allele1         <- unique(dat[,1:3]);
new_vector <- rep(0, length of allele1 rows)  
for i in 1 to row number of allele1
    for j in 1 in row number of original data table
        if(allele1[row, col1] == original data table[row, col1] &
           allele1[row, col2] == original data table[row, col2] &
           allele1[row, col2] == original data table[row, col2]){
           new_vector[i] <- new_vector[i] + 1;
          }
     }
}

If that makes sense?

anna-deasey commented 5 years ago

How 'bout this! @mattnuttall00

library(tidyverse)

matt <- read_csv("matt.csv") %>% as_tibble() # reading in your data - ive changed the names

matt <- matt %>% select(sn, m, a1, a2) # select relevant cols - again, ive changed names

matt %>% group_by(sn, m, a1) %>% summarise(count = n()) %>% filter(count >= 3) # this returns a df with 1 row for each sn/m/a1 that has 3 matching values

sn m a1 count
1   Ma2 139 3   
1   Mar21.  215 3   
1   Mar64.  183 3   
2   Ggu234  73  3   
2   Ma2 133 3   
2   Mar64.  183 3   

then youd have to do the same again with Allele_2, then join_tables or bind_cols

jmcvw commented 5 years ago

Hey Matt,

Here is a different take - no for loop necessary

The only thing I'm not sure about is why the third line of your desired result chose the row where Allete_2 was NA rather than 148. If the NA is not important, try the first example, otherwise try the other one.

library(dplyr)

df %>% 
    select(Sample_Name, Marker, Allele_1, Allele_2) %>%
    group_by(Sample_Name, Marker) %>%
    filter(length(unique(Allele_1)) == 1) %>% 
    slice(1)

If the NA is prefered, replace the filter line above with the code below. This example counts unique, non-NA's, sorts them to the top of the group then selects them. The sorting means the NA will be chosen no matter where it occurs in the subset (I reckon there's probably a cleaner method than this, but I can't see it at the moment).

    filter(length(unique(Allele_1)) == 1,
           sum(!is.na(unique(Allele_2))) == 1) %>%
    arrange(!is.na(Allele_2)) %>% 

Another option would be to not report the NA rows at all using the filter...

    filter(length(unique(Allele_1)) == 1, 
        length(unique(Allele_2)) == 1) %>%
mattnuttall00 commented 5 years ago

@bradduthie @anna-deasey @jmcvw Amazing - thanks so much to you all for a range of options! I will have a crack at all of them and see which is most appropriate. All approaches much appreciated!

@jmcvw the reason I would want the NA to be selected rather than 148 is because there are only two values of 148, therefore that allele doesn't fit the rule that to be selected there must be 3 matching values. And the structure of that output table was me assuming that entire rows would need to be selected, and because Allele_1 did meet the rule, there would need to be something under Allele_2 (and it couldn't be 148). Hope that makes sense!

Thanks again all :)