Open mattnuttall00 opened 5 years ago
Hi @mattnuttall00,
One nice feature of unique
is that it can be applied across multiple dimensions of a data frame. I have added the shortened example you gave into a CSV file, and read it into R below.
dat <- read.csv(file = "Matt_loop_question_short.csv");
So I'm starting off with the below.
Sample_Name Marker Allele_1 Allele_2
1 1 a 215 215
2 1 a 215 215
3 1 a 215 215
4 1 b 102 73
5 1 b 73 73
6 1 b 73 71
7 2 a 143 139
8 2 a 143 139
9 2 a 143 139
10 2 b 139 148
11 2 b 139 148
12 2 b 139 NA
So, if you want all of the unique columns of Sample_Name
, Marker
, and Allele_1
-- then all unique columns of Sample_Name
, Marker
, and Allele_2
, the code below is a quick way to do it.
allele1 <- unique(dat[,1:3]);
allele2 <- unique(dat[,c(1:2, 4)]);
The above gives the following for allele1
.
Sample_Name Marker Allele_1
1 1 a 215
4 1 b 102
5 1 b 73
7 2 a 143
10 2 b 139
And it gives the below for allele2
.
Sample_Name Marker Allele_2
1 1 a 215
4 1 b 73
6 1 b 71
7 2 a 139
10 2 b 148
12 2 b NA
So using your data set, I ran the following code.
sample.name <- unique(dat$Sample_Name);
marker.name <- as.character( unique(dat$Marker) );
allele.name <- unique(dat$Allele_1);
sample.col <- which( colnames(dat) == "Sample_Name");
marker.col <- which( colnames(dat) == "Marker");
allele.col <- which( colnames(dat) == "Allele_1");
unique(dat[,c(sample.col, marker.col, allele.col)]);
I got this output.
Sample_Name Marker Allele_1
1 1 Mar21. 215
4 1 Ggu234 73
5 1 Ggu234 71
7 1 Ma2 139
10 1 Mar08. 148
12 1 Mar08. NA
13 1 Mar64. 183
16 2 Ggu234 73
19 2 Ma2 133
22 2 Mar08. 156
23 2 Mar08. 148
25 2 Mar64. 183
28 2 Mel1 111
30 2 Mel1 NA
31 2 PM_24311 125
33 2 PM_24311 127
The triple loop way of doing it that you used is perfectly reasonable too (it might take a bit more time than the above just because loops are slow in R -- underneath everything, unique
is looping things in C, which is faster). I had some comments about the error message you received, but then realised that I copied down your error message incorrectly (I'm confused as to why the error message states if (dat$Allele_1[i] == dat$Allele_1[i + 1])
, but in your code it appears to be written if (Allele_1[i]==Allele_1[i+1])
-- is this an error?). In any case, I'm happy to show a looping example to do the equivalent of a multi-column call of unique
, if that would be useful?
Hi @bradduthie , thanks for your comprehensive response! It was really useful, and good to know about unique
, and what it can do.
So I think I may not have explained the final step very well. Other people I've spoken to also didn't get what I meant either so totally my fault. But the final step I need to do is: rather than pull out all unique values, once I have got R to look at each 'Marker' within each 'Sample_Name', I need it to look at the three values in column 'Allele_1' (each Allele will have 3 rows within each Marker), and only if they are the same number in all 3 rows, should R pull it out and put the information into the new dataframe (and it only needs to pull out 1 row, so that I know the sample name, marker, and the value). That's why my output dataframe in my original question has only 3 rows - because Sample_Number 1, Marker b does not have matching values in Allele_1, nor does it have matching values in Allele_2, so it gets excluded.
That's what my
if (dat$Allele_1[i]==dat$Allele_1[i+1]){ # if value in row i is the same as other rows
consensus <- dat[,c(2,6,8,9)]
was trying to do. I'm trying to tell R to see if the three values in the column 'Allele_1' (once separated by Sample_Name and Marker) are the same, and if they are, pull out a row. Although I know that code is wrong!
Does that make sense? Apologies if I am explaining it really poorly. Do let me know if I'm not being clear and I will try again!
@mattnuttall00 Ah! Now I see what you mean. I think I have a way that should work -- basically, use the unique
as I did, but then loop through each row of that unique output (e.g., my results for allele1
above), then (using another loop, perhaps -- could be a quicker way) count the number of rows for which the unique values are identical -- if at the end it's >2, then retain these rows.
Sorry, I'll try to give some more meaningful code later, but the idea is this:
allele1 <- unique(dat[,1:3]);
new_vector <- rep(0, length of allele1 rows)
for i in 1 to row number of allele1
for j in 1 in row number of original data table
if(allele1[row, col1] == original data table[row, col1] &
allele1[row, col2] == original data table[row, col2] &
allele1[row, col2] == original data table[row, col2]){
new_vector[i] <- new_vector[i] + 1;
}
}
}
If that makes sense?
How 'bout this! @mattnuttall00
library(tidyverse)
matt <- read_csv("matt.csv") %>% as_tibble() # reading in your data - ive changed the names
matt <- matt %>% select(sn, m, a1, a2) # select relevant cols - again, ive changed names
matt %>% group_by(sn, m, a1) %>% summarise(count = n()) %>% filter(count >= 3) # this returns a df with 1 row for each sn/m/a1 that has 3 matching values
sn m a1 count
1 Ma2 139 3
1 Mar21. 215 3
1 Mar64. 183 3
2 Ggu234 73 3
2 Ma2 133 3
2 Mar64. 183 3
then youd have to do the same again with Allele_2, then join_tables or bind_cols
Hey Matt,
Here is a different take - no for
loop necessary
The only thing I'm not sure about is why the third line of your desired result chose the row where Allete_2 was NA
rather than 148. If the NA is not important, try the first example, otherwise try the other one.
library(dplyr)
df %>%
select(Sample_Name, Marker, Allele_1, Allele_2) %>%
group_by(Sample_Name, Marker) %>%
filter(length(unique(Allele_1)) == 1) %>%
slice(1)
If the NA
is prefered, replace the filter line above with the code below. This example counts unique, non-NA's, sorts them to the top of the group then selects them. The sorting means the NA
will be chosen no matter where it occurs in the subset (I reckon there's probably a cleaner method than this, but I can't see it at the moment).
filter(length(unique(Allele_1)) == 1,
sum(!is.na(unique(Allele_2))) == 1) %>%
arrange(!is.na(Allele_2)) %>%
Another option would be to not report the NA
rows at all using the filter...
filter(length(unique(Allele_1)) == 1,
length(unique(Allele_2)) == 1) %>%
@bradduthie @anna-deasey @jmcvw Amazing - thanks so much to you all for a range of options! I will have a crack at all of them and see which is most appropriate. All approaches much appreciated!
@jmcvw the reason I would want the NA to be selected rather than 148 is because there are only two values of 148, therefore that allele doesn't fit the rule that to be selected there must be 3 matching values. And the structure of that output table was me assuming that entire rows would need to be selected, and because Allele_1 did meet the rule, there would need to be something under Allele_2 (and it couldn't be 148). Hope that makes sense!
Thanks again all :)
Hi all, I am going to attempt to post the first coding question! Apologies in advance if I don't quite nail the "how to post a good question", but please feel free to use this as an opportunity to offer suggestions for future users about how to post a good question. i.e. you can tell me what I could have done to make the question better - you won't hurt my feelings :)
I have a dataframe with ~10,000 observations of 35 variables. I have attached a much smaller subset as an example (attachment at the bottom). The data are outputs from genetic analyses. The structure of the subsetted data looks like this:
What I need to do is to first look at each unique 'Sample_Name', and then within that 'Sample_Name' look at each 'Marker'. And then within each 'Marker' (which are normally 3 rows in length), I want R to look row by row down the column 'Allele_1' and if the values are the same I want it to put one row of data into a new dataframe with some of the same column headers as the original data. I then need R to do the same thing, but for column 'Allele_2'.
So for example, if a portion of the original dataframe looked like this (I've excluded unnecessary cols):
I would want R to create the following dataframe:
I am guessing this needs to be done using a loop (but feel free to suggest an easier way!). I am very much a beginner with loops, but I have had a crack anyway. Surprise surprise, it hasn't worked, and I'm assuming I am wildly wrong, but what I have tried is below. (note: I have only tried to do it for Allele_1)
When I run the above code I get the error:
So I guess I am not getting that bit of code right? But potentially a lot of it isn't right...! Any guidance would most appreciated!
All PCRs Plex1 missing 002 PCR2.xlsx
John: In your example, what these rows? Are they supposed to be removed?