RockefellerUniversity / Intro_To_R_1Day

Introduction to R training course
https://rockefelleruniversity.github.io/Intro_To_R_1Day/

Question about the by argument in the merge function #15

Open imarin79 opened 4 years ago

imarin79 commented 4 years ago

Hi Matt,

Thanks again for the great presentation last Friday. I am currently working through the exercises on factors and data frames, and I have a question about the "by" argument of the merge function. Specifically, in the question "Create a data frame containing only those gene names common to all data frames with all information from Annotation and the expression from Sample 1 and Sample 2", I do not quite understand the meaning of by.x=2 and by.y=1. Do they refer to the number of columns to merge between Sample 1 and Sample 2? Which columns are those? Many thanks

ThomasCarroll commented 4 years ago

Hi Isaac,

The by argument in the merge function specifies the column used to match rows between the two data.frames when merging. It can be given either as a column position or as a column name.

> expressDF <-  data.frame(Genes=c("PTBP1","PTBP2","PTBP3"),
+                          expression=c(10,100,200))
> 
> lenDF <-  data.frame(Genes=c("PTBP1","PTBP2","PTBP3"),
+                          length=c(10000,1020,200000))
> 
> 
> merge(expressDF,lenDF,by=1)
  Genes expression length
1 PTBP1         10  10000
2 PTBP2        100   1020
3 PTBP3        200 200000
> 
> merge(expressDF,lenDF,by="Genes")
  Genes expression length
1 PTBP1         10  10000
2 PTBP2        100   1020
3 PTBP3        200 200000
> 
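
As a side note, when by is left out, merge falls back to the column names the two data.frames have in common. The sketch below reuses the data.frames above to illustrate this; the variable names sharedCols, mergedDefault and mergedByName are only illustrative.

```r
expressDF <- data.frame(Genes = c("PTBP1", "PTBP2", "PTBP3"),
                        expression = c(10, 100, 200))
lenDF <- data.frame(Genes = c("PTBP1", "PTBP2", "PTBP3"),
                    length = c(10000, 1020, 200000))

# Column names shared by both data.frames; merge() matches on these
# when no by argument is supplied.
sharedCols <- intersect(names(expressDF), names(lenDF))

# Omitting by here matches on the shared "Genes" column.
mergedDefault <- merge(expressDF, lenDF)

# Naming the column explicitly is equivalent in this case.
mergedByName <- merge(expressDF, lenDF, by = "Genes")
```

Spelling out by is usually safer, since two data.frames can share a column name that you did not intend to match on.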

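When the matching column sits at a different position or has a different name in each data.frame, we specify by.x for the column to match on in the first data.frame and by.y for the column to match on in the second. So by.x=2 and by.y=1 means that column 2 of the first data.frame is matched against column 1 of the second.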

> expressDF <-  data.frame(Genes=c("PTBP1","PTBP2","PTBP3"),
+                          expression=c(10,100,200))
> 
> lenDF <-  data.frame(IDS=c("ID121","ID122","ID123"),
+                      Symbols=c("PTBP1","PTBP2","PTBP3"),
+                          length=c(10000,1020,200000))
> 
> 
> merge(expressDF,lenDF,by.x=1,by.y=2)
  Genes expression   IDS length
1 PTBP1         10 ID121  10000
2 PTBP2        100 ID122   1020
3 PTBP3        200 ID123 200000
> 
> merge(expressDF,lenDF,by.x="Genes",by.y="Symbols")
  Genes expression   IDS length
1 PTBP1         10 ID121  10000
2 PTBP2        100 ID122   1020
3 PTBP3        200 ID123 200000
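
To relate this back to by.x=2 and by.y=1 in the exercise, here is a minimal sketch. The data.frames annotationDF and sampleDF are made up for illustration and are not the course's files. It assumes the gene names sit in column 2 of the first data.frame and column 1 of the second. It also shows that merge by default keeps only the rows whose key appears in both data.frames, which is what "common to all data frames" asks for.

```r
# Hypothetical data, not the course's exercise files.
annotationDF <- data.frame(IDS = c("ID121", "ID122", "ID123"),
                           Symbols = c("PTBP1", "PTBP2", "PTBP3"))
sampleDF <- data.frame(Genes = c("PTBP2", "PTBP3", "PTBP4"),
                       expression = c(100, 200, 50))

# Match column 2 of the first data.frame (Symbols)
# against column 1 of the second (Genes).
# Only genes present in both are kept: PTBP2 and PTBP3.
common <- merge(annotationDF, sampleDF, by.x = 2, by.y = 1)

# all = TRUE also keeps the unmatched rows, filling the gaps with NA.
everything <- merge(annotationDF, sampleDF, by.x = 2, by.y = 1, all = TRUE)
```

In the merged result the key column comes first and takes its name from the first data.frame (Symbols here), followed by the remaining columns of each input.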