YuLab-SMU / msacolor

4 stars 0 forks source link

divide and conquer #2

Open GuangchuangYu opened 5 years ago

GuangchuangYu commented 5 years ago

Although the function is quite small, you still need to bear in mind that implementing into different modules is far more better if there is something can be divided.

e.g. your source code is not only inefficient, but also hard to maintain as the code to determine the color of the characters is mixing with converting character vector to color vector.

Comparing your code with the following one.

      xx <- apply(seq, 2, table)

      col_convert <- lapply(xx, function(x) {
          y <- rep("white", length(x))
          names(y) <- names(x)
          r <- x/sum(x)
          for (pos in seq_along(x)) {
              char <- names(x)[pos]
              i <- grep(char, col_df$re_position)
              for (j in i) {
                  ## only cosider the situation without , in 're_gp'
                  ## pls modify to consider ','
                  rr <- sum(r[strsplit(col_df$re_gp[j], '')[[1]]], na.rm=T)
                  if (rr > col_df$thred[j]) {
                      y[pos] <- col_df$colour[j]
                      ## a flag to label that this condition is satisfied and no need to test other conditions
                      break
                  }
              }
          }
          return(y)
      })

     seqcolor <- lapply(seq_along(col_convert), function(i) {
          col_convert[[i]][seq[,i]]
      }) %>% do.call('cbind', .)
GuangchuangYu commented 5 years ago

you can move a little bit forward to split the mapping (the following part) into a function.

              i <- grep(char, col_df$re_position)
              for (j in i) {
                  ## only cosider the situation without , in 're_gp'
                  ## pls modify to consider ','
                  rr <- sum(r[strsplit(col_df$re_gp[j], '')[[1]]], na.rm=T)
                  if (rr > col_df$thred[j]) {
                      y[pos] <- col_df$colour[j]
                      ## a flag to label that this condition is satisfied and no need to test other conditions
                      break
                  }
              }
GuangchuangYu commented 5 years ago

the rational is :

huanghuinana commented 5 years ago

Although the function is quite small, you still need to bear in mind that implementing into different modules is far more better if there is something can be divided.

e.g. your source code is not only inefficient, but also hard to maintain as the code to determine the color of the characters is mixing with converting character vector to color vector.

Comparing your code with the following one.

      xx <- apply(seq, 2, table)

      col_convert <- lapply(xx, function(x) {
          y <- rep("white", length(x))
          names(y) <- names(x)
          r <- x/sum(x)
          for (pos in seq_along(x)) {
              char <- names(x)[pos]
              i <- grep(char, col_df$re_position)
              for (j in i) {
                  ## only cosider the situation without , in 're_gp'
                  ## pls modify to consider ','
                  rr <- sum(r[strsplit(col_df$re_gp[j], '')[[1]]], na.rm=T)
                  if (rr > col_df$thred[j]) {
                      y[pos] <- col_df$colour[j]
                      ## a flag to label that this condition is satisfied and no need to test other conditions
                      break
                  }
              }
          }
          return(y)
      })

     seqcolor <- lapply(seq_along(col_convert), function(i) {
          col_convert[[i]][seq[,i]]
      }) %>% do.call('cbind', .)

you can move a little bit forward to split the mapping (the following part) into a function.

              i <- grep(char, col_df$re_position)
              for (j in i) {
                  ## only cosider the situation without , in 're_gp'
                  ## pls modify to consider ','
                  rr <- sum(r[strsplit(col_df$re_gp[j], '')[[1]]], na.rm=T)
                  if (rr > col_df$thred[j]) {
                      y[pos] <- col_df$colour[j]
                      ## a flag to label that this condition is satisfied and no need to test other conditions
                      break
                  }
##consider the situation with , in 're_gp'
rr1<-r[strsplit(col_df$re_gp[j], ',')[[1]]]
if (any(rr1> col_df$thred[j],na.rm = T) ) {
y[pos] <- col_df$colour[j]
break
}
              }
GuangchuangYu commented 5 years ago

any progress?

huanghuinana commented 5 years ago

the code is

  1. determine color based on alignment column (1) calculate character frequency (2)get the matched lines of character in col_df (3)divide the character into two parts based on whether the matched lines are duplicated, which is stored in "xxi" and "dup_aa" respectively (4)assign color to character of "xxi" based from frequency (5)assign color to character of "dup_aa" based from the character in "xxi" shared the same matched lines (6)combine the color of two part characters
  2. map alignment to a color matrix
GuangchuangYu commented 5 years ago
  1. any example of duplication? https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L37-L39

  2. y is repeatedly initialized when iterating xxi, see https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L44.

  3. prevent using variable like, x, y, xi and xxi, unless the code is short and the variable is temporary. Please use variable names that are more informative.

  4. In R terminal, it is handy to use sapply. But in productive code, using vapply is far more safe than sapply, see https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L43.

  5. For unit testing, you should test whether your function can correctly assign color to a known character vector.

GuangchuangYu commented 5 years ago

you can find many vapply example in treeio.

huanghuinana commented 5 years ago
  1. The example of duplicated as followed.
    (1)x is the frenquency of seq (2)re_pos stores the the matched lines of each character of x in col_df. (3)re_pos_unique stores the unique element of re_pos. (4)re_pos_dup stores the remaining elements after removing the unique element of re_pos to re_pos_unique. (5)unique_color stores the color assigned to character based from frequency. (6)dup_color stores the color assigned to character based from the color of character shared the same elements in unique_color.
    
    > seq
      [,1]
    [1,] "A" 
    [2,] "H" 
    [3,] "H" 
    [4,] "W" 
    [5,] "W" 
    [6,] "Y" 
    [7,] "T" 
    [8,] "S" 
    [9,] "M" 
    [10,] "H" 
    > x
    seq
    A H M S T W Y 
    1 3 1 1 1 2 1 
    > re_pos
    $A
    [1] 1

$H [1] 22 23

$M [1] 1

$S [1] 16 17 18

$T [1] 16 17 18

$W [1] 1

$Y [1] 22 23

re_pos_unique $A [1] 1

$H [1] 22 23

$S [1] 16 17 18

re_pos_dup $M [1] 1

$T [1] 16 17 18

$W [1] 1

$Y [1] 22 23

unique_color A H S "blue" "cyan" "green" dup_color M T W Y "blue" "green" "blue" "cyan"


2. The re_gp is a list has so many elements and each elements has different length, so when i used vapply, i found it annoying and i justed used lapply to replace sapply. The latest color.scheme has been updated.
GuangchuangYu commented 5 years ago

that's fine.

Yes, if you expect sapply to return a list, use lapply and if you want sapply to return a vector, use vapply.

GuangchuangYu commented 5 years ago
  1. y is repeatedly initialized when iterating xxi, see https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L44.

The second point I mentioned is still there.

Any reason to add source code of https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L23-L29?

matrix in and matrix out is expected:

> sample(c('A','C','G','T'), 10, replace=T) -> x                 
> seq = matrix(x, ncol=1)                                        
> seq                                                            
      [,1]                                                       
 [1,] "G"                                                        
 [2,] "C"                                                        
 [3,] "A"                                                        
 [4,] "A"                                                        
 [5,] "T"                                                        
 [6,] "A"                                                        
 [7,] "A"                                                        
 [8,] "T"                                                        
 [9,] "C"                                                        
[10,] "A"                                                        
> apply(seq, 2, table)                                           
  [,1]                                                           
A    5                                                           
C    2                                                           
G    1                                                           
T    2                                                           
huanghuinana commented 5 years ago

Comparing the situation about seq with a column and more than a column, and i found that the seqlist should be a list , otherwise after col_convert it would get NULL.

> seq<-testseq
> seq <- toupper(seq)
> seqlist <- apply(seq, 2, table)
> seqlist
$V180

M P S T 
1 1 2 5 

$V181

A D E Q 
2 1 5 1 

$V182

A E T 
2 6 1 

$V183

D E 
2 7 

$V184

H K Q R 
4 3 1 1 

$V185

A D K Q 
3 1 3 2 

$V186

T V 
6 3 
> is.list(seqlist)
[1] TRUE
> sample(c('A','C','G','T'), 10, replace=T) -> x 
> seq = matrix(x, ncol=1) 
> seq
      [,1]
 [1,] "T" 
 [2,] "C" 
 [3,] "A" 
 [4,] "C" 
 [5,] "T" 
 [6,] "G" 
 [7,] "C" 
 [8,] "C" 
 [9,] "T" 
[10,] "G" 
> seqlist <- apply(seq, 2, table) 
> seqlist
  [,1]
A    1
C    4
G    2
T    3
> is.list(seqlist)
[1] FALSE
> col_convert
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

Adding the code that has been mentioned makes sure it is a list. The result is shown as followed.

> seqlist
[[1]]
A C G T 
1 4 2 3 

> is.list(seqlist)
[1] TRUE
> col_convert
[[1]]
       A        C        G        T 
 "white"  "white" "orange"  "white" 
huanghuinana commented 5 years ago

The code has been updated.

GuangchuangYu commented 5 years ago

then maybe:

> seq
      [,1]
 [1,] "A"
 [2,] "A"
 [3,] "A"
 [4,] "T"
 [5,] "A"
 [6,] "C"
 [7,] "C"
 [8,] "C"
 [9,] "G"
[10,] "T"
> lapply(as.data.frame(seq), table)
$V1

A C G T
4 3 1 2
huanghuinana commented 5 years ago

ok, i got it.

GuangchuangYu commented 5 years ago

I think you can move on to implement other color schemes.

huanghuinana commented 5 years ago
  1. Other schemes which are similar to one another in form are saved in colorscheme2, the data frames of color schemes based in chemistry, shapely, Taylor and Zappo are saved in colorscheme2/data respectively.
  2. The argument "scheme_df" of function "color_scheme_2" can be selected from different color schemes in colorscheme2/data to match character with color based on different color schemes.
  3. Each color scheme about amino acids has a test dataset named testseq. Their results about color are saved in colorscheme2/data respectively. The unit tests of aboved-mentioned 4 color schemes about amino acids and nucleotides have been saved in 4 files.
  4. The function "color_scheme_2" is : 1> determine color based on non-duplicated characters: 1) transform the data.frame of character into a matrix 2) remain the non-duplicated characters and save them in a vector named seq_unique 3) get the matched lines of character in scheme_df 4) assign color to character of seq_unique based on their properties 2> map alignment to a color matrix
GuangchuangYu commented 5 years ago

No need to separate them in two different pkgs.

I would also expect a union function call to all of these color schemes, e.g.

msa_color_sheme <- function(sequences, scheme) {
    switch(scheme, 
               ...)         
}

PS: you can use named vector to speed up your code, https://github.com/YuLab-SMU/msacolor/blob/master/colorscheme2/R/colorscheme2.R.

A col named vector derived from your chemi_aa_coldf.

> col = rep(chemi_aa_coldf$color, times = nchar(as.character(chemi_aa_coldf$amino_acid)))
> names(col) = unlist(strsplit(as.character(chemi_aa_coldf$amino_acid), ''))

My input is sequence matrix and lookup table col (cached).

> matrix(col[toupper(as.matrix(sampleseq))], ncol = ncol(sampleseq)) -> myres

your input is sequence matrix and lookup table chemi_aa_coldf (cached).

> color_scheme_2(sampleseq,chemi_aa_coldf) -> yourres

We can verify that they output identical result:

> identical(yourres, myres)
[1] TRUE

my code is simple and faster (>5X).

> microbenchmark(
+ your_code = color_scheme_2(sampleseq,chemi_aa_coldf),
+ my_code = matrix(col[toupper(as.matrix(sampleseq))], ncol = ncol(sampleseq)),
+ times = 1000)
Unit: microseconds
      expr      min       lq      mean   median        uq      max neval
 your_code 1280.502 1313.139 1390.6435 1354.128 1368.8655 5884.107  1000
   my_code  230.346  247.429  271.6952  252.868  263.3085 1612.332  1000
huanghuinana commented 5 years ago

all the colorschemes have been emerged in "msacolorscheme"