Open GuangchuangYu opened 5 years ago
you can move a little bit forward to split the mapping (the following part) into a function.
i <- grep(char, col_df$re_position)
for (j in i) {
## only cosider the situation without , in 're_gp'
## pls modify to consider ','
rr <- sum(r[strsplit(col_df$re_gp[j], '')[[1]]], na.rm=T)
if (rr > col_df$thred[j]) {
y[pos] <- col_df$colour[j]
## a flag to label that this condition is satisfied and no need to test other conditions
break
}
}
the rational is :
Although the function is quite small, you still need to bear in mind that implementing into different modules is far more better if there is something can be divided.
e.g. your source code is not only inefficient, but also hard to maintain as the code to determine the color of the characters is mixing with converting character vector to color vector.
Comparing your code with the following one.
xx <- apply(seq, 2, table) col_convert <- lapply(xx, function(x) { y <- rep("white", length(x)) names(y) <- names(x) r <- x/sum(x) for (pos in seq_along(x)) { char <- names(x)[pos] i <- grep(char, col_df$re_position) for (j in i) { ## only cosider the situation without , in 're_gp' ## pls modify to consider ',' rr <- sum(r[strsplit(col_df$re_gp[j], '')[[1]]], na.rm=T) if (rr > col_df$thred[j]) { y[pos] <- col_df$colour[j] ## a flag to label that this condition is satisfied and no need to test other conditions break } } } return(y) }) seqcolor <- lapply(seq_along(col_convert), function(i) { col_convert[[i]][seq[,i]] }) %>% do.call('cbind', .)
you can move a little bit forward to split the mapping (the following part) into a function.
i <- grep(char, col_df$re_position) for (j in i) { ## only cosider the situation without , in 're_gp' ## pls modify to consider ',' rr <- sum(r[strsplit(col_df$re_gp[j], '')[[1]]], na.rm=T) if (rr > col_df$thred[j]) { y[pos] <- col_df$colour[j] ## a flag to label that this condition is satisfied and no need to test other conditions break } ##consider the situation with , in 're_gp' rr1<-r[strsplit(col_df$re_gp[j], ',')[[1]]] if (any(rr1> col_df$thred[j],na.rm = T) ) { y[pos] <- col_df$colour[j] break } }
any progress?
the code is
any example of duplication? https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L37-L39
y
is repeatedly initialized when iterating xxi
, see https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L44.
prevent using variable like, x
, y
, xi
and xxi
, unless the code is short and the variable is temporary. Please use variable names that are more informative.
In R terminal, it is handy to use sapply
. But in productive code, using vapply
is far more safe than sapply
, see https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L43.
For unit testing, you should test whether your function can correctly assign color to a known character vector.
you can find many vapply
example in treeio.
> seq
[,1]
[1,] "A"
[2,] "H"
[3,] "H"
[4,] "W"
[5,] "W"
[6,] "Y"
[7,] "T"
[8,] "S"
[9,] "M"
[10,] "H"
> x
seq
A H M S T W Y
1 3 1 1 1 2 1
> re_pos
$A
[1] 1
$H [1] 22 23
$M [1] 1
$S [1] 16 17 18
$T [1] 16 17 18
$W [1] 1
$Y [1] 22 23
re_pos_unique $A [1] 1
$H [1] 22 23
$S [1] 16 17 18
re_pos_dup $M [1] 1
$T [1] 16 17 18
$W [1] 1
$Y [1] 22 23
unique_color A H S "blue" "cyan" "green" dup_color M T W Y "blue" "green" "blue" "cyan"
2. The re_gp is a list has so many elements and each elements has different length, so when i used vapply, i found it annoying and i justed used lapply to replace sapply. The latest color.scheme has been updated.
that's fine.
Yes, if you expect sapply
to return a list, use lapply
and if you want sapply
to return a vector, use vapply
.
y
is repeatedly initialized when iteratingxxi
, see https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L44.
The second point I mentioned is still there.
Any reason to add source code of https://github.com/YuLab-SMU/msacolor/blob/master/color.scheme/R/colorscheme.R#L23-L29?
matrix in and matrix out is expected:
> sample(c('A','C','G','T'), 10, replace=T) -> x
> seq = matrix(x, ncol=1)
> seq
[,1]
[1,] "G"
[2,] "C"
[3,] "A"
[4,] "A"
[5,] "T"
[6,] "A"
[7,] "A"
[8,] "T"
[9,] "C"
[10,] "A"
> apply(seq, 2, table)
[,1]
A 5
C 2
G 1
T 2
Comparing the situation about seq with a column and more than a column, and i found that the seqlist should be a list , otherwise after col_convert it would get NULL.
> seq<-testseq
> seq <- toupper(seq)
> seqlist <- apply(seq, 2, table)
> seqlist
$V180
M P S T
1 1 2 5
$V181
A D E Q
2 1 5 1
$V182
A E T
2 6 1
$V183
D E
2 7
$V184
H K Q R
4 3 1 1
$V185
A D K Q
3 1 3 2
$V186
T V
6 3
> is.list(seqlist)
[1] TRUE
> sample(c('A','C','G','T'), 10, replace=T) -> x
> seq = matrix(x, ncol=1)
> seq
[,1]
[1,] "T"
[2,] "C"
[3,] "A"
[4,] "C"
[5,] "T"
[6,] "G"
[7,] "C"
[8,] "C"
[9,] "T"
[10,] "G"
> seqlist <- apply(seq, 2, table)
> seqlist
[,1]
A 1
C 4
G 2
T 3
> is.list(seqlist)
[1] FALSE
> col_convert
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
Adding the code that has been mentioned makes sure it is a list. The result is shown as followed.
> seqlist
[[1]]
A C G T
1 4 2 3
> is.list(seqlist)
[1] TRUE
> col_convert
[[1]]
A C G T
"white" "white" "orange" "white"
The code has been updated.
then maybe:
> seq
[,1]
[1,] "A"
[2,] "A"
[3,] "A"
[4,] "T"
[5,] "A"
[6,] "C"
[7,] "C"
[8,] "C"
[9,] "G"
[10,] "T"
> lapply(as.data.frame(seq), table)
$V1
A C G T
4 3 1 2
ok, i got it.
I think you can move on to implement other color schemes.
No need to separate them in two different pkgs.
I would also expect a union function call to all of these color schemes, e.g.
msa_color_sheme <- function(sequences, scheme) {
switch(scheme,
...)
}
PS: you can use named vector to speed up your code, https://github.com/YuLab-SMU/msacolor/blob/master/colorscheme2/R/colorscheme2.R.
A col
named vector derived from your chemi_aa_coldf
.
> col = rep(chemi_aa_coldf$color, times = nchar(as.character(chemi_aa_coldf$amino_acid)))
> names(col) = unlist(strsplit(as.character(chemi_aa_coldf$amino_acid), ''))
My input is sequence matrix and lookup table col
(cached).
> matrix(col[toupper(as.matrix(sampleseq))], ncol = ncol(sampleseq)) -> myres
your input is sequence matrix and lookup table chemi_aa_coldf
(cached).
> color_scheme_2(sampleseq,chemi_aa_coldf) -> yourres
We can verify that they output identical result:
> identical(yourres, myres)
[1] TRUE
my code is simple and faster (>5X).
> microbenchmark(
+ your_code = color_scheme_2(sampleseq,chemi_aa_coldf),
+ my_code = matrix(col[toupper(as.matrix(sampleseq))], ncol = ncol(sampleseq)),
+ times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
your_code 1280.502 1313.139 1390.6435 1354.128 1368.8655 5884.107 1000
my_code 230.346 247.429 271.6952 252.868 263.3085 1612.332 1000
all the colorschemes have been emerged in "msacolorscheme"
Although the function is quite small, you still need to bear in mind that implementing into different modules is far more better if there is something can be divided.
e.g. your source code is not only inefficient, but also hard to maintain as the code to determine the color of the characters is mixing with converting character vector to color vector.
Comparing your code with the following one.