FarrellDay / miceRanger

miceRanger: Fast Imputation with Random Forests in R

Unicode characters in data column names throw an error in naWhere #15

Open drag05 opened 2 years ago

drag05 commented 2 years ago

I have the following data:

> head(htc, 2)
      25 µL      50 µL     75 µL    100 µL  Accession
1: 1.265836 0.02575365 0.1428066 0.2107820 A0A024R6I7
2:       NA 0.01566025 0.1481060 0.2069585 A0A075B6K4

> dim(htc)
[1] 269   5

> htc[, colSums(is.na(.SD))]
    25 µL     50 µL     75 µL    100 µL Accession 
      200         0         3         0         0 

with the following associated naWhere, varp and varn:

> naWhere[1:4, ]
     25 µL 50 µL 75 µL 100 µL Accession
[1,] FALSE FALSE FALSE  FALSE     FALSE
[2,]  TRUE FALSE FALSE  FALSE     FALSE
[3,]  TRUE FALSE FALSE  FALSE     FALSE

> dim(naWhere)
[1] 269   5

> colSums(naWhere)
    25 µL     50 µL     75 µL    100 µL Accession 
      200         0         3         0         0 

> varp <- unique(unlist(vars))
> varp
[1] "50 μL"     "75 μL"     "100 μL"    "Accession" "25 μL"   ## maybe apply gtools::mixedsort ?

> varn
[1] "25 μL" "75 μL"

Calculating the left-out columns throws the following error:

leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0

"Error in naWhere[, varp] : subscript out of bounds"

Checking varp against colnames(naWhere):

identical(varp, colnames(naWhere))
FALSE

> intersect(varp, colnames(naWhere))
[1] "Accession"

> varp %in% colnames(naWhere)
[1] FALSE FALSE FALSE  TRUE FALSE

> which(varp %in% colnames(naWhere)) ## only "Accession" matches
[1] 4
> which(colnames(naWhere) %in% varp) ## only "Accession" matches
[1] 5

It seems to still be working when comparing varp against varn:

> !varp %in% varn
[1]  TRUE FALSE  TRUE  TRUE FALSE
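
A quick way to check whether two names that print alike really contain the same characters is to compare their code points. In the output above, the data columns are printed with `µ` while varp is printed with `μ`; these may be MICRO SIGN (U+00B5) and GREEK SMALL LETTER MU (U+03BC), although the thread does not confirm which code points the original session held. The sketch below uses only base R; `code_points` is a hypothetical helper, not part of miceRanger, and the two names are illustrative.

```r
# Hypothetical helper: list the Unicode code points of each string.
code_points <- function(x) {
  vapply(
    enc2utf8(x),
    function(s) paste(sprintf("U+%04X", utf8ToInt(s)), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
}

micro_name <- "25 \u00B5L"  # contains MICRO SIGN
mu_name    <- "25 \u03BCL"  # contains GREEK SMALL LETTER MU

# The two names differ only in the fourth code point (U+00B5 vs U+03BC).
code_points(c(micro_name, mu_name))

# Name-based matching therefore treats them as different names.
mu_name %in% micro_name
```

Running `code_points(varp)` and `code_points(colnames(naWhere))` on the real objects would show whether this is what happens here.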

The error seems to be caused by the presence of Unicode characters in the names, although they pose no problem when comparing varp against varn, as shown by the last line of code above. However, using either seq_along or base::enc2native seems to remove the error:

leftOut <- !varp %in% varn & colSums(naWhere[, seq(along=varp)]) > 0

> leftOut
    25 µL     50 µL     75 µL    100 µL Accession 
     TRUE     FALSE      TRUE     FALSE     FALSE 

> varp = enc2native(varp)
> leftOut <- !varp %in% varn & colSums(naWhere[, varp]) > 0
> leftOut
    50 µL     75 µL    100 µL Accession     25 µL 
    FALSE      TRUE     FALSE     FALSE      TRUE 
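
Positional indexing such as `naWhere[, seq(along = varp)]` avoids the lookup by name, but it pairs the elements of varp with columns in the order they appear in naWhere, which differs from the order of varp shown earlier. A possible alternative, sketched below with toy data, is to normalise the names before matching and then index by the matched positions. The helper `normalize_mu` and the toy objects are assumptions for illustration; whether the mismatch in the original session is MICRO SIGN versus Greek mu is not confirmed in this thread.

```r
# Hypothetical helper: map MICRO SIGN (U+00B5) to GREEK SMALL LETTER MU (U+03BC).
normalize_mu <- function(x) gsub("\u00B5", "\u03BC", enc2utf8(x), fixed = TRUE)

# Toy stand-ins for naWhere, varp and varn (not the data from this issue).
na_where <- matrix(
  c(TRUE, FALSE,    # column 1: one missing value
    FALSE, FALSE),  # column 2: no missing values
  nrow = 2,
  dimnames = list(NULL, c("25 \u00B5L", "50 \u00B5L"))
)
varp <- c("50 \u03BCL", "25 \u03BCL")  # different order, different mu
varn <- character(0)

# Match on normalised names, then index by position so that each element
# of varp is paired with its own column.
idx <- match(normalize_mu(varp), normalize_mu(colnames(na_where)))
stopifnot(!anyNA(idx))

left_out <- !varp %in% varn & colSums(na_where[, idx, drop = FALSE]) > 0
left_out
```

If varn can also carry a different variant of the character, the same normalisation would need to be applied to both sides of `%in%`.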

Please advise, thank you!

samFarrellDay commented 2 years ago

Would you mind sending me the data? I'll probably implement the seq_along fix if everything else works as intended. I foresee several areas that will need to be fixed to handle unicode characters.

drag05 commented 2 years ago

@samFarrellDay I do not own the data, but I could make an artificial set and post it here. So far I have found that Unicode also affects the diagnostic plots.

Unicode support would be really useful for documents and Shiny. Otherwise, column names could be changed for the purpose of imputation and then changed back to Unicode for presentation, although working in Unicode throughout would avoid that overhead.
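
The rename-and-restore workaround mentioned above could look roughly like the following. This is a minimal sketch in base R with a toy data frame; the placeholder names (`V1`, `V2`) are arbitrary, and the imputation step itself is left as a comment rather than guessed at.

```r
# Toy data frame with Unicode column names (not the data from this issue).
df <- data.frame(x = c(1.2, NA), y = c(0.5, 0.7))
names(df) <- c("25 \u03BCL", "50 \u03BCL")

# Keep the original names and swap in ASCII placeholders.
orig_names <- names(df)
safe_names <- paste0("V", seq_along(orig_names))
names(df) <- safe_names

# ... run the imputation on `df` here ...

# Restore the original names by lookup, in case the column order changed.
names(df) <- orig_names[match(names(df), safe_names)]
names(df)
```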

drag05 commented 2 years ago

@samFarrellDay The script below generates a data.table with missing values and Unicode characters. One observation: Unicode characters are converted and displayed correctly only if they are defined inside the data.table environment.

# generate a data table containing NA values
require(data.table)
L = 1000L
x = list(a = sample(c(runif(L, -1L, 1L), rep(NA, L)), L)
       , b = sample(c(rnorm(L, 1L, 3L), rep(NA, L %/% 2L)), L) 
       , c = sample(rep(1:2, each = 2L), L, replace = TRUE))
dt = as.data.table(x)

# convert column "c" to Unicode characters
dt[, c := ifelse(c == 1L, '25 \u03BCL', '50 \u03BCL')]

# rename dt
setnames(dt, c('Treat \u03B1', 'Treat \u03B2', 'Sample')) 

> dt
          Treat a     Treat ß Sample
   1:          NA          NA  50 µL
   2:          NA          NA  50 µL
   3: -0.86576094        1.12  50 µL
   4:          NA          NA  50 µL

# obs: names(dt) reads Greek "alpha" ('\u03B1') as Latin character "a" 

The script converted the "c" vector from list x to Unicode inside the data.table. If I had done this in list x and then converted the list to a data.table, as.data.table would not have read the characters as Unicode. Example:

# alternative

# Unicode Greek letters
greek = c('\u03B1', '\u03B2', '\u03B3', '\u03B4', '\u03B5', '\u03B6', '\u03B7', '\u03B8', '\u03B9',
           '\u03BA', '\u03BB', '\u03BC', '\u03BD', '\u03BE', '\u03BF', '\u03C0', '\u03C1', '\u03C3',
           '\u03C4', '\u03C5', '\u03C6', '\u03C7', '\u03C8', '\u03C9')

# generate list with missing values and Unicode characters
L = 20L
x = list(
         a = sample(c(runif(L, -1L, 1L), rep(NA, L)), L)
       , b = sample(c(rnorm(L, 1L, 3L), rep(NA, L %/% 2L)), L) 
       , c = sample(
                     c(replicate(L, paste0(sample(c(greek, letters, 1:9)
                                         ,  size = 4L, replace = TRUE), collapse = ''))
                                , rep(NA, times = L)) , size = L)
  )

# convert list to data.table
dt = as.data.table(x)

> dt
               a          b                   c
 1:  0.706300090 -0.2082637                <NA>
 2:           NA -1.4747307                <NA>
 3:           NA         NA      <U+03B7>o4<U+03C1>9  <--- not read as Greek letters!
 4: -0.855431452 -0.8188787                <NA>
 5: -0.443747398  2.7301625                <NA>
 6:           NA         NA               2twzz
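
Output such as `<U+03B7>` is typically how R escapes characters that it cannot render in the session's native encoding when printing, so the stored strings may still be intact. A minimal sketch, using only base R, for checking the stored content independently of how it prints; the variable `s` is illustrative and this does not reproduce the session above.

```r
s <- "\u03B7o4\u03C19"  # the string printed as <U+03B7>o4<U+03C1>9 above

l10n_info()[["UTF-8"]]   # whether the session runs in a UTF-8 locale
Encoding(s)              # declared encoding of the string
validUTF8(s)             # whether the bytes form valid UTF-8
utf8ToInt(enc2utf8(s))   # code points of the stored characters
```

If the code points come back as expected, the problem is more likely in display or in conversion to the native encoding than in the data itself.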

Thank you!