Chris1221 / coRge

Evaluation of Simultaneous Inference Methods for the Human Genome.
http://chrisbcole.me/coRge/
Apache License 2.0

Failed attempts at speeding up reading (i.e. things which are not faster than fread). #13

Closed Chris1221 closed 8 years ago

Chris1221 commented 8 years ago

Attempts at speeding up the reading and converting of .gen files.

This thread exists as a warning and reminder to myself of how truly awful I am at programming.

Chris1221 commented 8 years ago

Reading the files in through sqldf does not help. It is slower than fread by at least a factor of 10.


        library(sqldf)

        for (k in 1:5) {
          # sqldf reads from a file connection; the connection's variable name is the table name.
          f <- file(paste0(path, "chr1_block_", i, "_perm_", j, "_k_", k, ".controls.gen"))

          # .gen files are space separated and have no header row.
          chunk <- sqldf("select * from f", dbname = tempfile(),
                         file.format = list(header = FALSE, row.names = FALSE, sep = " "))

          if (k == 1) {
            gen <- chunk
          } else {
            # Join each later chunk on the five leading annotation columns V1..V5.
            gen <- merge(gen, chunk, by = paste0("V", 1:5))
          }
        }
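For comparison, here is a minimal sketch of the same read-and-merge step done with fread, the baseline this thread measures against. The file names, the two toy chunks, and the helper `read_chunk` are hypothetical and only there to make the example self-contained; the real files follow the `chr1_block_..._k_...controls.gen` pattern above. Probability columns are renamed per chunk so that repeated merges do not collide on column names.

```r
library(data.table)

# Write two small hypothetical .gen chunks to a temporary directory.
dir <- tempfile("gen_")
dir.create(dir)
writeLines(c("1 rs1 100 A G 1 0 0", "1 rs2 200 C T 0 1 0"),
           file.path(dir, "k_1.controls.gen"))
writeLines(c("1 rs1 100 A G 0 0 1", "1 rs2 200 C T 1 0 0"),
           file.path(dir, "k_2.controls.gen"))

files <- file.path(dir, paste0("k_", 1:2, ".controls.gen"))
key_cols <- paste0("V", 1:5)  # the five annotation columns

# Hypothetical helper: read one chunk and tag its probability columns with k.
read_chunk <- function(file, k) {
  dt <- fread(file, header = FALSE, sep = " ")
  prob_cols <- setdiff(names(dt), key_cols)
  setnames(dt, prob_cols, paste0(prob_cols, "_k", k))
  dt
}

chunks <- Map(read_chunk, files, seq_along(files))

# Merge all chunks on the annotation columns.
gen <- Reduce(function(x, y) merge(x, y, by = key_cols), chunks)
```

The result has one row per SNP and one block of probability columns per chunk.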
Chris1221 commented 8 years ago

Reading the file line by line with readLines was at least 100 times slower than fread.

library(data.table)
library(foreach)

inputFile <- "../inst/extdata/toy.gen"

system.time({
con  <- file(inputFile, open = "r")

out <- data.table(ID = 1:1000)

while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  myVector <- unlist(strsplit(oneLine, " "))

  # Fields 1-5 are annotation; each sample then has three genotype probabilities.
  vec <- foreach(i = seq(6, length(myVector) - 2, by = 3), .combine = c) %do% {

    one <- as.numeric(myVector[i])
    two <- as.numeric(myVector[i + 1])
    three <- as.numeric(myVector[i + 2])

    # Hard-call the genotype when one probability exceeds 0.9, otherwise NA.
    if (one > 0.9) {
      0
    } else if (two > 0.9) {
      1
    } else if (three > 0.9) {
      2
    } else {
      NA
    }
  }

  # Add one column per line of the file, named by the third field.
  out[, (myVector[3]) := vec]
  message(ncol(out))
}

close(con)
})
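For contrast, a minimal sketch of the same hard-calling done without any per-element loop: read everything once with fread, then threshold whole matrices. The two toy SNPs are made up for illustration, and the layout (five annotation fields followed by three probabilities per sample) is assumed from the loop above. This is not what coRge does, only one way the 0.9 rule could be vectorised.

```r
library(data.table)

# Two hypothetical SNPs, two samples each:
# five annotation fields, then three genotype probabilities per sample.
toy <- c("1 rs1 100 A G 1 0 0 0 0.95 0.05",
         "1 rs2 200 C T 0 0 1 0.4 0.3 0.3")

gen <- fread(text = toy, sep = " ", header = FALSE)

# Probability columns as a numeric matrix (rows = SNPs).
probs <- as.matrix(gen[, 6:ncol(gen), with = FALSE])
storage.mode(probs) <- "double"
n_samples <- ncol(probs) / 3

# One matrix per genotype class, taking every third column.
p0 <- probs[, seq(1, by = 3, length.out = n_samples), drop = FALSE]
p1 <- probs[, seq(2, by = 3, length.out = n_samples), drop = FALSE]
p2 <- probs[, seq(3, by = 3, length.out = n_samples), drop = FALSE]

# Same 0.9 threshold as the loop; assigning in reverse order keeps the
# loop's precedence (0 wins over 1, which wins over 2).
calls <- matrix(NA_real_, nrow = nrow(probs), ncol = n_samples)
calls[p2 > 0.9] <- 2
calls[p1 > 0.9] <- 1
calls[p0 > 0.9] <- 0
rownames(calls) <- gen$V2
```

Here `calls` has one row per SNP and one column per sample, with NA wherever no probability clears the threshold.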
Chris1221 commented 8 years ago

The above was also slower when run in parallel, i.e. after

library(doParallel)
cl <- makeCluster(8)
registerDoParallel(cl)

and replacing %do% with %dopar%.

Chris1221 commented 8 years ago

Reading lines one at a time with coRge::gen2R was really bad.

Chris1221 commented 8 years ago

Chaining rows together with %:% was equally disastrous. Don't go down this path.

Chris1221 commented 8 years ago

foreach with .combine = 'rbind' or .combine = 'c' was just insanely slow.

Chris1221 commented 8 years ago

Issue #19 might do it