PeteHaitch / cometh

An R package with tools for analysing, managing and visualising co-methylation data. Loosely speaking, co-methylation is the correlation structure of DNA methylation.

Faster read.comethylation #18

Closed PeteHaitch closed 10 years ago

PeteHaitch commented 10 years ago

Try replacing scan with data.table::fread in read.comethylation. If there is a significant speed-up then either import the data.table package, or write a bespoke function that imitates fread for cometh.tsv files.
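For concreteness, a rough sketch (not the package code) of how a fread-based reader might look, assuming an already-uncompressed cometh.tsv whose first column is character and remaining columns are integers (the same layout as the scan template below); read_cometh_fread is just a placeholder name:

library(data.table)

read_cometh_fread <- function(file) {
  # Peek at the header line to work out how many columns there are
  column_headers <- strsplit(readLines(file, n = 1), "\t")[[1]]
  # First column is character, the rest are integers
  col_classes <- c("character", rep("integer", length(column_headers) - 1))
  fread(input = file, sep = "\t", header = TRUE, colClasses = col_classes,
        na.strings = "NA")
}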

PeteHaitch commented 10 years ago

A scan approach:

file <- "~/Desktop/E18BUF.CG.1.tsv.gz"

if (grepl("\\.gz$", file)){
  con <- gzfile(file)
} else if (grepl("\\.bz2$", file)){
  con <- bzfile(file)
} else {
  con <- file(file)
}

# Construct the column names and the 'what' template for scan(): the first column is a character, the rest are integers
column_headers <- strsplit(readLines(con, n = 1), '\t')[[1]]
what0 <- replicate(length(column_headers), character(0))
names(what0) <- column_headers
what0[-1] <- replicate(length(column_headers) - 1, integer(0))

x <- scan(con, skip = 1, what = what0, sep = "\t", quote = "", na.strings = "NA", quiet = FALSE, comment.char = "")
close(con)

Timing:

#Read 27278373 records
#   user  system elapsed 
# 59.127   0.773  60.162
PeteHaitch commented 10 years ago

A gunzip + data.table::fread approach:

library(data.table)

system(command = "gunzip -d -c ~/Desktop/E18BUF.CG.1.tsv.gz > ~/Desktop/E18BUF.CG.1.tsv")
y <- fread(input = "~/Desktop/E18BUF.CG.1.tsv", 
           sep = '\t', 
           header = TRUE,
           verbose = TRUE,
           showProgress = TRUE)
unlink("~/Desktop/E18BUF.CG.1.tsv")

Timing:

# Input contains no \n. Taking this to be a filename to open
# File opened, filesize is 0.514 GB
# File is opened and mapped ok
# Detected eol as \r\n (CRLF) in that order, the Windows standard.
# Looking for supplied sep '\t' on line 30 (the last non blank line in the first 'autostart') ... found ok
# Found 4 columns
# First row with 4 fields occurs on line 1 (either column names or first row of data)
# 'header' changed by user from 'auto' to TRUE
# Count of eol after first data row: 27278374
# Subtracted 1 for last eol and any trailing empty lines, leaving 27278373 data rows
# Type codes: 4111 (first 5 rows)
# Type codes: 4111 (+middle 5 rows)
# Type codes: 4111 (+last 5 rows)
# Type codes: 4111 (after applying colClasses and integer64)
# Type codes: 4111 (after applying drop or select (if supplied)
# Allocating 4 column slots (4 - 0 NULL)
# Read 27278373 rows and 4 (of 4) columns from 0.514 GB file in 00:00:06
#    0.060s (  1%) Memory map (rerun may be quicker)
#    0.000s (  0%) sep and header detection
#    1.460s ( 26%) Count rows (wc -l)
#    0.002s (  0%) Column type detection (first, middle and last 5 rows)
#    0.172s (  3%) Allocation of 27278373x4 result (xMB) in RAM
#    3.802s ( 69%) Reading data
#    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#    0.000s (  0%) Coercing data already read in type bumps (if any)
#    0.041s (  1%) Changing na.strings to NA
#    5.537s        Total
#    user  system elapsed 
#   9.472   1.168  24.391 
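An aside (untested here): depending on the data.table version, fread's input argument can also be given a shell command that pre-processes the file, which would avoid writing the temporary uncompressed copy to disk, e.g.:

y <- fread(input = "gunzip -d -c ~/Desktop/E18BUF.CG.1.tsv.gz",
           sep = '\t',
           header = TRUE)
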
PeteHaitch commented 10 years ago

The same thing but using the file ~/Desktop/E18BUF.CG.2.tsv.gz.

scan

# Read 25996755 records
#    user  system elapsed 
# 96.685   1.372  99.012 

gunzip + data.table::fread

# Input contains no \n. Taking this to be a filename to open
# File opened, filesize is 0.799 GB
# File is opened and mapped ok
# Detected eol as \r\n (CRLF) in that order, the Windows standard.
# Looking for supplied sep '\t' on line 30 (the last non blank line in the first 'autostart') ... found ok
# Found 7 columns
# First row with 7 fields occurs on line 1 (either column names or first row of data)
# 'header' changed by user from 'auto' to TRUE
# Count of eol after first data row: 25996756
# Subtracted 1 for last eol and any trailing empty lines, leaving 25996755 data rows
# Type codes: 4111111 (first 5 rows)
# Type codes: 4111111 (+middle 5 rows)
# Type codes: 4111111 (+last 5 rows)
# Type codes: 4111111 (after applying colClasses and integer64)
# Type codes: 4111111 (after applying drop or select (if supplied)
# Allocating 7 column slots (7 - 0 NULL)
# Read 25996755 rows and 7 (of 7) columns from 0.799 GB file in 00:00:09
#    0.061s (  1%) Memory map (rerun may be quicker)
#    0.000s (  0%) sep and header detection
#    3.840s ( 43%) Count rows (wc -l)
#    0.001s (  0%) Column type detection (first, middle and last 5 rows)
#    0.159s (  2%) Allocation of 25996755x7 result (xMB) in RAM
#    4.906s ( 54%) Reading data
#    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#    0.000s (  0%) Coercing data already read in type bumps (if any)
#    0.039s (  0%) Changing na.strings to NA
#    9.007s        Total
#    user  system elapsed 
#  15.051   1.814  41.357 
PeteHaitch commented 10 years ago

The same thing but using the file ~/Desktop/E18BUF.CG.2.tsv, i.e., the unzipped version.

scan

# Read 25996755 records
#    user  system elapsed 
# 135.225   1.689 136.941 

data.table::fread (no gunzip needed, since the file is already unzipped)

# user  system elapsed 
#  7.529   0.613   8.137 
PeteHaitch commented 10 years ago

Summary

There appears to be a 2-3x speed-up from data.table::fread even allowing for the overhead of gunzip-ing the files (e.g. ~60 s vs ~24 s elapsed for E18BUF.CG.1.tsv.gz, and ~99 s vs ~41 s for E18BUF.CG.2.tsv.gz). If the .tsv isn't zipped, then data.table::fread is ~17x faster (~137 s vs ~8 s elapsed). On speed alone, data.table::fread is the obvious choice.

However, having to unzip files via system is rather cumbersome, and there doesn't seem to be an alternative in base R. R.utils provides a nice gunzip helper function, but there is no equivalent for bzip2 files.

In summary, getting the speed-up of data.table::fread requires me to Import not only data.table but probably R.utils as well. I'm not sure whether this is worth the extra dependencies. For now, I will include fread.comethylation as a non-exported function that uses data.table::fread rather than scan to do the same thing as read.comethylation. I will think about whether it is worth exporting this function and including the necessary Import statements.
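For reference, a minimal sketch of what the R.utils-based route might look like for a .gz file (the temporary filename is arbitrary and the exact arguments are assumptions rather than tested package code):

library(data.table)
library(R.utils)

tmp <- file.path(tempdir(), "E18BUF.CG.1.tsv")
gunzip("~/Desktop/E18BUF.CG.1.tsv.gz", destname = tmp, remove = FALSE)
y <- fread(input = tmp, sep = "\t", header = TRUE)
unlink(tmp)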

PeteHaitch commented 10 years ago

Hmm, fread.comethylation seems to make the construction of the concrete subclass of the CoMeth VIRTUAL class very slow, potentially even wiping out the gains of fread over scan :disappointed:

At the least, reading the data with fread is a small fraction of the running time when calling fread.comethylation.

Actually, it might be slow regardless of whether scan or fread is used, i.e., the problem might be with the construction of the concrete subclass of the CoMeth VIRTUAL class.

Will need to do more careful benchmarking and profiling; see https://github.com/hadley/lineprof and http://adv-r.had.co.nz/Profiling.html
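As a first pass, base R's Rprof should show where the time goes (lineprof then gives line-level detail); the fread.comethylation call below is only a placeholder for however it is actually invoked:

prof_file <- tempfile()
Rprof(prof_file)
x <- fread.comethylation(file)  # placeholder call; arguments as appropriate
Rprof(NULL)
summaryRprof(prof_file)$by.self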

PeteHaitch commented 10 years ago

Also, fread.comethylation seems to sometimes break. Why?

Error message:

Error in Rle(tsv[, chr]) :
  error in evaluating the argument 'values' in selecting a method for function 'Rle': Error in `[.data.frame`(x, i, j) : object 'chr' not found
Timing stopped at: 18.742 3.217 23.267
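One guess at the cause, going by the [.data.frame in the traceback: tsv[, chr] only resolves chr as a column name when tsv is a data.table; if a plain data.frame reaches that line (e.g. if fread were called with data.table = FALSE somewhere), the unquoted chr is looked up as an ordinary variable and isn't found. A minimal illustration of the difference:

library(data.table)

df <- data.frame(chr = c("chr1", "chr1"), pos = 1:2)
dt <- as.data.table(df)

dt[, chr]    # works: data.table evaluates chr within the table
df[, chr]    # Error: object 'chr' not found
df[, "chr"]  # works: quoted column name with a data.frame
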
PeteHaitch commented 10 years ago

Decided to use data.table::fread. Will keep an eye on the development of fastread.