Closed PeteHaitch closed 10 years ago
A `scan` approach:
```r
file <- "~/Desktop/E18BUF.CG.1.tsv.gz"
# Open the right connection type based on the file extension
if (grepl("\\.gz$", file)) {
  con <- gzfile(file)
} else if (grepl("\\.bz2$", file)) {
  con <- bzfile(file)
} else {
  con <- file(file)
}
# Construct header and column types. The first column is a character, the rest are integers
column_headers <- strsplit(readLines(con, n = 1), "\t")[[1]]
what0 <- replicate(length(column_headers), character(0))
names(what0) <- column_headers
what0[-1] <- replicate(length(column_headers) - 1, integer(0))
x <- scan(con, skip = 1, what = what0, sep = "\t", quote = "",
          na.strings = "NA", quiet = FALSE, comment.char = "")
close(con)
```
Timing:

```
# Read 27278373 records
#    user  system elapsed
#  59.127   0.773  60.162
```
A `gunzip` + `data.table::fread` approach:
```r
system(command = "gunzip -d -c ~/Desktop/E18BUF.CG.1.tsv.gz > ~/Desktop/E18BUF.CG.1.tsv")
y <- fread(input = "~/Desktop/E18BUF.CG.1.tsv",
           sep = "\t",
           header = TRUE,
           verbose = TRUE,
           showProgress = TRUE)
unlink("~/Desktop/E18BUF.CG.1.tsv")
```
Timing:

```
# Input contains no \n. Taking this to be a filename to open
# File opened, filesize is 0.514 GB
# File is opened and mapped ok
# Detected eol as \r\n (CRLF) in that order, the Windows standard.
# Looking for supplied sep '\t' on line 30 (the last non blank line in the first 'autostart') ... found ok
# Found 4 columns
# First row with 4 fields occurs on line 1 (either column names or first row of data)
# 'header' changed by user from 'auto' to TRUE
# Count of eol after first data row: 27278374
# Subtracted 1 for last eol and any trailing empty lines, leaving 27278373 data rows
# Type codes: 4111 (first 5 rows)
# Type codes: 4111 (+middle 5 rows)
# Type codes: 4111 (+last 5 rows)
# Type codes: 4111 (after applying colClasses and integer64)
# Type codes: 4111 (after applying drop or select (if supplied)
# Allocating 4 column slots (4 - 0 NULL)
# Read 27278373 rows and 4 (of 4) columns from 0.514 GB file in 00:00:06
#   0.060s (  1%) Memory map (rerun may be quicker)
#   0.000s (  0%) sep and header detection
#   1.460s ( 26%) Count rows (wc -l)
#   0.002s (  0%) Column type detection (first, middle and last 5 rows)
#   0.172s (  3%) Allocation of 27278373x4 result (xMB) in RAM
#   3.802s ( 69%) Reading data
#   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#   0.000s (  0%) Coercing data already read in type bumps (if any)
#   0.041s (  1%) Changing na.strings to NA
#   5.537s        Total
#    user  system elapsed
#   9.472   1.168  24.391
```
The same thing but using the file `~/Desktop/E18BUF.CG.2.tsv.gz`.
`scan`

```
# Read 25996755 records
#    user  system elapsed
#  96.685   1.372  99.012
```
`gunzip` + `data.table::fread`

```
# Input contains no \n. Taking this to be a filename to open
# File opened, filesize is 0.799 GB
# File is opened and mapped ok
# Detected eol as \r\n (CRLF) in that order, the Windows standard.
# Looking for supplied sep '\t' on line 30 (the last non blank line in the first 'autostart') ... found ok
# Found 7 columns
# First row with 7 fields occurs on line 1 (either column names or first row of data)
# 'header' changed by user from 'auto' to TRUE
# Count of eol after first data row: 25996756
# Subtracted 1 for last eol and any trailing empty lines, leaving 25996755 data rows
# Type codes: 4111111 (first 5 rows)
# Type codes: 4111111 (+middle 5 rows)
# Type codes: 4111111 (+last 5 rows)
# Type codes: 4111111 (after applying colClasses and integer64)
# Type codes: 4111111 (after applying drop or select (if supplied)
# Allocating 7 column slots (7 - 0 NULL)
# Read 25996755 rows and 7 (of 7) columns from 0.799 GB file in 00:00:09
#   0.061s (  1%) Memory map (rerun may be quicker)
#   0.000s (  0%) sep and header detection
#   3.840s ( 43%) Count rows (wc -l)
#   0.001s (  0%) Column type detection (first, middle and last 5 rows)
#   0.159s (  2%) Allocation of 25996755x7 result (xMB) in RAM
#   4.906s ( 54%) Reading data
#   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#   0.000s (  0%) Coercing data already read in type bumps (if any)
#   0.039s (  0%) Changing na.strings to NA
#   9.007s        Total
#    user  system elapsed
#  15.051   1.814  41.357
```
The same thing but using the file `~/Desktop/E18BUF.CG.2.tsv`, i.e. the unzipped version.
`scan`

```
# Read 25996755 records
#    user  system elapsed
# 135.225   1.689 136.941
```
`gunzip` + `data.table::fread`

```
#    user  system elapsed
#   7.529   0.613   8.137
```
There looks to be a 2-3x speed-up using `data.table::fread`, even allowing for the overhead of `gunzip`-ing the files. If the .tsv isn't zipped, then `data.table::fread` is ~17x faster. On speed alone, `data.table::fread` is the obvious choice.
However, having to unzip files via `system` is rather cumbersome and there doesn't seem to be an alternative in base R. `R.utils` provides a nice `gunzip` helper function but there is no equivalent for bzip2 files.
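That said, base R's connections can at least do the decompression without shelling out: copy raw bytes from a `gzfile`/`bzfile` connection into a plain temporary file. A minimal sketch, assuming this is acceptable in place of `system("gunzip ...")`; the helper name `decompress_to_tempfile` is my own, not from any package:

```r
# Decompress a .gz or .bz2 file to a plain temporary file using only base R
# connections, returning the path to the decompressed copy.
decompress_to_tempfile <- function(path, chunk_size = 1e7) {
  in_con <- if (grepl("\\.gz$", path)) {
    gzfile(path, open = "rb")
  } else if (grepl("\\.bz2$", path)) {
    bzfile(path, open = "rb")
  } else {
    stop("Not a recognised compressed file: ", path)
  }
  on.exit(close(in_con), add = TRUE)
  out_path <- tempfile(fileext = ".tsv")
  out_con <- file(out_path, open = "wb")
  on.exit(close(out_con), add = TRUE)
  # Copy raw bytes in chunks so memory use stays bounded.
  repeat {
    bytes <- readBin(in_con, what = "raw", n = chunk_size)
    if (length(bytes) == 0L) break
    writeBin(bytes, out_con)
  }
  out_path
}
```

This keeps the gz/bz2 handling symmetric without a new dependency, at the cost of the same extra on-disk copy as the `gunzip` approach.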
In summary, getting the speed-up of `data.table::fread` requires me to Import not only `data.table` but probably `R.utils` as well. I'm not sure whether this is worth the extra dependencies. For now, I will include `fread.comethylation` as a non-exported function that uses `data.table::fread` rather than `scan` to do the same thing as `read.comethylation`. I will think about whether it is worth exporting this function and including the necessary Import statements.
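For reference, a minimal sketch of the shape such a non-exported helper could take, assuming `data.table` is available; the name `fread_tsv` and its arguments are illustrative guesses, not `fread.comethylation`'s actual interface:

```r
# Read a (possibly gzipped) TSV via data.table::fread, unzipping to a
# temporary file first when needed (fread, at the time, could not read
# gzipped input directly).
fread_tsv <- function(file) {
  file <- path.expand(file)
  if (grepl("\\.gz$", file)) {
    tsv <- tempfile(fileext = ".tsv")
    system(paste("gunzip -d -c", shQuote(file), ">", shQuote(tsv)))
    on.exit(unlink(tsv), add = TRUE)
    file <- tsv
  }
  data.table::fread(input = file, sep = "\t", header = TRUE)
}
```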
Hmm, `fread.comethylation` seems to make the construction of the concrete subclass of the `CoMeth` VIRTUAL class very slow, potentially even wiping out the gains of `fread` over `scan` :disappointed:
At the least, reading the data with `fread` is only a small fraction of the running time when calling `fread.comethylation`.
Actually, it might be slow regardless of whether `scan` or `fread` is used, i.e., the problem might be with the construction of the concrete subclass of the `CoMeth` VIRTUAL class.
Will need to do more careful benchmarking and profiling; see https://github.com/hadley/lineprof and http://adv-r.had.co.nz/Profiling.html
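Before reaching for lineprof, base R's `Rprof` gives a quick first cut at where the time goes. A sketch with a stand-in workload (the real target would be the slow `CoMeth` construction call):

```r
# Profile a block of code with Rprof and summarise self-time per function.
prof_file <- tempfile(fileext = ".Rprof")
Rprof(prof_file, interval = 0.01)
# Stand-in workload; replace with the CoMeth constructor call.
for (i in seq_len(5000)) d <- data.frame(a = runif(100), b = runif(100))
Rprof(NULL)
# Guard against an empty profile if the workload finished too quickly.
if (length(readLines(prof_file)) > 1L) {
  print(head(summaryRprof(prof_file)$by.self))
}
```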
Also, `fread.comethylation` sometimes seems to break. Why?
Error message:

```
Error in Rle(tsv[, chr]) :
  error in evaluating the argument 'values' in selecting a method for function 'Rle': Error in `[.data.frame`(x, i, j) : object 'chr' not found
Timing stopped at: 18.742 3.217 23.267
```
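A plausible explanation (a guess from the traceback, since the calling code isn't shown): `data.table`'s `[` checks whether the calling package is data.table-aware (`data.table:::cedta`) and falls back to `[.data.frame` semantics when the package doesn't Import `data.table`, which would match the `[.data.frame` frame in the error. Under data.frame semantics, `tsv[, chr]` looks for a *variable* named `chr` rather than the column. Indexing by name with `[[` sidesteps the ambiguity:

```r
# 'tsv' mimics a cometh.tsv table with a column literally named 'chr'.
tsv <- data.frame(chr = c("chr1", "chr1", "chr2"),
                  pos = c(1L, 5L, 9L),
                  stringsAsFactors = FALSE)
# tsv[, chr] only means "the chr column" under data.table-aware scoping;
# under data.frame semantics it fails with "object 'chr' not found".
# Double-bracket indexing by name works identically for both classes:
x <- tsv[["chr"]]
```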
Decided to use `data.table::fread`. Will keep an eye on the development of `fastread`.
Try replacing `scan` with `data.table::fread` in `read.comethylation`. If there is a significant speed-up then either import the `data.table` package, or write a bespoke function that imitates `fread` for `cometh.tsv` files.