`data.table::fread` - Githubissues

ctmm-initiative / ctmm

Continuous-Time Movement Modeling. Functions for identifying, fitting, and applying continuous-space, continuous-time stochastic movement models to animal tracking data.

http://biology.umd.edu/movement.html


`data.table::fread` #2

Closed ghost closed 7 years ago

ghost commented 7 years ago

Until fread supports compressed files, or there is a cross-platform way to uncompress them, add a parameter zipfile = FALSE that selects fread, and fall back to read.csv when a zip file has to be read.

For a 160M CSV file, fread took 2.64 s while read.csv took 21 s.

xhdong-umd commented 7 years ago

The ideal solution is to use fread for zip files as well. There are several approaches:

- It's a popular request in data.table.
- Uncompress the file to stdout with fread(input = 'zcat < data.gz'). However, Windows doesn't have zcat or gzip installed by default, so it's difficult to create a simple cross-platform solution that doesn't require the user to install software first.
- Uncompress the file to a temp file, read it, then delete the temp file. The problem here is the complexity of zip files. There are multiple possible compression methods, including files created by tar, which cannot be recognized by R's internal function unzip. R.utils uses R connections to uncompress files, but you still need to identify the compression method first.

Right now the parameter method is the simplest, since it doesn't need much change to existing code. We can improve on it later, depending on new use cases or developments in related packages.
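The temp-file approach from the list above can be sketched with base R connections alone. This is a minimal illustration, not code from ctmm: the helper name read_via_tempfile and its reader argument are hypothetical. It relies on gzfile(), which, when opened for reading, also handles bzip2, xz, and uncompressed files. It does not cover .zip or tar archives, which would need unz(), utils::unzip, or utils::untar instead.

```r
# Hypothetical helper (not part of ctmm or data.table): decompress `path`
# to a temporary file, read it with `reader`, then delete the temp file.
read_via_tempfile <- function(path, reader = utils::read.csv, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)  # always remove the temp file

  # gzfile() opened for reading also accepts bzip2, xz and plain files.
  in_con <- gzfile(path, open = "rb")
  out_con <- file(tmp, open = "wb")
  tryCatch({
    # Copy in chunks so the whole file never has to fit in memory at once.
    repeat {
      chunk <- readBin(in_con, what = "raw", n = 1e6)
      if (length(chunk) == 0L) break
      writeBin(chunk, out_con)
    }
  }, finally = {
    close(in_con)
    close(out_con)
  })

  # `reader` could be data.table::fread if that package is installed.
  reader(tmp, ...)
}

# Usage: write a small gzip-compressed CSV, then read it back.
df <- data.frame(id = 1:3, x = c(0.5, 1.5, 2.5))
gz_path <- tempfile(fileext = ".csv.gz")
write.csv(df, gzfile(gz_path), row.names = FALSE)
result <- read_via_tempfile(gz_path)
```

The cost of this design is one extra pass over the data on disk, which may offset part of the speed gained by a faster reader.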

chfleming commented 7 years ago

I think the most important thing is that as.telemetry "just work" with default arguments. I put in some code that checks to see if the filename looks like a CSV, then attempts fread. If the filename doesn't look like a CSV or fread fails, then the slower read.table is used instead.

  data <- NULL
  # fread doesn't work on compressed files yet, so only try it on .csv names
  if(endsWith(tolower(object),".csv"))
  { data <- try(data.table::fread(object,data.table=FALSE,check.names=TRUE,...)) }
  # if fread was skipped or failed, then fall back on the slower read.csv
  if(!is.data.frame(data))
  { data <- utils::read.csv(object,...) }

We could add in more logic for different compression formats, but I don't know that the command & pipe notation is the same across platforms.

xhdong-umd commented 7 years ago

@chfleming This is a much better solution than an extra parameter.

I think there is no need to check compression formats since there are many possibilities and platform compatibility problems.

xhdong-umd commented 7 years ago

@chfleming I think we can actually just fread the first 5 rows and skip the file name check. A CSV file may well have a different extension (I have seen .txt before). How about this:

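# probe: try to fread only the first 5 rows, suppressing any error message
data <- try(data.table::fread(object, data.table = FALSE, check.names = TRUE, nrows = 5),
            silent = TRUE)
if (is.data.frame(data)) {
  # the probe succeeded, so read the whole file with fread
  data <- data.table::fread(object, data.table = FALSE, check.names = TRUE, ...)
} else {
  # the probe failed, so fall back on read.csv
  data <- utils::read.csv(object, ...)
}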

I think the direct read test should be fast enough to be comparable to the file name check, and it will handle all possible cases without complex logic.
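To try the probe-then-fallback idea outside the package, it can be wrapped in a small function. This is a sketch under assumptions: the name read_fast_or_fallback is hypothetical, and the requireNamespace() guard is an addition so that the example still runs where data.table is not installed. Which branch handles the compressed file depends on the installed data.table version and its optional dependencies, so the example only checks that both inputs come back as data frames.

```r
# Hypothetical wrapper (the name is an assumption, not ctmm's API).
read_fast_or_fallback <- function(object, ...) {
  probe <- NULL
  # Assumption: skip the fread probe entirely if data.table is missing.
  if (requireNamespace("data.table", quietly = TRUE)) {
    probe <- try(
      data.table::fread(object, data.table = FALSE, check.names = TRUE,
                        nrows = 5),
      silent = TRUE
    )
  }
  if (is.data.frame(probe)) {
    # The probe succeeded, so read the whole file with fread.
    data.table::fread(object, data.table = FALSE, check.names = TRUE, ...)
  } else {
    # The probe failed or was skipped, so use the slower read.csv.
    utils::read.csv(object, ...)
  }
}

# Usage: a CSV with a .txt extension and a gzip-compressed CSV.
df <- data.frame(id = 1:3, x = c(0.5, 1.5, 2.5))
txt_path <- tempfile(fileext = ".txt")
write.csv(df, txt_path, row.names = FALSE)
gz_path <- tempfile(fileext = ".csv.gz")
write.csv(df, gzfile(gz_path), row.names = FALSE)

plain <- read_fast_or_fallback(txt_path)
zipped <- read_fast_or_fallback(gz_path)
```

One limitation of this sketch: the probe does not receive the extra arguments in `...`, so an argument that only one of the two readers accepts would still raise an error in the full read.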

chfleming commented 7 years ago

That seems to work well. Pushed.