ctmm-initiative / ctmm

Continuous-Time Movement Modeling. Functions for identifying, fitting, and applying continuous-space, continuous-time stochastic movement models to animal tracking data.
http://biology.umd.edu/movement.html

`data.table::fread` #2

ghost closed 7 years ago

ghost commented 7 years ago

For a 160 MB CSV file, fread took 2.64 s while read.csv took 21 s.
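A comparison like this can be reproduced with `system.time`. The sketch below is only illustrative: it uses a small synthetic file rather than the 160 MB file mentioned above, the column names are made up, and the timings will differ by machine. It assumes the data.table package is installed and skips the fread timing otherwise.

```r
# Synthetic stand-in for a tracking CSV; column names are illustrative only.
n <- 1e5
df <- data.frame(
  timestamp     = seq_len(n),
  location.long = runif(n, -180, 180),
  location.lat  = runif(n, -90, 90)
)
path <- tempfile(fileext = ".csv")
utils::write.csv(df, path, row.names = FALSE)

# Time the base R reader.
t_base <- system.time(base_df <- utils::read.csv(path))

# Time fread only if data.table is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  t_fread <- system.time(fast_df <- data.table::fread(path, data.table = FALSE))
  print(c(read.csv = t_base[["elapsed"]], fread = t_fread[["elapsed"]]))
} else {
  fast_df <- base_df
  print(c(read.csv = t_base[["elapsed"]]))
}
```

Both readers should return a data frame with the same dimensions; only the elapsed time is expected to differ.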

xhdong-umd commented 7 years ago

The ideal solution is to also use fread for zip files. There are several approaches:

Right now the parameter method is the simplest, since it needs little change to the existing code. We can improve on it further depending on new usage or developments in related packages.

chfleming commented 7 years ago

I think the most important thing is that as.telemetry "just work" with default arguments. I put in some code that checks to see if the filename looks like a CSV, then attempts fread. If the filename doesn't look like a CSV or fread fails, then the slower read.table is used instead.

  data <- NULL
  # fread doesn't work on compressed files yet
  if(endsWith(tolower(object),".csv"))
  { data <- try(data.table::fread(object,data.table=FALSE,check.names=TRUE,...)) }
  # if fread fails, then fall back on read.table
  if(class(data)!="data.frame")
  { data <- utils::read.csv(object,...) }

We could add in more logic for different compression formats, but I don't know that the command & pipe notation is the same across platforms.

xhdong-umd commented 7 years ago

@chfleming This is a much better solution than the extra parameter.

I think there is no need to check compression formats since there are many possibilities and platform compatibility problems.

xhdong-umd commented 7 years ago

@chfleming I think we can actually just fread the first 5 rows without the file name check. A CSV file may have a different extension (I have seen .txt before). How about this:

data <- try(data.table::fread(object, data.table = FALSE, check.names = TRUE, nrows = 5), 
            silent = TRUE)
if (class(data) == "data.frame") {
  data <- data.table::fread(object,data.table=FALSE,check.names=TRUE,...)
} else {
  data <- utils::read.csv(object,...)
}

I think the direct read test should be fast enough to be comparable to the file name check, and it handles all possible cases without complex logic.

chfleming commented 7 years ago

That seems to work well. Pushed.
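
For reference, the probe-then-fall-back approach discussed in this thread can be sketched as a standalone helper. This is a minimal sketch, not the code that was pushed to ctmm: the name `read_tracking_csv` is hypothetical, and the `requireNamespace` guard and `is.data.frame` check are additions made here so the example also runs without data.table installed.

```r
# Hypothetical helper (not part of ctmm): probe the file with fread on a few
# rows, then read it fully with fread, or fall back to read.csv on failure.
read_tracking_csv <- function(file, ...) {
  probe <- NULL
  if (requireNamespace("data.table", quietly = TRUE)) {
    # Cheap test read of the first 5 rows; errors are captured, not raised.
    probe <- try(
      data.table::fread(file, data.table = FALSE, check.names = TRUE, nrows = 5),
      silent = TRUE
    )
  }
  if (is.data.frame(probe)) {
    data.table::fread(file, data.table = FALSE, check.names = TRUE, ...)
  } else {
    # Slower but more tolerant base R reader.
    utils::read.csv(file, ...)
  }
}

# Usage: a CSV saved with a .txt extension, which a file name check would miss.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "timestamp,location.long,location.lat",
  "1,10.5,20.5",
  "2,11.5,21.5",
  "3,12.5,22.5"
), path)
tracks <- read_tracking_csv(path)
```

Because the decision is based on whether the test read succeeds rather than on the extension, the same call covers `.csv`, `.txt`, and other file names.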