Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 986 forks source link

Add a dateFormat argument to fread #5006

Open Julien-Lopez opened 3 years ago

Julien-Lopez commented 3 years ago

As described in this Stack Overflow thread, it would be nice to be able to specify the format of dates types that are automatically detected by fread.

data.table::fread("date\n10-10-2020", dateFormat="%d-%m-%y")

Since both as.Date and as.POSIXct seem to support formatting, I assume it wouldn't be difficult to implement with one format. Maybe this could be extended to a vector of formats? Also, the empty vector could be used to disable the detection entirely thus falling back to the old behavior if needed by the user.

MichaelChirico commented 3 years ago

caveat: I think there is a typo in your example, the example doesn't look like it's in %d-%m-%y format... I'm assuming anyway that it's right in what I write below

fread() does not and I think will never support native detection of/customization to non-default date formats during import in C. It's just too much of a maintenance burden IMO, with no clearly defined scope (why cater to this format, but not this one? scope creep abounds). In cases where it really is a performance burden (e.g. 10M / 100M + rows), please try and correct the data upstream to be produced in a standard format.

So that leaves supporting a way to provide formats in the fread signature that would be applied at the R level. The first thing that comes to mind would be to extend colClasses, e.g.

data.table::fread("date\n2020-10-10", colClasses = list(date = function(x) as.POSIXct(x, "%d-%m-%y")))

Or perhaps a new argument with dateFormat:

data.table::fread("date\n2020-10-10", colClasses = list(Date = "date"), dateFormat = "%d-%m-%y")

and as.Date() is called as as.Date(x, dateFormat) internally.

This is not bad, but to we also need a timeFormat? or is the argument like formatting_args = c(Date = "%d-%m-%y", POSIXct = "%d-%m-%y %T")?

Ultimately, given that there's no performance benefit, I'm a bit loath to implement this myself. The following looks OK to me:

DT = data.table::fread("date\n2020-10-10")
DT[ , date := as.Date(as.character(date), "%d-%m-%y")]

or

DT = data.table::fread("date\n2020-10-10", colClasses="character")
DT[ , date := as.Date(date, "%d-%m-%y")]

And lapply() versions for handling many columns are ~ the same.

MichaelChirico commented 3 years ago

I wonder if it's possible to just have ... as an fread argument, and send ... to whatever as method is being used by colClasses... that could probably buy us a decent amount of flexibility without being too specific.

jangorecki commented 3 years ago

Related issue #1656

Julien-Lopez commented 3 years ago

Apologies for the typo in the example. I fixed it.

Thank you for the answers! Looking at the source code of fread, I was hoping that indeed the transfer of the format argument to as.Date and as.POSIXct would be an easy enough change.

If not, do you think the option datatable.old.fread.datetime.character could be made more permanent at least? Converting a column in a format to date in post-processing is simple enough, but it would be nice not to have to revert auto-detected dates columns back to character.

MichaelChirico commented 3 years ago

have you tried colClasses='character' for those columns? should accomplish the same

MPagel commented 2 years ago

disclaimer:

potential usage:

# set options for fread
# Alternate date formats HAVE to be set prior to the fread call.
# no dateFormat parameter is possible in the fread call itself.
options(datatable.fread.dateFormat=c("dmY T"))

# read in file using `MorphDate` class (defined below)
fread("myFile.csv",
    na.strings=c("NA","N/A","Not Recorded","NR","NULL"),
    colClasses=list(MorphDate=c("dd-mm-yyyy_formatted_datestamp","date_with_timestamp")))

Code

library('lubridate')
library('data.table')

### Set up options for data.table, including an "invented" one for the new class we will define ###
options(datatable.fread.dateFormat=c("dmY T"),
        datatable.na.strings=c("N/A","NA","Not Recorded"))

### Create a custom Class with `as` creation function ###
setClass("MorphDate", prototype=NA_integer_)
as.MorphDate.character <- function(from) {
  # from[which(from=="Not Recorded")] <- NA # handle with na.strings option of fread
  DateObj <- lubridate::parse_date_time(from, orders=getOption("datatable.fread.dateFormat","Ymd T"), truncated=1)
  doi <- as.integer(DateObj)
  if (any((doi %% 86400L) > 0L, na.rm = TRUE)) {
    attr(DateObj,"class") <- c("MorphDate","POSIXct","POSIXt")     # could use IDateTime class, but that class uses a two column data.table of IDate and ITime and would require an explicit call to a conversion function.
  } else if (all(doi < 86400L,  na.rm = TRUE)) {
    DateObj <- as.ITime(doi)                                       # I don't know if the call to ITime is strictly required.  setting the class may be enough.
    attr(DateObj,"class") <- c("MorphDate","ITime")
  } else {
    DateObj <- as.Date(doi %/% 86400L, origin='1970-01-01')  # or as.IDate if that is available.
    attr(DateObj,"class") <- c("MorphDate","IDate","Date")
  }
  DateObj
}

### Define conversion functions to/from character string
MorphDate <- as.MorphDate.character
as.character.MorphDate <- function(from) format(from)
setAs("character","MorphDate",as.MorphDate.character)
setAs("MorphDate","character",as.character.MorphDate)
print.MorphDate <- function(x, ...) print(format(x, ...))
setOldClass("MorphDate",S4Class="MorphDate")
### END Class Def ###

### read in file via fread, using new "MorphDate" class
fread("myFile.csv",
    colClasses=list(MorphDate=c("dd-mm-yyyy_formatted_datestamp_column","date_with_timestamp_column")))