Open Julien-Lopez opened 3 years ago
caveat: I think there is a typo in your example, the example doesn't look like it's in %d-%m-%y
format... I'm assuming anyway that it's right in what I write below
fread()
does not and I think will never support native detection of/customization to non-default date formats during import in C. It's just too much of a maintenance burden IMO, with no clearly defined scope (why cater to this format, but not this one? scope creep abounds). In cases where it really is a performance burden (e.g. 10M / 100M + rows), please try and correct the data upstream to be produced in a standard format.
So that leaves supporting a way to provide formats in the fread
signature that would be applied at the R level. The first thing that comes to mind would be to extend colClasses
, e.g.
data.table::fread("date\n2020-10-10", colClasses = list(date = function(x) as.POSIXct(x, "%d-%m-%y")))
Or perhaps a new argument with dateFormat
:
data.table::fread("date\n2020-10-10", colClasses = list(Date = "date"), dateFormat = "%d-%m-%y")
and as.Date()
is called as as.Date(x, dateFormat)
internally.
This is not bad, but to we also need a timeFormat
? or is the argument like formatting_args = c(Date = "%d-%m-%y", POSIXct = "%d-%m-%y %T")
?
Ultimately, given that there's no performance benefit, I'm a bit loath to implement this myself. The following looks OK to me:
DT = data.table::fread("date\n2020-10-10")
DT[ , date := as.Date(as.character(date), "%d-%m-%y")]
or
DT = data.table::fread("date\n2020-10-10", colClasses="character")
DT[ , date := as.Date(date, "%d-%m-%y")]
And lapply()
versions for handling many columns are ~ the same.
I wonder if it's possible to just have ...
as an fread
argument, and send ...
to whatever as
method is being used by colClasses
... that could probably buy us a decent amount of flexibility without being too specific.
Related issue #1656
Apologies for the typo in the example. I fixed it.
Thank you for the answers! Looking at the source code of fread, I was hoping that indeed the transfer of the format argument to as.Date
and as.POSIXct
would be an easy enough change.
If not, do you think the option datatable.old.fread.datetime.character
could be made more permanent at least? Converting a column in a format to date in post-processing is simple enough, but it would be nice not to have to revert auto-detected dates columns back to character.
have you tried colClasses='character' for those columns? should accomplish the same
disclaimer:
lubridate
package, so if looking for a pure data.table
solution, you would have to modify the appropriate format references and function call(s) to use POSIX conventions.potential usage:
# set options for fread
# Alternate date formats HAVE to be set prior to the fread call.
# no dateFormat parameter is possible in the fread call itself.
options(datatable.fread.dateFormat=c("dmY T"))
# read in file using `MorphDate` class (defined below)
fread("myFile.csv",
na.strings=c("NA","N/A","Not Recorded","NR","NULL"),
colClasses=list(MorphDate=c("dd-mm-yyyy_formatted_datestamp","date_with_timestamp")))
Code
library('lubridate')
library('data.table')
### Set up options for data.table, including an "invented" one for the new class we will define ###
options(datatable.fread.dateFormat=c("dmY T"),
datatable.na.strings=c("N/A","NA","Not Recorded"))
### Create a custom Class with `as` creation function ###
setClass("MorphDate", prototype=NA_integer_)
as.MorphDate.character <- function(from) {
# from[which(from=="Not Recorded")] <- NA # handle with na.strings option of fread
DateObj <- lubridate::parse_date_time(from, orders=getOption("datatable.fread.dateFormat","Ymd T"), truncated=1)
doi <- as.integer(DateObj)
if (any((doi %% 86400L) > 0L, na.rm = TRUE)) {
attr(DateObj,"class") <- c("MorphDate","POSIXct","POSIXt") # could use IDateTime class, but that class uses a two column data.table of IDate and ITime and would require an explicit call to a conversion function.
} else if (all(doi < 86400L, na.rm = TRUE)) {
DateObj <- as.ITime(doi) # I don't know if the call to ITime is strictly required. setting the class may be enough.
attr(DateObj,"class") <- c("MorphDate","ITime")
} else {
DateObj <- as.Date(doi %/% 86400L, origin='1970-01-01') # or as.IDate if that is available.
attr(DateObj,"class") <- c("MorphDate","IDate","Date")
}
DateObj
}
### Define conversion functions to/from character string
MorphDate <- as.MorphDate.character
as.character.MorphDate <- function(from) format(from)
setAs("character","MorphDate",as.MorphDate.character)
setAs("MorphDate","character",as.character.MorphDate)
print.MorphDate <- function(x, ...) print(format(x, ...))
setOldClass("MorphDate",S4Class="MorphDate")
### END Class Def ###
### read in file via fread, using new "MorphDate" class
fread("myFile.csv",
colClasses=list(MorphDate=c("dd-mm-yyyy_formatted_datestamp_column","date_with_timestamp_column")))
As described in this Stack Overflow thread, it would be nice to be able to specify the format of dates types that are automatically detected by fread.
Since both as.Date and as.POSIXct seem to support formatting, I assume it wouldn't be difficult to implement with one format. Maybe this could be extended to a vector of formats? Also, the empty vector could be used to disable the detection entirely thus falling back to the old behavior if needed by the user.