bpbond opened this issue 5 years ago
Re: performance, see also #1068
`fread` does not currently support comment characters (see https://github.com/Rdatatable/data.table/issues/856), so we'd need to work around that. `fread` offers a pre-processing step that makes this easy:

```r
fread(cmd = paste("sed '/^#/d'", fn))
```

...but Windows? This might be a deal-breaker.
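One portable alternative that avoids external tools like `sed` would be to drop the comment lines in R itself before handing the text to `fread`. A rough sketch (the function name is made up, and it assumes a `data.table` version new enough to have `fread`'s `text =` argument):

```r
# Sketch: read the raw lines, strip "#" comment lines, and let fread parse the rest.
# No shell tools required, so it should behave the same on Windows.
read_csv_nocomment <- function(fn) {
  lines <- readLines(fn)
  data.table::fread(text = lines[!grepl("^#", lines)])
}
```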
Since we already have a function for reading the comment header, how about something like this?

```r
header <- read_comment_header(file)
data <- data.table::fread(file, skip = length(header))
```
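For reference, a minimal sketch of what such a helper could look like (illustrative only, not gcamdata's actual implementation):

```r
# Return the leading run of "#" comment lines, so fread can skip past them.
read_comment_header <- function(file) {
  lines <- readLines(file)
  lines[cumsum(!grepl("^#", lines)) == 0]
}
```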
Thanks. Yeah I'm just not sure if there are comments elsewhere in some files.
Just checked and there are, but only a handful:
```
aglu/Mueller_crops.csv:6:# Source: Mueller, N. D.; Gerber, J. S.; Johnston, M.; Ray, D. K.; Ramankutty, N.; Foley, J. A. Closing yield gaps through nutrient and water management. Nature 2012, 490 (7419), 254–257.,
energy/A54.tranSubsector_shrwt.csv:59:# Passthrough tranSubsectors are listed below
energy/A54.tranSubsector_logit.csv:58:# Passthrough tranSubsectors are listed below,,,,absolute-cost-logit
energy/A54.tranSubsector_VOTT.csv:31:# Pass-through tranSubsectors
energy/A54.tranSubsector_VOTT_ssp1.csv:31:# Pass-through tranSubsectors
```
If we wanted to continue to support comments in the file bodies (which I personally think is a bad idea, but I understand the rationale), then that does rule out `fread` for now. But if we fix these 5 files, then the `skip = length(header)` approach should work.
Huh, thanks for checking! As usual, it's too bad about Windows, because the `cmd =` thing above is super slick and easy.
@bpbond As discussed earlier, readr tries to parse integers written in scientific notation as doubles. Changing the column type to `i` does not solve this problem when there are mixed data types in a column. You get the following error:
```
Error : (converted from warning) 1 parsing failure.
row  col expected               actual  file
  7 2009 no trailing characters .14E+05 'C:/Users/nara836/Documents/R/win-library/3.6/gcamdata/extdata/aglu/FAO/FAO_Fert_Prod_tN_RESOURCESTAT.csv'

Error in load_csv_files(unfound_chunk$input, unfound_chunk$optional, quiet = TRUE) :
  Error or warning while reading FAO_Fert_Prod_tN_RESOURCESTAT.csv
```
The workaround I have been using is gathering all values into one column, converting the entire "Value" column to integer, and then spreading back.
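A sketch of that workaround with toy data (the real FAO columns differ, and `as.integer` here is just one way to do the conversion):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the FAO table: wide year columns, one of which contains a
# scientific-notation value that readr refused to parse as integer.
fao_raw <- tibble(Country = c("A", "B"),
                  `2009`  = c("3.14E+05", "123"),
                  `2010`  = c("456", "789"))

fao_fixed <- fao_raw %>%
  gather(year, Value, -Country) %>%                         # collect all year columns into one Value column
  mutate(Value = as.integer(round(as.numeric(Value)))) %>%  # coerce the mixed strings to integer
  spread(year, Value)                                       # restore the wide layout
```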
@kanishkan91 what is the dataset that is reporting data with scientific notation that should be read as integer? I can't think of any case where that would be desirable, and don't see any in the current gcamdata repo.
@pkyle I found this in the FAO data. Where a number is too large, the FAO dataset presents it in scientific notation rather than as a plain number. This needs to be converted back to an integer, or the driver crashes with the error mentioned above.
I'm sorry I missed the error; what is the problem with having all of these data as doubles? Does that cause an error? Quantities (e.g. tonnes of fertilizer) are not really integers; they're just rounded to the nearest unit.
@pkyle Thanks for helping debug this. If you use a column type of `n` in the metadata, the problem goes away. But we have historically been reading in data from FAO as integers; I will update that. Others may still run into this error elsewhere, where they actually want to read data in as integers.
Great! Just to clarify and put in writing what we discussed, the reason for data columns input as `i` is simply that we carried over readr's estimates (where they worked, anyway!) when we first rolled out the manually specified column types. However, the integer column type should be reserved for things like years and item codes that can't have decimal values. Value data that is rounded to the nearest unit should be specified as `n`, not `i`.
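A toy example of the difference (illustrative data only; in readr's compact codes, `c` = character, `i` = integer, `d` = double, `n` = number):

```r
library(readr)

txt <- "iso,year,value\nusa,2009,3.14E+05\n"  # mimics the FAO scientific-notation case

read_csv(txt, col_types = "cii")  # value as 'i': parsing failure, "no trailing characters"
read_csv(txt, col_types = "cid")  # value as 'd': parses fine; per the discussion above, 'n' also resolves it
```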
As I noted in #1073, it's not clear that we need the readr dependency at this point; we could simply pass the column class information to `read.csv`.
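A rough sketch of what that could look like (the one-letter-code mapping and the `col_types` string are assumptions for illustration, not the actual gcamdata metadata format):

```r
# Map readr-style compact codes onto base-R colClasses and read with read.csv.
type_map  <- c(c = "character", i = "integer", d = "double", n = "numeric", l = "logical")
col_types <- "cin"                                  # hypothetical per-file type string
classes   <- type_map[strsplit(col_types, "")[[1]]]

# read.csv also handles "#" comment lines natively via comment.char.
dat <- read.csv(fn, colClasses = classes, comment.char = "#", stringsAsFactors = FALSE)
```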
@ashiklom noted on Slack as well:
Have we ever considered using `data.table::fread` instead (especially since we already import it)?

Two other thoughts I had:

- `data.table` is so fast, we may be able to achieve speed gains without much re-writing by using `dtplyr` for at least some functions. E.g. `dtplyr::left_join` -- https://github.com/hadley/dtplyr/blob/master/R/joins.R
- `vroom`, which uses R 3.5's new `ALTREP` framework (which, as far as I understand, is black magic) to read data lazily, and therefore claims to achieve blisteringly fast speeds. Currently in active development, and not on CRAN, but maybe worth a GitHub star: https://github.com/jimhester/vroom
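A small sketch of the `dtplyr` idea, using the current `lazy_dt()` API (newer than the dtplyr available when this was written) and made-up toy tables:

```r
library(data.table)
library(dtplyr)
library(dplyr)

# lazy_dt() wraps a data.table so that dplyr verbs are translated to data.table code.
values  <- lazy_dt(data.table(iso = c("usa", "can"), value = c(1.2, 3.4)))
regions <- lazy_dt(data.table(iso = c("usa", "can"), region = c("USA", "Canada")))

values %>%
  left_join(regions, by = "iso") %>%  # translated to a data.table join under the hood
  as_tibble()                         # collect the result back into a tibble
```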