Performance considerations; shift away from readr?

bpbond commented 5 years ago

As I noted in #1073 , it's not clear that we need the readr dependency at this point; could simply pass the column class information read.csv.

@ashiklom noted on Slack as well:

Have we ever considered using data.table::fread instead (especially since we already import it)?

Two other thoughts I had:

Since data.table is so fast, we may be able to achieve speed gains without much re-writing by using dtplyr for at least some functions. E.g. dtplyr::left_join -- https://github.com/hadley/dtplyr/blob/master/R/joins.R
One package to keep an eye on is vroom, which uses R 3.5's new ALTREP framework (which, as far as I understand, is black magic), to read data lazily and therefore claims to achieve blisteringly fast speeds. Currently in active development, and not on CRAN, but maybe worth a GitHub star: https://github.com/jimhester/vroom

bpbond commented 5 years ago

Re performance see also #1068

bpbond commented 5 years ago

fread does not currently support comment characters (see https://github.com/Rdatatable/data.table/issues/856) so we'd need to work around that.

bpbond commented 5 years ago

fread offers a pre-processing step that makes this easy

fread(cmd = paste("sed '/^#/d'", fn))

...but Windows? This might be a deal-breaker.

ashiklom commented 5 years ago

Since we are already have a function for reading the comment header, how about something like this?

header <- read_comment_header(file)
data <- data.table(fread, skip = length(header))

bpbond commented 5 years ago

Thanks. Yeah I'm just not sure if there are comments elsewhere in some files.

ashiklom commented 5 years ago

Just checked and there are, but only a handful:

aglu/Mueller_crops.csv:6:# Source: Mueller, N. D.; Gerber, J. S.; Johnston, M.; Ray, D. K.; Ramankutty, N.; Foley, J. A. Closing yield gaps through nutrient and water management. Nature 2012, 490 (7419), 254ñ257.,
energy/A54.tranSubsector_shrwt.csv:59:# Passthrough tranSubsectors are listed below
energy/A54.tranSubsector_logit.csv:58:# Passthrough tranSubsectors are listed below,,,,absolute-cost-logit
energy/A54.tranSubsector_VOTT.csv:31:# Pass-through tranSubsectors
energy/A54.tranSubsector_VOTT_ssp1.csv:31:# Pass-through tranSubsectors

If we wanted to continue to support comments in the file bodies (which I personally think is a bad idea, but I understand the rationale), then that does rule out fread for now. But if we fix these 5 files, then the skip = length(header) approach should work.

bpbond commented 5 years ago

Huh, thanks for checking! As usual, it's too bad about Windows because the cmd = thing above is super slick and easy.

kanishkan91 commented 5 years ago

@bpbond As discussed earlier, the readr tries to parse integers written in scientific notations as doubles. Changing the coltypes to ‘i’ does not solve this problem when there are mixed data types in a column. You get the following error:

Error : (converted from warning) 1 parsing failure. row col expected actual file 7 2009 no trailing characters .14E+05 'C:/Users/nara836/Documents/R/win-library/3.6/gcamdata/extdata/aglu/FAO/FAO_Fert_Prod_tN_RESOURCESTAT.csv'

Error in load_csv_files(unfound_chunk$input, unfound_chunk$optional, quiet = TRUE) : Error or warning while reading FAO_Fert_Prod_tN_RESOURCESTAT.csv

The workaround I have been using, is gathering all values into 1 column and converting entire "Value" column to an integer using as.type and then spreading back.

pkyle commented 5 years ago

@kanishkan91 what is the dataset that is reporting data with scientific notation that should be read as integer? I can't think of any case where that would be desirable, and don't see any in the current gcamdata repo.

kanishkan91 commented 5 years ago

@pkyle I found this in the FAO data. Where a number is too large, the FAO dataset will present that as a scientific notation as opposed to a number. This needs to be converted back to an integer, or the driver crashes with the above mentioned error.

pkyle commented 5 years ago

I'm sorry I missed the error; what is the problem with having all of these data as doubles? Does that cause an error? Quantities (e.g. tonnes of fertilizer) are not really integers; they're just rounded to the nearest unit.

kanishkan91 commented 5 years ago

@pkyle Thanks for helping with the debug on this. If you use a coltype of 'n' in the metadata, the problem goes away. But we have historically been reading in data from FAO as integers. I will update that. But others may run into this error elsewhere where they actually want to read data in as integers.

pkyle commented 5 years ago

Great! Just to clarify and put in writing what we discussed, the reason for data columns input as i is simply that we carried readr's estimates (where they worked, anyway!) when we first rolled out the manually specified column types. However the integer column type should be reserved for things like years and item codes that can't have decimal values. Value data that is rounded to the nearest unit should be specified as n, not i.

JGCRI / gcamdata

Performance considerations; shift away from readr? #1078