JGCRI / gcamdata

The GCAM data system
https://jgcri.github.io/gcamdata/
Other
41 stars 26 forks source link

EIA_product_fuel needs a data chunk because of its use of "NA" #300

Open pkyle opened 7 years ago

pkyle commented 7 years ago

To read in data from CSV, the function called by driver() and in turn load_csv_files() is read_csv(), and we're currently using all of its defaults except the comment character. The default na strings for read_csv are: na = c("", "NA") In the old data system, the only na.string was "". However (for one example that I came across today) IMF_GDP_growth uses "n/a", and in GCAM-USA, the EIA SEDS database uses the string "NA" as a product code for natural gasoline and isopentane, which would currently be read as a missing value. While these and (potentially) other issues could just be addressed as they come up within individual chunks, it would be better to just assign some global defaults in the read_csv function call that avoid these issues. So at present, I'd want the function call to say (in line 37 of utils.R): na = c( "", "n/a"), ... But if any of you encounter missing values read as characters (one symptom is a \ variable where you were expecting \ or \), please post them here in case we can deal with it upstream of the individual chunk(s). And if any of you are relying on "NA" values in an input file being interpreted as a missing value string, please indicate that info in a comment here as well. Thanks!

rplzzz commented 7 years ago

What we could do is have load_csv_files use c('', 'n/a', 'N/A') as its NA strings (I believe those should all be safe), and any file that breaks under that convention we can load in a specialized chunk, as with zchunk_data_maddison_population.R. The rule would be that any chunk that reads data (vs. getting it with get_data) must fully clean the data before passing it on to the rest of the system.

bpbond commented 7 years ago

Yeah, exactly, agreed. A standard set of NA strings is fine, but there will be exceptions, and those (under our current design) are handled by 'data chunks' specific to the file in question.

rplzzz commented 7 years ago

Ok, who wants to volunteer?

bpbond commented 7 years ago

This doesn't seem to be causing any problems, and we're following the rules laid out above. Closing.

pkyle commented 7 years ago

OK I just checked and the reason it didn't cause any problems is that the NA product isn't currently mapped to anything. If it had been assigned to refined liquids, there would be a problem here. Here's a screenshot of the tibble EIA_SEDS_fuels:

eia_seds_fuels

I'm not sure why it isn't mapped to refined liquids (I'd need to review the EIA SEDS documentation, and it's possible that it should have been mapped), but in the current set-up, we can't map it to anything, because the product code is read as a missing value.

bpbond commented 7 years ago

Thanks, Page. OK, let's reopen this but change the title to more clearly reflect the problem.

bpbond commented 6 years ago

@pkyle I went and looked at EIA_product_fuel.csv and it doesn't match what you show above. And we don't have a EIA_SEDS_fuels file that I see. The IMF_GDP_growth example does get read incorrectly, as "n/a" instead of NA, but this is handled in line 188 of L102.GDP.R.

Is this a problem in the current dsr?

pkyle commented 6 years ago

@bpbond this August 8 dump of issues was perfectly timed! The one 24-hour window that I wouldn't see 'em for 7 weeks. I made a mistake in my comment above that I just corrected--these files are all in gcam-usa-data/mappings (not in energy-data/mappings), and the file I was referring to was EIA_SEDS_fuels.csv. Yeah the n/a in the IMF data is read as a character string; we could add it to our global na.strings but I didn't see the need to do that at the time.