Issues on Transportation Sector CSV Data in GCAM

SoybeanMilk2016 commented 3 months ago

Greetings! I recently commenced my exploration of GCAM, having initiated my studies a week ago. My objective is to employ this renowned model to simulate varied development trajectories of the transportation sector influenced by climate change policies. In the course of my analysis, several questions have arisen. I would appreciate any insights provided.

1. about data modifiability

I have explored transportation-related issues under this repository, such as Issue #261 (Error while using transportation_UCD_highEV.xml). Thanks to @pkyle for the insights there, which prompted me to read input/gcamdata/inst/extdata/energy/OTAQ_trn_data_EMF37.csv thoroughly. Here is what puzzles me: is the CSV data within input/gcamdata/inst/extdata/energy designed to generate the input/gcamdata/xml files that configuration.xml uses? I am asking to find out whether the 2020 to 2100 data in OTAQ_trn_data_EMF37.csv are inputs that can be freely modified for scenario construction, rather than GCAM outcomes/predictions.

2. about relationship between xml and csv files

I also read the contents of input/gcamdata/inst/extdata/energy/UCD_trn_data_CORE.csv. Is the file input/gcamdata/xml/transportation_UCD_CORE.xml generated solely from that CSV file, or do some of the parameters in the XML stem from a synthesis of raw data across multiple CSV files, refined through specific mathematical formulations by gcamdata? If the latter is the case, is it advisable for a newcomer to first grasp the mathematical formulations underpinning the XML parameters and then adjust the associated raw data within the CSV files, rather than directly manipulating the parameters in the XML file? Where should I look to find the mathematical formulas behind the parameters? I ask because parameters such as coefficient, capital-coef, depreciation-rate, and input-cost in the XML files lack direct equivalents in the CSV files, in contrast to parameters like speed and loadFactor, which are found within the CSV files. The precise meanings and units of measurement for these parameters confuse me.

3. about data inconsistencies for identical indicators across two CSV files

Within the directory input/gcamdata/inst/extdata/energy, two files, namely UCD_trn_data_CORE.csv and OTAQ_trn_data_EMF37.csv, are similar to some extent. Despite sharing common parameters, there may be some discrepancies in the values across these files. For example, under the filter path "USA--Passenger--LDV_4W--Compact Car--FCEV--Hydrogen--Capital costs (purchase)", it is observed that in OTAQ_trn_data_EMF37.csv, the values for 2005-2015 are listed as NA, while for 2020, it is 36146.89378. The same parameter in UCD_trn_data_CORE.csv indicates a 2020 value of 21251.17. What accounts for these discrepancies?

4. about share-weight

Sorry for my persisting confusion regarding the concept of "share-weights". Conventionally, the sum of weights equals 1, yet this standard does not seem to apply within GCAM (take input/gcamdata/inst/extdata/energy/A54.globaltranTech_shrwt_revised.csv as an example). Is it correct to interpret a value of 1 as denoting the presence of a technology in a given year, and a value of 0 as its total absence? Could this be considered a binary classification variable? If so, what significance do decimal values hold? Is this mechanism intended to distribute technological changes more uniformly across several years? Additionally, regarding the actual adoption (or penetration) rate of different technologies in a future year (e.g., 2100), should the calculation of proportions be based on the "transportation service output" of these technologies for that year, as derived from queries?

5. about the base year

In other issues, I've seen that the base year for GCAM V7 is 2015. Does this mean that modifications to data should preferably be made for periods subsequent to 2015, rather than adjusting data before this base year?

These concerns have emerged over the past week. Your attention to these matters is greatly appreciated, and I look forward to receiving help and answers. Thank you so much!!

pkyle commented 3 months ago

Sorry, just seeing this. Lots of issues these days 😆 UCD_trn_data_CORE.csv is a transportation database that was put together in the 2012-2013 timeframe and is documented independently in Mishra et al. (2013). OTAQ_trn_data_EMF37.csv contains some updated values and new technologies. The two data tables are joined in the code in a way that the latter over-writes the former, without dropping any data. In the R code it's just anti_join (on the ID variables, including the year) followed by bind_rows. The specific parameters found in the XML are the results of unit conversions, weighted averages, and some ancillary assumptions. The transportation databases have 15 global regions which are downscaled to the country level and then re-aggregated to GCAM regions. They have region-specific vehicle size classes, based on the makes and models of specific vehicles that are popular in each region in the base year, which are aggregated into 3 passenger and 3 freight vehicle size classes using standardized mass cutoffs. E.g. Mini Cars are <1000kg, Cars are 1000-1600kg, and Large Car and Truck are >1600kg. The specific mappings are in energy/mappings/UCD_size_class_revisions.csv. The vehicle costs are levelized based on assumptions of annual travel per vehicle (in the input database) and an exogenous discount rate, indicated in constants.R. So, there are a lot of steps between the CSV input tables and what is seen in the XML, but it's all in the R code (files with the strings 154, 254 in their names). Also the share-weight values in A54.globaltranTech_shrwt_revised.csv are complemented with interpolation rules in A54.globaltranTech_interp_revised.csv, so the values in the model output are the result of their merge.
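
To illustrate the over-write step described above, here is a minimal sketch of the anti_join followed by bind_rows pattern, written in plain Python rather than the R/dplyr code gcamdata actually uses. The column names (region, technology, variable, year, value), the overwrite_join and make_row helpers, and the 2010 value are hypothetical and only for illustration; the two 2020 values are the ones quoted in question 3.

```python
# Hypothetical ID columns; the real tables use their own column names.
ID_COLS = ("region", "technology", "variable", "year")


def overwrite_join(base_rows, update_rows, id_cols=ID_COLS):
    """Mimic anti_join(base, update, by = id_cols) followed by bind_rows(update)."""
    def key(row):
        return tuple(row[c] for c in id_cols)

    update_keys = {key(row) for row in update_rows}
    # anti_join: keep only base rows whose ID key is absent from the update table
    kept = [row for row in base_rows if key(row) not in update_keys]
    # bind_rows: append every row of the update table
    return kept + list(update_rows)


def make_row(year, value):
    """Build one illustrative record (hypothetical layout)."""
    return {"region": "USA", "technology": "FCEV",
            "variable": "Capital costs (purchase)", "year": year, "value": value}


core = [make_row(2010, 1.0),        # placeholder value, not real data
        make_row(2020, 21251.17)]   # 2020 value quoted from UCD_trn_data_CORE.csv
otaq = [make_row(2020, 36146.89378)]  # 2020 value quoted from OTAQ_trn_data_EMF37.csv

merged = overwrite_join(core, otaq)
values_by_year = {row["year"]: row["value"] for row in merged}
print(values_by_year)
```

In this sketch the 2020 row ends up with the OTAQ value, while the 2010 row, which has no counterpart in the update table, is carried over unchanged from the base table. That is the sense in which the later table over-writes the earlier one without dropping any data.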

SoybeanMilk2016 commented 3 months ago

@pkyle Thank you for your insightful response! You have resolved many of my confusions! I will further explore GCAM, guided by the documentation and your recommendations.

JGCRI / gcam-core

Issues on Transportation Sector CSV Data in GCAM #406