IndEcol / IE_data_commons

Code and documentation for a commons of structured industrial ecology data
MIT License
22 stars 2 forks source link

Mismatch between Excel template and database #7

Closed nheeren closed 5 years ago

nheeren commented 6 years ago

The database's data table has the following columns

CREATE TABLE `data` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dataset_id` int(11) NOT NULL,
  `aspect1` int(11) NOT NULL,
  `aspect2` int(11) DEFAULT NULL,
  `aspect3` int(11) DEFAULT NULL,
  `aspect4` int(11) DEFAULT NULL,
  `aspect5` int(11) DEFAULT NULL,
  `aspect6` int(11) DEFAULT NULL,
  `aspect7` int(11) DEFAULT NULL,
  `aspect8` int(11) DEFAULT NULL,
  `aspect9` int(11) DEFAULT NULL,
  `aspect10` int(11) DEFAULT NULL,
  `aspect11` int(11) DEFAULT NULL,
  `aspect12` int(11) DEFAULT NULL,
  `value` double NOT NULL,
  `unit_nominator` int(11) NOT NULL,
  `unit_denominator` int(11) DEFAULT NULL,
  `stats_array_1` int(11) DEFAULT NULL,
  `stats_array_2` double DEFAULT NULL,
  `stats_array_3` double DEFAULT NULL,
  `stats_array_4` double DEFAULT NULL,
  `comment` text,
  `reserve1` varchar(255) DEFAULT NULL,
  `reserve2` varchar(255) DEFAULT NULL,
  `reserve3` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) 

The Excel template uses its aspects and then the following columns: value | unit nominator | unit denominator | stats_array string

That means they are spelled differently and will therefore not automatically match.

Probably I still haven't fully understood the concept. There are also Excel files in the template folder that use completely different structure in Values_Master an I wouldn't know how to parse them at all.

stefanpauliuk commented 6 years ago

In the templates, the cover sheet contains a 1:1 dataset table entry, but the actual data are NOT formatted in the IEDC format. Instead, the data can come in two format types: LIST (what you parsed) and TABLE. While LIST is easy to parse, for TABLE the index structure of row and columns has to be taken into account. This is why there is this funny blue box on the 'cover' sheet. The whole concept is meant to increase usefulness, as I don't want to force data suppliers to use either LIST or TABLE. Instead, they do whatever is easy for them, choose a table index structure that is convenient, and let the parser do the rest. The reading of such parameters is already implemented in ODYM: https://github.com/IndEcol/ODYM/blob/master/odym/modules/ODYM_Functions.py , function ReadParameter()

The parser then has to take apart the different pieces of information (stats_array_string, units), see examples in https://github.com/IndEcol/IE_data_commons/tree/master/IEDC_content_fill/Dataset_Upload

nheeren commented 6 years ago

Does that mean IEDC_tools should be able to parse LIST and TABLE type data?

So how is the parser supposed to map e.g. stats_array string to stats_array_1?

stefanpauliuk commented 6 years ago

Yes, both LIST and TABLE. To make it worse, for TABLE data users can specify to report units, uncertainty, and comments on the cover sheet (GLOBAL option, one common value for all data items, applies to most datasets so far), or on a separate sheet (TABLE option cf. \Dropbox\G7 RECC\Data\IE_Data_Commons_Prototype\Data_ODYM_Format\1_F_WIO_Japan_Nakamura_Kondo_2002.xlsx for an example)

The stats_array string is a shortcut to avoid having 4 columns each time. Parsing is done with str.split, see https://github.com/IndEcol/IE_data_commons/blob/master/IEDC_content_fill/Dataset_Upload/3_MC_NACEv2_4000Groups_Upload.py for an example:

# parse stats_array_string
UncString  = DataSheet.cell_value(currentrow ,currentcolumn)
UncParts   = S.split(';')
U1 = int(UncParts[0])
U2 = np.float(UncParts[1])
U3 = np.float(UncParts[2])
U4 = np.float(UncParts[3])

# U1 ... U4 is then used in SQL statement to fill stats_array_1 to stats_array_4.
nheeren commented 5 years ago

resolved