Database redo input file errors

nheeren commented 5 years ago

@stefanpauliuk could you please look into the following issues (Somehow I feel like we solved them before. Did we overwrite the files??):

The files 1_UPI_USLCI_Aluminum_cold_rolling_at_plant.xlsx, 1_UPI_USLCI_Limestone_at_mine.xlsx, 1_UPI_USLCI_Chainsawing_delimbing.xlsx, and 1_UPI_USLCI_steel_liquid_at_plant.xlsx use "None" values in aspects 6 and 7. These do not exist in classification 2 (regions_iso_iedc).

More to follow...

stefanpauliuk commented 5 years ago

After revising the data model, the regional and process aspects can be 'unspecified'. Will update the templates now and insert 'unspecified' into the relevant classifications if not present yet.
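A minimal sketch of what that substitution could look like on the parser side. The function name `normalise_aspect_value` and the region codes are hypothetical and only illustrate the idea; this is not how IEDC_tools does it.

```python
def normalise_aspect_value(value, classification_items, fallback="unspecified"):
    """Map missing aspect values to the fallback item.

    None, empty strings and the literal string "None" count as missing.
    The fallback is appended to the classification if it is not there yet.
    Any other value must already be an item of the classification.
    """
    if value is None or str(value).strip() in ("", "None"):
        if fallback not in classification_items:
            classification_items.append(fallback)
        return fallback
    if value not in classification_items:
        raise ValueError(f"{value!r} is not an item of the classification")
    return value


# Hypothetical excerpt of a regional classification, for illustration only.
regions = ["DE", "US", "CN"]
print(normalise_aspect_value("None", regions))
print(normalise_aspect_value("US", regions))
print(regions)
```

Raising on unknown items (instead of silently mapping them to 'unspecified') is an assumption made here so that typos in the templates stay visible.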

nheeren commented 5 years ago

Thanks!

nheeren commented 5 years ago

3_LT_SteelCycle_PAULIUK_2013.xlsx: I assume the cell G12 in the Cover sheet should be "1" and not "custom". Please confirm.

nheeren commented 5 years ago

Thanks for fixing!

nheeren commented 5 years ago

7_CT_EXIOBASEv3_200Products_To_163Products.xlsx: aspect_1 and aspect_2 do not correspond to column names in the Data sheet. Simply rename the data column names?
Need to check further: 1_F_LiquidMetalFlows_SteelScrapAge_Pauliuk_2013.xlsx, 1_F_MetalDemand_DEETMAN_2018.xlsx: Data sheets have no headers.

stefanpauliuk commented 5 years ago

7_CT_EXIOBASEv3_200Products_To_163Products.xlsx: Exactly, just a rename.

1_F_LiquidMetalFlows_SteelScrapAge_Pauliuk_2013.xlsx, 1_F_MetalDemand_DEETMAN_2018.xlsx: These are table data, which have no headers, just classification items.
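For list-type files such as the EXIOBASE correspondence table, a check along these lines could flag aspect names without a matching Data sheet column before and after the rename. This is only a sketch: both helper functions and all column names are hypothetical, and table-type files (which have no headers) would have to be skipped.

```python
def missing_aspect_columns(aspect_names, data_header):
    """Return the aspect names that have no matching column in the header."""
    header = {str(h).strip() for h in data_header}
    return [name for name in aspect_names if name not in header]


def rename_columns(data_header, mapping):
    """Return a new header with columns renamed according to mapping."""
    return [mapping.get(h, h) for h in data_header]


# Hypothetical names, for illustration only.
aspects = ["products_200", "products_163"]
header = ["from_product", "to_product", "value"]

print(missing_aspect_columns(aspects, header))  # both aspects are missing

fixed = rename_columns(header, {"from_product": "products_200",
                                "to_product": "products_163"})
print(missing_aspect_columns(aspects, fixed))   # nothing missing after rename
```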

nheeren commented 5 years ago

My bad about 1_F_LiquidMetalFlows_SteelScrapAge_Pauliuk_2013.xlsx, 1_F_MetalDemand_DEETMAN_2018.xlsx.

3_IUP_Vehicles_9Countries_Dhaniati_2012.xlsx contains different values to encode NULL. We seriously need a definition.

stefanpauliuk commented 5 years ago

Good point! For this particular example, an empty cell in the template means 'no data available in this dataset', with the emphasis on "this dataset". In iedc.data, the numbers will be stored in a list format, and when exported again (as a list), only the non-zero values would be provided.

The main question here is: should empty cells get a data table entry or not?

To specify whether or not to enter data, I suggest distinguishing the following cases and putting a corresponding string into the cell:

1) No information available: string "N.I.A.", leads to a NULL entry in the database. Further details (number lacking, not readable, not applicable, etc.) should be provided in the comment field or sheet and moved to iedc.data (e.g. the example cases you used in the building data paper).

2) No data: string "N.D.", is ignored by the parser and does not lead to an entry in the database.

For 3_IUP_Vehicles_9Countries_Dhaniati_2012.xlsx, it should all be "N.D.", hence no data table entry.
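The two cases could be told apart by the parser roughly as follows. This is a sketch under assumptions: `parse_cell` is a hypothetical helper, the marker strings are the ones proposed above, and rejecting empty cells is an extra assumption made here to force an explicit choice.

```python
NO_INFORMATION = "N.I.A."  # proposed marker: create an entry with value NULL
NO_DATA = "N.D."           # proposed marker: skip the cell, no entry at all


def parse_cell(cell):
    """Return (insert, value) for one template cell.

    insert says whether a database entry should be created;
    value is a float, or None for a NULL entry.
    """
    if cell is None or (isinstance(cell, str) and cell.strip() == ""):
        # Assumption: empty cells are ambiguous and must be marked explicitly.
        raise ValueError("empty cell: use 'N.I.A.' or 'N.D.'")
    if isinstance(cell, str):
        text = cell.strip()
        if text == NO_DATA:
            return (False, None)
        if text == NO_INFORMATION:
            return (True, None)
        return (True, float(text))
    return (True, float(cell))


for cell in [3, " 2.5 ", "N.I.A.", "N.D."]:
    print(repr(cell), parse_cell(cell))
```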

stefanpauliuk commented 5 years ago

PS: 1) The no-data string "N.D." is important for table data: datasets with different scope (e.g. spanning different years) can still be put together in one table, with unused columns filled with "N.D.".

2) The N.I.A. and N.D. strings are suggestions from my side only, please replace if you have better ideas!
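To illustrate point 1), here is a sketch of how a table could be flattened into list records while skipping the "N.D." cells. The function `table_to_list` is hypothetical, and the row items, years and numbers are made up for the example.

```python
def table_to_list(row_items, col_items, table, skip_marker="N.D."):
    """Flatten a table into (row_item, col_item, value) records.

    Cells equal to skip_marker produce no record at all.
    """
    records = []
    for row_item, row in zip(row_items, table):
        for col_item, cell in zip(col_items, row):
            if cell == skip_marker:
                continue
            records.append((row_item, col_item, cell))
    return records


# Two made-up datasets covering different years, stored in one table.
rows = ["dataset_A", "dataset_B"]
years = [2000, 2001, 2002]
table = [
    [1.0, 2.0, "N.D."],   # dataset_A has no 2002 column
    ["N.D.", 5.0, 6.0],   # dataset_B has no 2000 column
]

for record in table_to_list(rows, years, table):
    print(record)
```

Only four records come out of the six cells, which matches what a LIST template for the same data would contain.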

nheeren commented 5 years ago

This is a tricky question (which would deserve its own issue). Since we chose DOUBLE as the data type, we can encode missing, no data, null, na, etc. only as NULL or 0. As we describe in our codebook in the material intensity project, there can be different types of missing data.

For now I will not change any of those values in the data (Excel) files, but have IEDC_tools replace them with NULL values.
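Roughly, the replacement step could look like the sketch below. It is not the actual IEDC_tools code: `to_double_or_null` and the `MISSING_MARKERS` set are assumptions, and an in-memory SQLite table stands in for the real database just to show that a DOUBLE column can only hold a number or NULL.

```python
import math
import sqlite3

# Assumed set of strings that encode "missing"; the real files may use others.
MISSING_MARKERS = {"", "na", "n/a", "nan", "null", "none", "-", "n.d.", "n.i.a."}


def to_double_or_null(cell):
    """Return a float, or None (stored as NULL) for any missing-value marker."""
    if cell is None:
        return None
    if isinstance(cell, str):
        cell = cell.strip()
        if cell.lower() in MISSING_MARKERS:
            return None
    value = float(cell)
    return None if math.isnan(value) else value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, value DOUBLE)")

cells = [12.5, "NA", "", "0", None]
conn.executemany(
    "INSERT INTO data (value) VALUES (?)",
    [(to_double_or_null(c),) for c in cells],
)
stored = [row[0] for row in conn.execute("SELECT value FROM data ORDER BY id")]
print(stored)
```

Note that a genuine 0 stays 0 and is kept distinct from NULL; the reason why a value is missing is lost in this step.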

Should we create a new issue "Data encoding guidelines" or "missing values"? Maybe others are willing to contribute.

stefanpauliuk commented 5 years ago

Let's keep it simple! The first question is: Should 'no data' in an Excel template be inserted or not?

Here, see my previous comment. In the case of 3_IUP_Vehicles_9Countries_Dhaniati_2012.xlsx, the empty cells should NOT be inserted: the table format was chosen for reasons of convenience only, and in a LIST template the blank cells would not have had a corresponding 'no data' entry. Hence, the empty cells in this template should be filled with a string that we mark for this case, e.g. "N.D." as suggested above.

If they are inserted, I agree with you that the cases can be differentiated. Each will lead to NULL for data.value plus a comment on why this NULL is there, e.g. based on the scheme in the material intensity codebook.

nheeren commented 5 years ago

Closing issue as it has become too broad in this discussion. We might come back to parts of it at a later stage.

IndEcol / IE_data_commons

Database redo input file errors #20