How to deal with empty data cells in LIST or TABLE data templates

stefanpauliuk commented 5 years ago

Solution: On the 'cover' sheet of the LIST and TABLE data templates, cell F5, a new Flag Insert_Empty_Cells_as_NULL can (but does not have to) be added to the template. The corresponding value for this flag is listed in cell G5 on the same sheet.

if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: don't insert empty data cells as NULL into the data table

For an example of this new flag in a working template, cf. 3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1.xlsx

nheeren commented 5 years ago

Note to self: Also look into https://github.com/IndEcol/IE_data_commons/issues/23#issuecomment-507737192, i.e. 6_MIP_SI.POV.GINI_WorldBank_2019

stefanpauliuk commented 4 years ago

There are now a few TABLE data templates that use this feature (cell G5).

nheeren commented 4 years ago

if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: don't insert empty data cells as NULL into the data table

I'd need a short feedback:

What should be inserted instead of NULL? 0?
Should it rather raise an error and fail?
Is it correct that G5 in 3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1 and 6_MIP_SI.POV.GINI_WorldBank_2019 is set to False?

nheeren commented 4 years ago

Two more (technically unrelated) issues:

3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1 needs values for dataset_size and comment. I wasn't sure how you calculate the former.
~~6_MIP_SI.POV.GINI_WorldBank_2019 is already in the database (dataset_id = 230). It is not impossible that I accidentally uploaded it, but I want to double-check. Did you upload it already?~~ My log actually clearly states this.

stefanpauliuk commented 4 years ago

The feature in the G5 cell is used to tell the script whether or not table cells with no data should be inserted, i.e., whether the data provider sais that this is valuable information or not:

ad 1) the data point is not inserted at all! (no entry): if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: No entry for this data point.

ad 2) No, just do not insert this data point. The reason for that is that when the TABLE format is used for reasons of convenience, there may be cells that were not present in the original datasets, and then, these cells also should not be inserted.

ad 3) yes.

stefanpauliuk commented 4 years ago

For the other questions:

For the dataset_size field, I have a script that scans through all datasets and updates the dataset size, so that is fine! You can just enter 1 here or so, I will then run that script tomorrow.

For dataset id 230 et al.: Yes, some (most?) of the datasets in the current upload folder are already uploaded during the June update, only some datasets are new, including some YSTAFDB datasets. Do you need a list?

nheeren commented 4 years ago

Thannks Stefan.

ad 1) the data point is not inserted at all! (no entry): if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: No entry for this data point.

OK. I see now. However, this does does create some ambiguity nevertheless. Right now there are already 17k values with NULL in the data table. We would need to make the second case the default (if nothing is specified, i.e. in the old format). The best would really be to use some kind of flag to make it clear that we are dealing with "intentional null values". That would mean we would need to change value from float to string in the future. Right now we don't really have any options for such a flag. Some people use some "magic" number like -1 or -9999, but I would strongly object to doing that.

For the dataset_size field, I have a script that scans through all datasets and updates the dataset size, so that is fine! You can just enter 1 here or so, I will then run that script tomorrow.

OK. Will use 1, because the column has a NOT NULL constraint.

For dataset id 230 et al.: Yes, some (most?) of the datasets in the current upload folder are already uploaded during the June update, only some datasets are new, including some YSTAFDB datasets. Do you need a list?

Hmm, not sure if we are misunderstanding each other. I am uploading 'G7 RECC/Other/IE_Data_Commons_Prototype/NEW_TABLE'. From those only 3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1 is still missing and 6_MIP_SI.POV.GINI_WorldBank_2019 I will have to delete and re-upload because it was uploaded with all the NULL values.

stefanpauliuk commented 4 years ago

Yes, it does create ambiguity, and the best is to use the COMMENT attribut of each individual data point to indicate any details about the null or other special values. Agree that -1 or -9999 is nonsense.

For the addtional data: I completely forgot to upload them to the DB, here they come: \Dropbox\G7 RECC\Other\IE_Data_Commons_Prototype\NEW_TABLE_2 It's another 11 datasets, the last of this update round. Including MetabolismofCities and IRP MFA data... and parts of the YSTAFDB!

nheeren commented 4 years ago

Yes, it does create ambiguity, and the best is to use the COMMENT attribut of each individual data point to indicate any details about the null or other special values.

Yes for now. In the long run it would be better to have a system that minimizes ambiguities. Exceptions in a database create unnecessary barriers. Imagine you would want to do data exploration or comparisons of many different datasets in an automated way. Something for IEDC 2.0 😃

For the addtional data: I completely forgot to upload them to the DB, here they come: \Dropbox\G7 RECC\Other\IE_Data_Commons_Prototype\NEW_TABLE_2 It's another 11 datasets, the last of this update round. Including MetabolismofCities and IRP MFA data... and parts of the YSTAFDB!

All of those have already been uploaded, correct?

stefanpauliuk commented 4 years ago

No, the data templates in NEW_TABLE_2 were created in November and are not uploaded yet.

nheeren commented 4 years ago

What happens if G5 == False, a value is null, but a comment exists?

Will skip for now.

stefanpauliuk commented 4 years ago

Yes, best to skip. Which dataset is it?

nheeren commented 4 years ago

3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1 (which just uploaded successfully).

stefanpauliuk commented 4 years ago

Passt so! Denn die Comments hier sind via copy-paste und drag and drop. entstanden, und können dann für die empty-Zellen ignoriert werden.

nheeren commented 4 years ago

6_MIP_SI.POV.GINI_WorldBank_2019 deleted & re-uploaded successfully. This closes this issue. Will now look into NEW_TABLE_2.

nheeren commented 4 years ago

Find further updates in my log: https://github.com/IndEcol/IE_data_commons/blob/master/doc/File_changes_Niko.md#18-dec-2019

IndEcol / IE_data_commons

How to deal with empty data cells in LIST or TABLE data templates #21