Closed stefanpauliuk closed 4 years ago
Note to self: Also look into https://github.com/IndEcol/IE_data_commons/issues/23#issuecomment-507737192, i.e. 6_MIP_SI.POV.GINI_WorldBank_2019
There are now a few TABLE data templates that use this feature (cell G5).
if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: don't insert empty data cells as NULL into the data table
I'd need a short feedback:
3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1
and 6_MIP_SI.POV.GINI_WorldBank_2019
is set to False
?Two more (technically unrelated) issues:
3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1
needs values for dataset_size
and comment
. I wasn't sure how you calculate the former.6_MIP_SI.POV.GINI_WorldBank_2019
is already in the database (dataset_id = 230). It is not impossible that I accidentally uploaded it, but I want to double-check. Did you upload it already?The feature in the G5 cell is used to tell the script whether or not table cells with no data should be inserted, i.e., whether the data provider sais that this is valuable information or not:
ad 1) the data point is not inserted at all! (no entry): if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: No entry for this data point.
ad 2) No, just do not insert this data point. The reason for that is that when the TABLE format is used for reasons of convenience, there may be cells that were not present in the original datasets, and then, these cells also should not be inserted.
ad 3) yes.
For the other questions:
For the dataset_size field, I have a script that scans through all datasets and updates the dataset size, so that is fine! You can just enter 1 here or so, I will then run that script tomorrow.
For dataset id 230 et al.: Yes, some (most?) of the datasets in the current upload folder are already uploaded during the June update, only some datasets are new, including some YSTAFDB datasets. Do you need a list?
Thannks Stefan.
ad 1) the data point is not inserted at all! (no entry): if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: No entry for this data point.
OK. I see now. However, this does does create some ambiguity nevertheless. Right now there are already 17k values with NULL in the data table. We would need to make the second case the default (if nothing is specified, i.e. in the old format). The best would really be to use some kind of flag to make it clear that we are dealing with "intentional null values". That would mean we would need to change value
from float to string in the future. Right now we don't really have any options for such a flag. Some people use some "magic" number like -1 or -9999, but I would strongly object to doing that.
For the dataset_size field, I have a script that scans through all datasets and updates the dataset size, so that is fine! You can just enter 1 here or so, I will then run that script tomorrow.
OK. Will use 1, because the column has a NOT NULL constraint.
For dataset id 230 et al.: Yes, some (most?) of the datasets in the current upload folder are already uploaded during the June update, only some datasets are new, including some YSTAFDB datasets. Do you need a list?
Hmm, not sure if we are misunderstanding each other. I am uploading 'G7 RECC/Other/IE_Data_Commons_Prototype/NEW_TABLE'. From those only 3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1
is still missing and 6_MIP_SI.POV.GINI_WorldBank_2019
I will have to delete and re-upload because it was uploaded with all the NULL values.
Yes, it does create ambiguity, and the best is to use the COMMENT attribut of each individual data point to indicate any details about the null or other special values. Agree that -1 or -9999 is nonsense.
For the addtional data: I completely forgot to upload them to the DB, here they come: \Dropbox\G7 RECC\Other\IE_Data_Commons_Prototype\NEW_TABLE_2 It's another 11 datasets, the last of this update round. Including MetabolismofCities and IRP MFA data... and parts of the YSTAFDB!
Yes, it does create ambiguity, and the best is to use the COMMENT attribut of each individual data point to indicate any details about the null or other special values.
Yes for now. In the long run it would be better to have a system that minimizes ambiguities. Exceptions in a database create unnecessary barriers. Imagine you would want to do data exploration or comparisons of many different datasets in an automated way. Something for IEDC 2.0 😃
For the addtional data: I completely forgot to upload them to the DB, here they come: \Dropbox\G7 RECC\Other\IE_Data_Commons_Prototype\NEW_TABLE_2 It's another 11 datasets, the last of this update round. Including MetabolismofCities and IRP MFA data... and parts of the YSTAFDB!
All of those have already been uploaded, correct?
No, the data templates in NEW_TABLE_2 were created in November and are not uploaded yet.
What happens if G5 == False, a value is null, but a comment exists?
Will skip for now.
Yes, best to skip. Which dataset is it?
3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1
(which just uploaded successfully).
Passt so! Denn die Comments hier sind via copy-paste und drag and drop. entstanden, und können dann für die empty-Zellen ignoriert werden.
6_MIP_SI.POV.GINI_WorldBank_2019
deleted & re-uploaded successfully. This closes this issue. Will now look into NEW_TABLE_2.
Find further updates in my log: https://github.com/IndEcol/IE_data_commons/blob/master/doc/File_changes_Niko.md#18-dec-2019
Solution: On the 'cover' sheet of the LIST and TABLE data templates, cell F5, a new Flag Insert_Empty_Cells_as_NULL can (but does not have to) be added to the template. The corresponding value for this flag is listed in cell G5 on the same sheet.
if G5 == True: insert empty data cells as NULL into the data table if G5 == False OR G5 is empty: don't insert empty data cells as NULL into the data table
For an example of this new flag in a working template, cf. 3_MC_Buildings_Heeren_Fishman_ScientificData_2019_V1.1.xlsx