How to handle stats_array values?

IndEcol / IE_data_commons

Code and documentation for a commons of structured industrial ecology data

MIT License

22 stars 2 forks source link

How to handle stats_array values? #14

Closed nheeren closed 5 years ago

nheeren commented 5 years ago

The file 3_LT_AluCycle_LIU_2012.xlsx contains values for stats_arrays, e.g. "3;Value;0.3*Value;none;none;none". How should this be uploaded to the DB. The table 'data' has the columns 'stats_array_1' to 'stats_array_4'. So this is 6 values vs. 6. Also the columns in 'data' are integers.

The following input files are affected: '3_LT_AluCycle_LIU_2012.xlsx', '3_LT_AluCycle_LIU_2013.xlsx', '3_LT_IAI_GARC_2011.xlsx', '3_LT_MetalDemand_DEETMAN_2018.xlsx', '3_LT_SteelCycle_PAULIUK_2013.xlsx', '3_MC_SteelDemand_HU_2010.xlsx'

stefanpauliuk commented 5 years ago

It means the following: stats_array_1 = 3 stats_array_2 = Value (the value of the data item) stats_array_3 = 0.3 * Value stats_array_4 = 0/none the other two 'none's don't matter, in a very earliy version, stats_arrray had 6 entries, but I squeezed them together into 4 columns to reduce db memory usage.

How to upload: if stats_array_2 = 'Value': # or the part 2 of the parsing string that is 'Value' stats_array_2 = [thisvalue] if stats_array_2 = '0.3 Value': # or the part 3 of the parsing string that is '0.3 Value' stats_array_2 = 0.3 * [thisvalue]

That is not very professional. Do you know whether there is an easy way to recognize and parse simple equations strings (+-*/) in Python? Data suppliers should be able to write them down into the templates.

For now, I changed strings so that current parser can upload these data:

E.g., String "3;10;3.0;none;" must be parsed into stats_array_1 = 3 stats_array_2 = 10 stats_array_3 = 3.0 stats_array_4 = 0/none

nheeren commented 5 years ago

This means I will have to write a little parser.

stats_array_4 = 0/none

So should this be 0 or none?

the other two 'none's don't matter, in a very earliy version, stats_arrray had 6 entries, but I squeezed them together into 4 columns to reduce db memory usage.

I think having two columns with a short string more or less wouldn't blow our DB out of proportions

That is not very professional. Do you know whether there is an easy way to recognize and parse simple equations strings (+-*/) in Python? Data suppliers should be able to write them down into the templates.

One could simply use Python notation. Then it can be processed with eval(), e.g.:

s = "1 + 3 * 4 ** 2"
eval(s)
Out[3]: 
49

E.g., String "3;10;3.0;none;" must be parsed into

I propose to drop the last semicolon as it confuses man and machine :)

nheeren commented 5 years ago

For now, I changed strings so that current parser can upload these data:

Does that mean no more calculation (e.g. 0.3 * Value) necessary?

I think this would be the better approach.

stefanpauliuk commented 5 years ago

In the example above: stats_array_4 = none eval looks like the right thing to use, but for now, the data templates have all values already calculated. No more calculation needed at this point. Sure, let's drop the last ; Anyway, only the first 4 segments returned by uncertaintystring.split(';') matter.

nheeren commented 5 years ago

Parser implemented in https://github.com/IndEcol/IEDC_tools/commit/720ce48ff783e3ba2492784769dc2f4a98600ea5