IndEcol / IE_data_commons

Code and documentation for a commons of structured industrial ecology data
MIT License

How to identify the correct attribute? #8

Closed nheeren closed 5 years ago

nheeren commented 6 years ago

Probably I am overlooking something, but where can I figure out what attribute number is being used? It is given in the template file, but I cannot find this information in the database itself.

stefanpauliuk commented 6 years ago

For the attribute names: the numbers refer to the attribute name columns in classification_items. In the template, 'Aspects_Attribute_No' = 1 means that the classification item for this aspect given in the data sheet must match an entry of the 'attribute1' column of classification_items for the given classification. 'Aspects_Attribute_No' = 3 means it must match the items in the classification_items.attribute3 column. See examples in https://github.com/IndEcol/IE_data_commons/tree/master/IEDC_content_fill/Dataset_Upload

The reason for this feature is that we want people to supply their data with the classification items that they find convenient. If someone has a dataset with region ISO codes and our database already has a mapping between ISO codes and country names, it would be tedious to ask the data suppliers to do an ISO-code-to-name conversion, or other conversions, themselves. Instead, they supply their data in ISO codes or with names, as they wish, and our parser does the matching easily. Same with chemical elements (you can submit datasets with 'iron', 'copper' or 'Fe', 'Cu', or '26', '29').
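
The matching described above can be sketched as follows. This is a minimal illustration, not the actual IEDC parser: the in-memory rows, the `id` field, and the `match_item` helper are assumptions made for the example. Only the `attributeN` column naming comes from the thread.

```python
# Hypothetical in-memory stand-in for the classification_items table.
# Only the attributeN column naming comes from the thread; rows are illustrative.
CLASSIFICATION_ITEMS = [
    {"id": 1, "attribute1": "iron", "attribute2": "Fe", "attribute3": "26"},
    {"id": 2, "attribute1": "copper", "attribute2": "Cu", "attribute3": "29"},
]


def match_item(value, attribute_no, items=CLASSIFICATION_ITEMS):
    """Return the id of the single item whose attribute<N> column equals value."""
    column = f"attribute{attribute_no}"
    matches = [row["id"] for row in items if row.get(column) == str(value)]
    if len(matches) != 1:
        # Zero or several hits: the chosen attribute column cannot identify the item.
        raise ValueError(f"{value!r} matched {len(matches)} items in {column}")
    return matches[0]


# The same item is found whichever attribute column the supplier chose.
by_name = match_item("iron", 1)    # Aspects_Attribute_No = 1
by_symbol = match_item("Fe", 2)    # Aspects_Attribute_No = 2
by_number = match_item(26, 3)      # Aspects_Attribute_No = 3
```

Raising on anything other than exactly one match is a design choice of this sketch; it makes a non-unique attribute column fail loudly instead of silently picking an item.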

nheeren commented 6 years ago

Hmm, I see. Probably I was overthinking this. Once the data is in the db, the attribute doesn't matter anymore, because attribute columns in table classification_items are synonymous to each other – correct?

stefanpauliuk commented 6 years ago

Yes and no ;) Attribute columns don't have to be synonymous, they can also contain groupings, e.g., of countries to regions. Right now, we do not distinguish between columns that are synonymous and those that are not, and this is a design flaw. Could have two groups of attribute columns in the future, one that is synonymous to attribute 1 and one that isn't.

Right now, it is the responsibility of the data providers to use attributes that work (attribute 1 always works). I included this feature because I am sick and tired of converting country ISO codes to names and vice versa :) So far, attribute 2 and beyond have been specified for ISO regions and chemical elements only.

nheeren commented 6 years ago

If they are not synonymous the database needs to somehow provide the information on which attribute is correct. As far as I can see this information is only given in the template file and not being transferred.

How to resolve this issue? Make it part of the wishlist?

stefanpauliuk commented 6 years ago

Attribute 1 is always correct as it must be unique for each classification (but can be overlapping). My take on this is that we divide the attributes into those that must be 1:1 and those that can represent aggregations or other information. Let's see during our next call!
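
One way to picture the proposed split into 1:1 attributes and aggregating ones is the check below. It is only a sketch under assumptions: the rows are made up, `attribute4` is a hypothetical country-to-region grouping column, and the helper names are not part of the IEDC code.

```python
from collections import Counter

# Illustrative rows; attribute4 is a hypothetical grouping column (country -> region).
ROWS = [
    {"attribute1": "Germany", "attribute2": "DE", "attribute4": "Europe"},
    {"attribute1": "France", "attribute2": "FR", "attribute4": "Europe"},
    {"attribute1": "Japan", "attribute2": "JP", "attribute4": "Asia"},
]


def is_one_to_one(rows, column):
    """True if every row has a value in `column` and no value repeats."""
    values = [row[column] for row in rows if row.get(column) not in (None, "")]
    return len(values) == len(rows) and all(n == 1 for n in Counter(values).values())


def split_attribute_columns(rows):
    """Partition attribute columns into 1:1 identifiers and everything else."""
    columns = sorted({key for row in rows for key in row if key.startswith("attribute")})
    one_to_one = [c for c in columns if is_one_to_one(rows, c)]
    other = [c for c in columns if c not in one_to_one]
    return one_to_one, other


one_to_one, other = split_attribute_columns(ROWS)
print(one_to_one)  # columns usable as synonyms of attribute1
print(other)       # groupings or incomplete columns
```

In this sketch a column with missing values lands in the second group, which matches the "aggregations or other information" bucket only loosely.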

tmillross commented 6 years ago

Just a suggestion from browsing this issue and this comment: an entity-attribute-value data model may be useful here to side-step some of these challenges in attribute identification, properties, uniqueness, etc. Though I appreciate that switching (or even considering it) is not simple.

nheeren commented 6 years ago

Thanks @tmillross. Can you give some more detail? What is that, and what could it do for us? Would you have some references?

stefanpauliuk commented 6 years ago

EAV data model: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

The reason why I did not go for an RDF database or an EAV data model in general is twofold: a) When just compiling data in triples, you get to add a lot of attributes but you don't see much of the structure of the data. It's too bazaar-like for the main purpose of the IEDC, which is being a platform for exchanging data that need a structure to be allocated in the industrial system. The IEDC data model shows that the different datasets we use are related to each other in that they fit into a common data structure. b) The EAV or RDF toolchain is much less handy than the one for SQL. To set up a convincing and broadly accessible prototype for RDF you would need a full-time software developer, whereas for SQL, the only bottleneck is the web interface, which Mahadi is taking care of using off-the-shelf libraries.

Since the SQL database is open, one can download data and convert them to EAV for further testing anyway, so this road is not blocked, but I don't want to walk down it.
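
Such a conversion could look roughly like this. It is a sketch only: the rows stand in for a downloaded table, and the `id` key and the `to_triples` helper are assumptions, not part of the IEDC download code.

```python
def to_triples(rows, entity_key="id"):
    """Flatten wide rows into (entity, attribute, value) triples, skipping empty cells."""
    triples = []
    for row in rows:
        entity = row[entity_key]
        for attribute, value in row.items():
            if attribute == entity_key or value is None:
                continue
            triples.append((entity, attribute, value))
    return triples


# Illustrative rows standing in for a downloaded table.
rows = [
    {"id": 1, "attribute1": "iron", "attribute2": "Fe", "attribute3": None},
    {"id": 2, "attribute1": "copper", "attribute2": "Cu", "attribute3": "29"},
]
triples = to_triples(rows)
for triple in triples:
    print(triple)
```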

tmillross commented 6 years ago

I understand your reasoning, Stefan: the model is intended as an illustration of the relationships between data structures used for different methodologies. Something that's important to grasp to develop the field!

I'm not proposing to shift away from the SQL database. Rather that it's easy also to build EAV-style tables in relational databases. It's usually a bad idea, but in some cases it makes sense. Particularly when the attributes and/or object types to be stored are not fully known in advance, and flexibility is desired until the domain knowledge grows. It may simplify data loading, for the time when you hope to receive data submissions from the community. The common structure could still be clear, when coupled with Views on the data which pivot it back to row-column format. But indeed there are plenty of good reasons for your choice.
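
As a rough illustration of an EAV-style table inside a relational database, with a view that pivots it back to row-column format, here is a small sketch using Python's built-in sqlite3 module. The table, view, and attribute names (`item_attributes`, `items_wide`, `name`, `symbol`, `atomic_number`) are invented for the example and are not part of the IEDC schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- EAV-style table: one row per (item, attribute) pair.
    CREATE TABLE item_attributes (
        item_id   INTEGER NOT NULL,
        attribute TEXT    NOT NULL,
        value     TEXT    NOT NULL,
        PRIMARY KEY (item_id, attribute)
    );

    -- View that pivots the triples back into one row per item.
    CREATE VIEW items_wide AS
    SELECT item_id,
           MAX(CASE WHEN attribute = 'name'          THEN value END) AS name,
           MAX(CASE WHEN attribute = 'symbol'        THEN value END) AS symbol,
           MAX(CASE WHEN attribute = 'atomic_number' THEN value END) AS atomic_number
    FROM item_attributes
    GROUP BY item_id;
""")

# New attributes can be loaded without altering the table definition.
conn.executemany(
    "INSERT INTO item_attributes VALUES (?, ?, ?)",
    [
        (1, "name", "iron"), (1, "symbol", "Fe"), (1, "atomic_number", "26"),
        (2, "name", "copper"), (2, "symbol", "Cu"),
    ],
)

wide_rows = conn.execute("SELECT * FROM items_wide ORDER BY item_id").fetchall()
for row in wide_rows:
    print(row)
```

The trade-off is visible even here: loading stays flexible, but each pivoted column has to be listed by hand in the view, so the view needs updating whenever a new attribute should appear as a column.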

I could elaborate but don't want to distract. Perhaps I'll prototype this suggestion if it seems like a good idea once more familiar with the rest of the design and intended workflow. Cheers

stefanpauliuk commented 6 years ago

Thanks Tom! "Rather that it's easy also to build EAV-style tables in relational databases. It's usually a bad idea, but in some cases it makes sense. Particularly when the attributes and/or object types to be stored are not fully known in advance, and flexibility is desired until the domain knowledge grows." --> I agree! Niko also made a similar suggestion for how to break down the multiple attributes.

It's all a question of where to allocate our scarce resources. I think that right now, it's essential to have a working prototype with a graphical interface that contains a collection of datasets that is as diverse as possible. And this we almost have.

From there we can start branching off, and the database structure and engine are clearly a major area for development. But we will need to prioritize what to improve. Structural changes are expensive, in particular since they would require us to change all upload and download code. Hence proper testing is needed beforehand to prove that the change leads to a major improvement, and that can easily mean several months of working time.

nheeren commented 5 years ago

Closing issue as it has become too broad in this discussion. We might come back to parts of it at a later stage.