Provenance config file - Githubissues

IKnowLogic commented 4 years ago

Purpose of Pull Request

Enhancement: data providers and datasets are now defined in a configuration file. Also corrects an error relating to the origin of provenance.

Enhancement

All data providers and datasets should now be added to a config.json file, which makes their provenance much easier to administer. The config_parser file parses the config file and checks it against various requirements and limitations, such as the uniqueness of data provider names.
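As a minimal sketch of the kind of check described above (this is not the actual config_parser code), the snippet below assumes a hypothetical config.json layout with top-level "providers" and "datasets" lists, where every entry has a "name" and every dataset names its "provider". Those keys and all function names are assumptions made for illustration only.

```python
import json
from collections import Counter


def find_duplicates(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


def check_config(config):
    """Validate a parsed config dict and raise ValueError on the first problem.

    The "providers", "datasets", "name" and "provider" keys are hypothetical.
    """
    provider_names = [p["name"] for p in config.get("providers", [])]
    duplicates = find_duplicates(provider_names)
    if duplicates:
        raise ValueError(f"Duplicate data provider names: {duplicates}")

    # Every dataset must point at a provider that is declared in the file.
    known = set(provider_names)
    for dataset in config.get("datasets", []):
        if dataset["provider"] not in known:
            raise ValueError(
                f"Dataset {dataset['name']!r} refers to unknown provider "
                f"{dataset['provider']!r}"
            )
    return config


def load_config(path):
    """Read a JSON config file from disk and validate it."""
    with open(path, encoding="utf-8") as handle:
        return check_config(json.load(handle))
```

Failing loudly at load time keeps a misconfigured provider from silently ending up in the generated provenance.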

Origin of provenance

Since we would like to track which sources (datasets and providers) individual pieces of data stem from, we need perfect, unambiguous tracing of this information. Previously, every dataset was related to a single RDF entity (arborist_script) representing the arborist script, which in turn was related to the named graphs extracted from the datasets. Below is a (rough) sketch:

Unfortunately, this poses a problem: with multiple input datasets, we can no longer perfectly trace which original dataset the triples in an extracted named graph came from. We therefore have to change the design to allow for perfect tracing. The new implementation follows the example sketch below:

Each dataset is now associated with its own unique RDF entity, representing the extraction activity used to create named graphs from that dataset.
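The per-dataset design can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code in this pull request: it uses the W3C PROV-O terms prov:Activity, prov:used and prov:wasGeneratedBy, which the repository may or may not use, represents triples as plain Python tuples, and invents the example.org namespace and the URI layout purely for illustration.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/provenance/"  # placeholder namespace (assumption)


def extraction_triples(dataset, named_graphs):
    """Build triples linking one dataset to its graphs via its own activity."""
    dataset_uri = BASE + "dataset/" + dataset
    activity_uri = BASE + "extraction/" + dataset  # one activity per dataset
    triples = [
        (activity_uri, RDF_TYPE, PROV + "Activity"),
        (activity_uri, PROV + "used", dataset_uri),
    ]
    for graph in named_graphs:
        graph_uri = BASE + "graph/" + graph
        triples.append((graph_uri, PROV + "wasGeneratedBy", activity_uri))
    return triples


def source_datasets(triples, graph_uri):
    """Trace a named graph back to the dataset(s) its activity used."""
    activities = {
        o for s, p, o in triples
        if s == graph_uri and p == PROV + "wasGeneratedBy"
    }
    return {
        o for s, p, o in triples
        if s in activities and p == PROV + "used"
    }
```

With a single shared script entity, every named graph would trace back to every input dataset. With one activity per dataset, `source_datasets` returns exactly the dataset the graph was extracted from.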

Currently, the implementation is rough due to time constraints but will be issued in a later pull request.

kuzeko commented 4 years ago

When you say

> Currently, the implementation is rough due to time constraints but will be issued in a later pull request.

What do you mean exactly?

IKnowLogic commented 4 years ago

@kuzeko By "rough" I am only referring to code style (syntactic sugar). The implementation should work just fine but needs refactoring at some point.

As an example, the config parser is not very elegant and should perhaps be converted into a class instead of a small library of functions. I at least think that would be better.
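A class-based parser along those lines might look roughly like the sketch below. The class name, method names and the "providers"/"datasets"/"name" keys are all hypothetical and do not reflect the current config_parser functions or the real config.json schema.

```python
import json


class ProvenanceConfig:
    """Hypothetical class wrapper around a parsed config.json."""

    def __init__(self, raw):
        self._providers = {}
        self._datasets = {}
        for provider in raw.get("providers", []):
            self._add(self._providers, provider, "data provider")
        for dataset in raw.get("datasets", []):
            self._add(self._datasets, dataset, "dataset")

    @classmethod
    def from_file(cls, path):
        """Alternative constructor that reads the JSON file from disk."""
        with open(path, encoding="utf-8") as handle:
            return cls(json.load(handle))

    @staticmethod
    def _add(registry, entry, kind):
        """Register an entry under its name, rejecting duplicate names."""
        name = entry["name"]
        if name in registry:
            raise ValueError(f"Duplicate {kind} name: {name!r}")
        registry[name] = entry

    def provider(self, name):
        return self._providers[name]

    def dataset(self, name):
        return self._datasets[name]
```

Keeping validation in the constructor means an instance can only exist for a config that passed the checks, which is one argument for a class over free functions.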

kuzeko commented 4 years ago

@IKnowLogic I see, that's fine. What about the other comments, are they addressed?

IKnowLogic commented 4 years ago

> Is this all the information needed? What about single files, like
>
> https://github.com/BONSAMURAIS/arborist/blob/039721ce42b820d406c4c68acd1160faa56f47ef/arborist/exiobase_metadata.py#L17
>
> Or this information
>
> https://github.com/BONSAMURAIS/arborist/blob/039721ce42b820d406c4c68acd1160faa56f47ef/arborist/exiobase_metadata.py#L30-L32
>
> Also, how do we include information for these files: https://github.com/BONSAMURAIS/arborist/blob/provenance-config-file/arborist/exiobase_us_epa.py https://github.com/BONSAMURAIS/arborist/blob/provenance-config-file/arborist/entsoe.py

@kuzeko Regarding the above comment: I think you are right that extra information such as description, title and author could be part of the config file for datasets. I was only thinking about provenance when creating the attributes currently found in the config file.

Should single files ever be used? The case you are pointing at is actually just an aggregation of information from the exiobase dataset, which is also why we attribute the exiobase dataset rather than the file itself.

For the entsoe and exiobase_us_epa datasets, I simply do not know what to do yet. It seems like this whole repo needs some restructuring and firm rules for how to add new datasets. In the best case, the config file would be the only place where you need to add information about datasets.

IKnowLogic commented 4 years ago

> Should this information be in the metadata/config file?

@kuzeko I am not sure what you are referring to here.

kuzeko commented 4 years ago

@IKnowLogic OK, let's move title, description, etc. to the config file.

Let's discuss options for the "single file" thing, draft some guidelines based on what we did with Exiobase, and open an issue for the missing information in the other two, entsoe and us_epa.

IKnowLogic commented 4 years ago

> I'm referring to exiobase_version = "3.3.17"; not sure why the comment is not pointing to the right line

@kuzeko Correct, this is already part of the config.json file.

IKnowLogic commented 4 years ago

> @IKnowLogic OK, let's move title, description, etc. to the config file.
>
> Let's discuss options for the "single file" thing, draft some guidelines based on what we did with Exiobase, and open an issue for the missing information in the other two, entsoe and us_epa.

@kuzeko Sorry, I was too fast here. Title, description, author, etc. should probably not be part of the configuration file, since they relate to specific named graphs and not to the entire dataset. We might, however, be able to do something about this information when restructuring the way we incorporate datasets in the future. For now, I think we should leave them where they are.

I can write up the issues for the missing information, but I do not know enough about our current way of integrating new datasets to write up a guideline. What I am missing is information about data files such as exiobase_classifications_v_3_3_17.xlsx: how did they come into existence? This is a crucial step in the integration of new datasets.

kuzeko commented 4 years ago

@IKnowLogic I understand. It is just odd that we then have part of the information in the config file and part of it in the Python code.

Anyway, let's keep it this way for now and come back to this issue when we integrate new datasets.

One thing though, in the config

https://github.com/BONSAMURAIS/arborist/blob/039721ce42b820d406c4c68acd1160faa56f47ef/arborist/data/config.json#L52-L63

there is some dummy test data. I believe it would be better to keep this in your local dev branch and not in master.

kuzeko commented 4 years ago

This will be achieved with a different code structure.

BONSAMURAIS / arborist

Provenance config file #17