BONSAMURAIS / arborist

Generate the URIs needed for the BONSAI knowledge graph
BSD 3-Clause "New" or "Revised" License

Provenance #14

Closed IKnowLogic closed 4 years ago

IKnowLogic commented 4 years ago

The main reason for this pull request is to add provenance information to the flowObjects, activityTypes, and locations extracted from the EXIOBASE dataset. Broadly speaking, provenance records the origin of digital artifacts. It has many uses: determining ownership of and rights over artifacts, deciding whether artifacts can be trusted, checking whether correct methods were used to obtain a result, and documenting how an artifact was produced. Note that this version of provenance is only a temporary first solution to the problem of missing provenance information. In time, it will need to be extended to cover multiple datasets.

All changes are in the Arborist script, as provenance is best captured during data extraction from the original datasets. Below is an explanation of which Python files have been changed, along with how each change is reflected in the Turtle files.

foaf.py: Added the EXIOBASE consortium as an organization. Also added meta-information about the file itself, in accordance with issue 3 of the RDF repo (https://github.com/BONSAMURAIS/rdf/issues/3).

Changes are reflected in `foaf.ttl`

provenance_uris.py: This file is an addition to the existing file structure. It creates a new TTL file called prov.ttl, which contains provenance information about the origin of the datasets we use (EXIOBASE). It also contains provenance information about the version of arborist used to generate all TTL files.

The change is reflected in the creation of a new TTL file called `prov.ttl`

graph_common.py: Added a line of code to provide provenance membership information between a collection of entities and the entities themselves. This provides the means for us to track the lineage of individual FlowObjects, ActivityTypes, and Locations to their respective origin datasets.

All TTL files generated through graph_common receive provenance information from this change. As far as I know, this concerns `activitytype/exiobase3_3_17`, `flowobject/exiobase3_3_17`, and `location/exiobase3_3_17`.
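The membership link described for graph_common.py can be sketched roughly as follows. This is not the code from the pull request: it assumes the link is expressed with the PROV-O property `prov:hadMember`, and the dataset and flow object URIs are hypothetical placeholders. The sketch uses only the standard library and prints N-Triples rather than Turtle.

```python
# Sketch only: arborist's actual implementation may build these triples differently.
# Assumption: membership is expressed with prov:hadMember from PROV-O.
PROV_HAD_MEMBER = "http://www.w3.org/ns/prov#hadMember"


def membership_triples(collection_uri, member_uris):
    """Return one (collection, prov:hadMember, member) triple per member."""
    return [(collection_uri, PROV_HAD_MEMBER, member) for member in member_uris]


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical URIs, chosen only to illustrate the shape of the data.
dataset = "http://example.org/prov/dataset_exiobase3_3_17"
flow_objects = [
    "http://example.org/flowobject/fo_1",
    "http://example.org/flowobject/fo_2",
]

triples = membership_triples(dataset, flow_objects)
print(to_ntriples(triples))
```

With one such triple per entity, the origin dataset of any FlowObject, ActivityType, or Location can be found by following the membership link backwards.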

exiobase_metadata.py: Added a new section for provenance information. The addition only calls a method from provenance_uris; in itself, this does not change any TTL files.

When a new version of the RDF store is to be deployed, three values must be changed in the file "exiobase_metadata". In the function "generate_provenance_uris", the exiobase_version and the arborist_version must be updated. Furthermore, a tag for the new commit should be added; this tag is linked in the prov.ttl file (e.g., "https://github.com/BONSAMURAIS/arborist/tree/v0_3" for version 0_3).
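As a rough illustration of the three release values, the sketch below gathers them in one place. The helper `release_settings` and its dictionary keys are hypothetical and not part of arborist; it also assumes the tag URL can be derived from the arborist version, following the v0_3 example above.

```python
# Hypothetical helper, not part of arborist: collects the per-release values.
REPO_TREE = "https://github.com/BONSAMURAIS/arborist/tree"


def release_settings(exiobase_version, arborist_version):
    """Return the values that need updating for a new RDF store release."""
    return {
        "exiobase_version": exiobase_version,
        "arborist_version": arborist_version,
        # Assumption: the tag is named "v" + arborist_version, as in the example.
        "arborist_tag_url": f"{REPO_TREE}/v{arborist_version}",
    }


settings = release_settings("3_3_17", "0_3")
print(settings["arborist_tag_url"])
```

Deriving the URL from the version means only two values are edited by hand, but the tag itself still has to be created in the repository.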

We currently only add provenance for exiobase, not us_epa, entsoe etc.

bkuczenski commented 4 years ago

I like these changes. I only wonder about how to generalize them to other data sources. Is the argument list for generate_provenance_uris just going to grow and grow?

kuzeko commented 4 years ago

Well spotted @bkuczenski. That method should produce URIs for source datasets and for organizations (right, @IKnowLogic?). So we should probably have some default way to traverse the current directory tree and find the files containing all the metadata to be converted, so that the method takes no arguments but retrieves the information from those files. Then, when a new dataset arrives, we only need to add its metadata in a companion file. Could that work?
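A minimal sketch of that idea, under stated assumptions: each dataset folder holds a companion JSON file, here hypothetically named `provenance.json`, and a function with no dataset-specific arguments walks the tree and loads every one it finds. The file name, the JSON keys, and the function `collect_metadata` are illustrative only and do not exist in arborist.

```python
import json
import tempfile
from pathlib import Path


def collect_metadata(root, filename="provenance.json"):
    """Find every companion file named `filename` under `root` and load it."""
    records = []
    for path in sorted(Path(root).rglob(filename)):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        # Remember where each record came from, relative to the root.
        record["source_file"] = str(path.relative_to(root))
        records.append(record)
    return records


# Demo on a throwaway directory tree with two hypothetical dataset folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, version in [("exiobase", "3_3_17"), ("entsoe", "x_y")]:
        folder = root / name
        folder.mkdir()
        (folder / "provenance.json").write_text(
            json.dumps({"dataset": name, "version": version}), encoding="utf-8"
        )
    found = collect_metadata(root)

print([record["dataset"] for record in found])
```

Adding a dataset would then mean adding one folder with its companion file, with no change to the function's signature.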

kuzeko commented 4 years ago

Also, this pull request should address

https://github.com/BONSAMURAIS/BONSAI-ontology-RDF-framework/issues/5

IKnowLogic commented 4 years ago

You are absolutely right @kuzeko, the generate_provenance_uris method is for producing URIs for source datasets and organizations. Ideally it should take no arguments, but instead read a file containing metadata for the organizations and datasets we use.

Currently we only implement provenance for the entities (flows, activities, activity types, and locations) coming from the EXIOBASE dataset, so the file (along with the arguments) will only start to grow once we introduce more datasets, @bkuczenski.

This is but an initial implementation to give a proof of concept. The next iteration will take into account your considerations @kuzeko and @bkuczenski. I would be glad to help out with the implementation of the next version.