IKnowLogic closed 4 years ago
I like these changes. I only wonder about how to generalize them to other data sources. Is the argument list for generate_provenance_uris just going to grow and grow?
Well spotted, @bkuczenski. That method should produce URIs for source datasets and for organizations (right, @IKnowLogic?). So we should probably have some default way to traverse the current directory tree and find the files containing all the metadata to be converted. That way the method takes no arguments and instead retrieves the information from those files. When a new dataset arrives, we then only need to add its metadata in a companion file. Could that work?
Also, this pull request should address https://github.com/BONSAMURAIS/BONSAI-ontology-RDF-framework/issues/5
You are absolutely right, @kuzeko: the generate_provenance_uris method is for producing URIs for source datasets and organizations. Ideally it should take no arguments and instead traverse a file containing metadata for the organizations and the datasets we use.
Currently we only implement provenance for the entities (flows, activities, activity types, and locations) coming from the Exiobase dataset, so the file (along with the arguments) will only start to grow once we introduce more datasets, @bkuczenski.
This is but an initial implementation to give a proof of concept. The next iteration will take into account your considerations @kuzeko and @bkuczenski. I would be glad to help out with the implementation of the next version.
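To make the companion-file idea from this thread a bit more concrete, here is a minimal sketch of how the metadata could be collected without passing arguments for each dataset. It is only an illustration: the helper name load_provenance_metadata, the companion file name provenance.json, and its keys (dataset, version, organization) are all assumptions and do not come from the arborist code base.

```python
import json
from pathlib import Path


def load_provenance_metadata(root="."):
    """Collect dataset/organization metadata from companion files under root.

    Assumption: every dataset ships a companion file named 'provenance.json'
    with the keys 'dataset', 'version' and 'organization'. Both the file name
    and the keys are hypothetical and only serve this sketch.
    """
    entries = []
    # Walk the whole directory tree so a new dataset only needs a new file.
    for path in sorted(Path(root).rglob("provenance.json")):
        with path.open(encoding="utf-8") as handle:
            meta = json.load(handle)
        entries.append(
            {
                "dataset": meta["dataset"],
                "version": meta["version"],
                "organization": meta["organization"],
                "source_file": str(path),
            }
        )
    return entries
```

With something like this, generate_provenance_uris could call the helper with its default root and loop over the returned entries, so adding a dataset would mean adding one companion file rather than another argument.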
The main reason for the pull request is to add provenance information to the flowObjects, activityTypes, and locations extracted from the exiobase dataset. Broadly speaking, provenance is used to register the origin of digital artifacts. It has many uses, such as determining ownership of and rights over artifacts, whether artifacts can be trusted, whether correct methods have been used to obtain a result, and how an artifact was produced. It is important to note that this version of provenance is only a temporary first solution to the problem of missing provenance information. When the time comes, it needs to be extended to cover multiple datasets.
All changes are in the Arborist script, as provenance is best captured during data extraction from the original datasets. The following explains which Python files have been changed and how those changes are reflected in the Turtle files.
foaf.py: Added the EXIOBASE consortium as an organization. Also added meta-information about the file itself, in accordance with Error 3 of the RDF repo (https://github.com/BONSAMURAIS/rdf/issues/3).
provenance_uris.py: This file is an addition to the existing file structure. It creates a new TTL file called prov.ttl, which includes provenance information for the origin of the datasets we use (EXIOBASE). It also contains provenance information about the version of arborist used to generate all TTL files.
graph_common.py: Added a line of code to provide provenance membership information between a collection of entities and the entities themselves. This gives us the means to trace the lineage of individual FlowObjects, ActivityTypes, and Locations back to their respective origin datasets.
exiobase_metadata.py: Added a new section for provenance information. The addition only executes a method from provenance_uris; in itself, this does not change any TTL files.
When a new version of the RDF store is to be deployed, three values must be changed in the file "exiobase_metadata". In the function "generate_provenance_uris", the exiobase_version and the arborist_version must be updated. Furthermore, a tag for the new commit should be added, which is linked in the prov.ttl file (e.g., "https://github.com/BONSAMURAIS/arborist/tree/v0_3" for version 0_3).
We currently only add provenance for exiobase, not us_epa, entsoe, etc.
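As a rough illustration of the release values and the membership link described above, the sketch below builds the arborist tag URL from a version string and writes a few Turtle lines using prov:Collection and prov:hadMember from PROV-O. It uses plain string formatting rather than any RDF library, and it is not the code in provenance_uris.py or graph_common.py: the function names, the example.org URIs, and the choice of prov:SoftwareAgent for the arborist tag are assumptions made for this example.

```python
PROV_PREFIX = "@prefix prov: <http://www.w3.org/ns/prov#> ."


def arborist_tag_url(arborist_version):
    """Build the tag URL following the pattern quoted in the description,
    e.g. '0_3' -> https://github.com/BONSAMURAIS/arborist/tree/v0_3."""
    return f"https://github.com/BONSAMURAIS/arborist/tree/v{arborist_version}"


def provenance_turtle(exiobase_version, arborist_version, member_uris):
    """Return Turtle text linking a dataset collection to its member entities."""
    # Hypothetical URI for the dataset collection; the real scheme may differ.
    dataset_uri = f"http://example.org/prov/exiobase_{exiobase_version}"
    lines = [
        PROV_PREFIX,
        "",
        # Assumption: the tagged arborist release is modelled as a software agent.
        f"<{arborist_tag_url(arborist_version)}> a prov:SoftwareAgent .",
        "",
    ]
    if member_uris:
        lines.append(f"<{dataset_uri}> a prov:Collection ;")
        # One prov:hadMember predicate with a comma-separated object list.
        members = " ,\n        ".join(f"<{uri}>" for uri in member_uris)
        lines.append(f"    prov:hadMember {members} .")
    else:
        lines.append(f"<{dataset_uri}> a prov:Collection .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder versions and member URIs, for illustration only.
    print(
        provenance_turtle(
            exiobase_version="x_y_z",
            arborist_version="0_3",
            member_uris=["http://example.org/flow/1", "http://example.org/flow/2"],
        )
    )
```

Keeping the two version strings as the only inputs mirrors the release step: bumping them changes both the collection URI and the linked tag in one place.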