OpenEnergyPlatform / oekg

Repository for the Open Energy Knowledge Graph (OEKG)
Creative Commons Zero v1.0 Universal

Pipeline for automated KG generation #13

Open stap-m opened 1 year ago

stap-m commented 1 year ago

In the internal OvGU meeting with @adelmemariani, @fabianneuhaus and myself, we developed a workflow for automated KG generation. The task now is to establish the basic pipeline for this KG so that a first version can be created. Semantic enrichment etc. should not be considered at this stage and will be addressed later.

[Figure: KG and workflow graphic]

fabianneuhaus commented 1 year ago

Just a minor comment concerning the RDF pattern in the diagram. I think it is unnecessarily complicated. I would suggest that the pattern should be along the following lines.

    oekg:scenario123 a oeo:Scenario .
    oekg:scenario123 xyz:has_IRI <address of website on OEP> .
    oekg:scenario123 xyz:has_record oekg:table456 .
    oekg:table456 a xyz:Table .
    oekg:table456 xyz:has_IRI <address of website on OEP> .
    oekg:table456 is about oeo:entity .

I am not sure about the oeo:Scenario, xyz:has_record and xyz:Table entities. Firstly, are the tables associated with a scenario or a scenario projection? Secondly, depending on the answer to the first question, we need a relation that links it to an information entity, namely a table. It is probably a good idea to look at the OBI to see whether we can reuse a relation and a class from there. But regardless of whether we use oeo:Scenario, xyz:has_record and xyz:Table or some other IRIs, the pattern should be correct.

EDIT: Included the lines connecting scenario and table to the OEP. I am not sure which ontology term to use for xyz:has_IRI.
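To make the suggested pattern concrete, here is a minimal Python sketch that writes out the six triples for one scenario/table pair as Turtle-style lines. It is only an illustration: the function name, the example.org addresses and the `xyz:is_about` spelling are assumptions, and the `xyz:` terms are the same placeholders used in the comment above, not settled ontology terms.

```python
def scenario_table_triples(scenario_id, table_id, scenario_url, table_url, about_terms):
    """Return Turtle-style triple lines for one scenario/table pair.

    All xyz: terms are placeholders from the discussion, not real IRIs.
    """
    scenario = f"oekg:{scenario_id}"
    table = f"oekg:{table_id}"
    lines = [
        f"{scenario} a oeo:Scenario .",
        f"{scenario} xyz:has_IRI <{scenario_url}> .",
        f"{scenario} xyz:has_record {table} .",
        f"{table} a xyz:Table .",
        f"{table} xyz:has_IRI <{table_url}> .",
    ]
    # One is-about triple per OEO entity the table refers to.
    lines += [f"{table} xyz:is_about {term} ." for term in about_terms]
    return lines


if __name__ == "__main__":
    # Hypothetical addresses; the real ones would point to pages on the OEP.
    for line in scenario_table_triples(
        "scenario123",
        "table456",
        "https://example.org/scenario/123",
        "https://example.org/table/456",
        ["oeo:OEO_00050006"],
    ):
        print(line)
```

Prefix declarations are left out on purpose, since the namespaces behind `xyz:` are still open in this discussion.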

adelmemariani commented 1 year ago

Sometimes, datasets contain scenarios:

[Screenshot 2022-10-11 at 21:41:41]

[Screenshot 2022-10-11 at 21:51:59]

Also, a scenario usually has many datasets (as input: assumptions, model parameters, ...; as output: projections). This makes it difficult to build a pipeline. Besides, dataset values are not easily mappable to OEO concepts because users chose vague and abbreviated terms.

stap-m commented 1 year ago

> Firstly, are the tables associated with a scenario or a scenario projection?

Yes. Currently the connection between tables and scenarios works mainly via the tags in the scenario schema, but in the future this link has to (also) be made via the factsheets/bundles.

> Sometimes, datasets contain scenarios:

That means that there are tables that are used in more than one scenario. But that should be no problem, as long as the assignment also exists outside the tables, right?

adelmemariani commented 1 year ago

> That means that there are tables that are used in more than one scenario. But that should be no problem, as long as the assignment also exists outside the tables, right?

That is also my question: whether or not we have an explicit connection (usable via APIs) between the scenario and its datasets. But 'tags' work for filtering in this case.
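As a rough illustration of the tag-based filtering mentioned here, the following Python sketch selects the tables that carry a given scenario tag. The metadata layout (a list of dicts with `name` and `tags` keys) and the table names are assumptions made for the example, not the actual OEP metadata format or API.

```python
def tables_for_scenario(tables, scenario_tag):
    """Return the names of all tables tagged with the given scenario."""
    return [t["name"] for t in tables if scenario_tag in t.get("tags", ())]


if __name__ == "__main__":
    # Hypothetical table metadata; table_b is used in two scenarios.
    tables = [
        {"name": "table_a", "tags": ["KS_2050"]},
        {"name": "table_b", "tags": ["KS_2050", "other_scenario"]},
        {"name": "table_c"},
    ]
    print(tables_for_scenario(tables, "KS_2050"))
```

Because the assignment lives in the tags and not in the table content, a table that belongs to more than one scenario simply shows up in more than one result list.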

fabianneuhaus commented 1 year ago

> That means that there are tables that are used in more than one scenario. But that should be no problem, as long as the assignment also exists outside the tables, right?

No, it should be no problem. At least not for the "dumb and dirty" approach that we are currently following. Our approach consists of going through the content of all tables that are associated with a scenario projection. If an entry is either an OEO term or has been annotated by a third party with an OEO term, we use it as the object in an is-about triple. If it is something else, we try to automatically match it to an OEO term. (In the first version by simple string matching; at some later stage we can improve that by using more sophisticated approaches.) Since the names of scenarios won't be in the OEO, table entries that contain names of other scenarios won't be matched and will thus be ignored. That's ok. Actually, I expect that most of the terms won't be automatically matchable to something in the OEO, even if we use very sophisticated methods.
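The approach described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual pipeline script: the function name, the `oeo_labels` lookup (lower-cased label to IRI) and the `annotations` mapping for third-party annotations are assumptions made for the example.

```python
def is_about_terms(entries, oeo_labels, annotations=None):
    """Split table entries into matched OEO IRIs and unassignable terms.

    oeo_labels:  dict mapping lower-cased OEO labels to IRIs (assumed layout).
    annotations: optional dict mapping entries to OEO IRIs assigned by a
                 third party (assumed layout).
    """
    annotations = annotations or {}
    known_iris = set(oeo_labels.values())
    matched, unassignable = [], []
    for entry in entries:
        text = str(entry).strip()
        if text in known_iris:
            iri = text                      # entry already is an OEO term
        elif text in annotations:
            iri = annotations[text]         # annotated with an OEO term
        else:
            iri = oeo_labels.get(text.lower())  # simple string matching
        if iri is None:
            unassignable.append(text)       # e.g. scenario names
        elif iri not in matched:
            matched.append(iri)
    return matched, unassignable


if __name__ == "__main__":
    labels = {"petajoule": "oeo:OEO_00050006"}
    entries = ["petajoule", "oeo:OEO_00050006", "KS_2050", "PJ"]
    print(is_about_terms(entries, labels))
```

With only labels in the lookup, an abbreviation such as "PJ" ends up among the unassignable terms, just like a scenario name does.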

adelmemariani commented 1 year ago

As a first step, the following 'dumb and dirty' versions are the results of a pipeline based on simple 'string matching' between values in the tables and OEO concepts:

With IRIs: https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/Dummy_OEKG_With_Senario_Datasets.ttl

With labels: https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/Dummy_OEKG_With_Senario_Datasets_With_Labels.ttl

stap-m commented 1 year ago

The following is the list of 'not assignable terms' for datasets that belong to KS_2050: https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/not_assignables.txt

Thanks @adelmemariani . Let's continue the discussion here.

Does your script consider synonyms and alternative terms that are given in the OEO? I'm wondering why PJ wasn't found. It is annotated as an exact synonym of petajoule (OEO_00050006).

adelmemariani commented 1 year ago

> Does your script consider synonyms and alternative terms that are given in the OEO? I'm wondering why PJ wasn't found. It is annotated as an exact synonym of petajoule (OEO_00050006).

😮 My script was not aware of 'synonyms' so far. Thanks @stap-m. I will work on it...

adelmemariani commented 1 year ago

By considering the 'has exact synonym' relations, 'PJ' is now mapped to 'petajoule' and is no longer in the list of unassignable terms: https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/Dummy_OEKG_With_Senario_Datasets_With_Labels_And_IRIs.ttl#L376

The overall result would be much better if we had synonyms for the other unassignable terms.
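One way to picture the change is a lookup index that contains the exact synonyms next to the labels. The sketch below is only an assumption about how such an index could look; the term layout (dicts with `iri`, `label` and `exact_synonyms` keys) and the function name are hypothetical and not taken from the actual script.

```python
def build_label_index(terms):
    """Map every lower-cased label and exact synonym to its term IRI."""
    index = {}
    for term in terms:
        names = [term["label"], *term.get("exact_synonyms", [])]
        for name in names:
            # setdefault keeps the first IRI if two terms share a name.
            index.setdefault(name.strip().lower(), term["iri"])
    return index


if __name__ == "__main__":
    terms = [
        {"iri": "oeo:OEO_00050006", "label": "petajoule", "exact_synonyms": ["PJ"]},
    ]
    index = build_label_index(terms)
    print(index.get("PJ".lower()))
```

Lower-casing keeps the matching simple, but it is a trade-off: unit symbols are case-sensitive in principle, so a stricter variant might compare abbreviations without changing their case.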

stap-m commented 1 year ago

> 😮 My script was not aware of 'synonyms' so far. Thanks @stap-m. I will work on it...

Actually, we agreed on using 'alternative term' instead of synonyms, but apparently there are still some artifacts...