v2.1.0 - Githubissues

MBueschelberger commented 2 days ago

Previously, the mapping schema for individuals with custom relations was inefficient and highly repetitive whenever an individual needed, for example, multiple data properties from a data file.

In order to produce a graph like this...

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns1: <https://w3id.org/steel/ProcessOntology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix chameo: <https://w3id.org/emmo/domain/characterisation-methodology/chameo#> .
@prefix nanoindentation: <https://w3id.org/emmo/domain/domain-nanoindentation/nanoindentation#> .

nanoindentation:John a chameo:Operator ;
    foaf:age 32 ;
    foaf:name "John"^^xsd:string ;
    ns1:hasLaboratory 345 .

nanoindentation:Jane a chameo:Operator ;
    foaf:age 28 ;
    foaf:name "Jane"^^xsd:string ;
    ns1:hasLaboratory 123 .

... a mapping like this had to be applied:

[
    {
        "value_location": "data[0].name",
        "value_relation": "http://xmlns.com/foaf/0.1/name",
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "Operator1",
    },
    {
        "value_location": "data[0].age",
        "value_relation": "http://xmlns.com/foaf/0.1/age",
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "Operator1",
    },
    {
        "value_location": "data[0].lab_no",
        "value_relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "Operator1",
    },
    {
        "value_location": "data[1].name",
        "value_relation": "http://xmlns.com/foaf/0.1/name",
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "Operator2",
    },
    {
        "value_location": "data[1].age",
        "value_relation": "http://xmlns.com/foaf/0.1/age",
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "Operator2",
    },
    {
        "value_location": "data[1].lab_no",
        "value_relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "Operator2",
    },
]

... on a dataset shaped like this:

{
    "data": [
        {
            "name": "Jane",
            "age": 28,
            "lab_no": 123,
        },
        {
            "name": "John",
            "age": 32,
            "lab_no": 345,
        },
    ]
}
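To make the repetition concrete, here is a minimal sketch that generates the old-style mapping for a dataset of this shape. It is illustrative only and not part of data2rdf: `build_legacy_mapping` and `RELATIONS` are hypothetical names, and the `data[i].key` path format is an assumption based on the dataset being a list of objects.

```python
# Illustrative only: builds an old-style mapping for a dataset shaped like the one above.
OPERATOR_IRI = (
    "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator"
)

# Hypothetical lookup: property key in the dataset -> relation IRI.
RELATIONS = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "age": "http://xmlns.com/foaf/0.1/age",
    "lab_no": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
}


def build_legacy_mapping(n_individuals):
    """Return one mapping entry per (individual, property) pair."""
    mapping = []
    for index in range(n_individuals):
        for key, relation in RELATIONS.items():
            mapping.append(
                {
                    # Assumed path format: absolute, with an explicit index.
                    "value_location": f"data[{index}].{key}",
                    "value_relation": relation,
                    "iri": OPERATOR_IRI,
                    "suffix": f"Operator{index + 1}",
                }
            )
    return mapping


legacy_mapping = build_legacy_mapping(2)
print(len(legacy_mapping))  # 2 individuals x 3 properties = 6 entries
```

The number of entries grows with individuals times properties, and the mapping has to be regenerated whenever the dataset gains a record.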

However, with this PR, the schema can now be simplified:

[
    {
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "name",
        "source": "data[*]",
        "suffix_from_location": True,
        "custom_relations": [
            {
                "object_location": "name",
                "relation": "http://xmlns.com/foaf/0.1/name",
            },
            {
                "object_location": "age",
                "relation": "http://xmlns.com/foaf/0.1/age",
            },
            {
                "object_location": "lab_no",
                "relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
            },
        ],
    }
]

Please note that the dataset can now contain as many individuals as needed, since a wildcard (data[*]) can be applied. The suffix of the individual is also retrieved from the dataset when suffix_from_location is set to True. If it is set to False, the value provided under the suffix key is used as is.

If source is set, object_location is treated as a path relative to each root object matched by the source expression (here data[*]).

If source is not set, object_location is treated as an absolute path. The same applies to suffix when suffix_from_location is set to True.
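The resolution rules above can be sketched in plain Python. This is only an approximation under stated assumptions, not how data2rdf implements it: `resolve` and `expand` are hypothetical helpers, only a trailing `[*]` wildcard is handled, and paths are limited to keys and integer indexes rather than full JSONPath.

```python
import re

# Matches either a key (no dots or brackets) or an integer index like [0].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(obj, path):
    """Follow a simple path such as 'name' or 'data[0].name'."""
    for key, index in _TOKEN.findall(path):
        obj = obj[key] if key else obj[int(index)]
    return obj


def expand(entry, dataset):
    """Turn one mapping entry into (suffix, relation, value) tuples."""
    source = entry.get("source")
    if source is None:
        # No source: every location is an absolute path from the document root.
        roots = [dataset]
    elif source.endswith("[*]"):
        # Wildcard source: locations are relative to each matched item.
        roots = resolve(dataset, source[: -len("[*]")])
    else:
        raise ValueError("this sketch only supports a trailing [*] wildcard")

    triples = []
    for root in roots:
        if entry.get("suffix_from_location"):
            suffix = resolve(root, entry["suffix"])  # look the suffix up in the data
        else:
            suffix = entry["suffix"]  # use the literal value of the suffix key
        for relation in entry["custom_relations"]:
            value = resolve(root, relation["object_location"])
            triples.append((suffix, relation["relation"], value))
    return triples


dataset = {
    "data": [
        {"name": "Jane", "age": 28, "lab_no": 123},
        {"name": "John", "age": 32, "lab_no": 345},
    ]
}

mapping_entry = {
    "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
    "suffix": "name",
    "source": "data[*]",
    "suffix_from_location": True,
    "custom_relations": [
        {"object_location": "name", "relation": "http://xmlns.com/foaf/0.1/name"},
        {"object_location": "age", "relation": "http://xmlns.com/foaf/0.1/age"},
        {
            "object_location": "lab_no",
            "relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
        },
    ],
}

for triple in expand(mapping_entry, dataset):
    print(triple)
```

With the wildcard, one mapping entry covers every record in the list. Without source, the same entry would need absolute locations such as `data[1].age` and would describe a single individual.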

See the updated docs here: https://github.com/MI-FraunhoferIWM/data2rdf/blob/enh/mapping-for-multiple-individuals/docs/examples/abox/6_custom_relations.md

github-actions[bot] commented 2 days ago

Coverage Report

File	Stmts	Miss	Cover
data2rdf/__init__.py	5	0	100%
data2rdf/config.py	19	0	100%
data2rdf/utils.py	33	5	85%
data2rdf/warnings.py	2	0	100%
data2rdf/models/__init__.py	3	0	100%
data2rdf/models/base.py	47	4	91%
data2rdf/models/graph.py	150	35	77%
data2rdf/models/mapping.py	40	1	98%
data2rdf/modes/__init__.py	4	0	100%
data2rdf/parsers/__init__.py	6	0	100%
data2rdf/parsers/base.py	134	11	92%
data2rdf/parsers/csv.py	168	20	88%
data2rdf/parsers/excel.py	175	17	90%
data2rdf/parsers/json.py	188	29	85%
data2rdf/parsers/utils.py	79	11	86%
data2rdf/pipelines/__init__.py	2	0	100%
data2rdf/pipelines/main.py	82	9	89%
data2rdf/qudt/__init__.py	0	0	100%
data2rdf/qudt/utils.py	42	12	71%
TOTAL	1179	154	87%

Tests	Skipped	Failures	Errors	Time
114	0	0	0	2m 56s

yoavnash commented 2 days ago

Seems to make sense for JSON, but would that also work for CSV or Excel files? Is the old format still supported?

Kirankumaraswamy commented 2 days ago

Looks good to me. Do the changes also distinguish whether the object is going to be a literal or a URIRef? For example, the data might have an attribute hasOrganization whose value is the IRI of a kitem.

MBueschelberger commented 2 days ago

> Seems to make sense for JSON, but would that also work for CSV or Excel files? Is the old format still supported?

It is also supported for Excel. However, the wildcard through source does not work there, since JSONPath cannot be applied to Excel files.

Implementing it for CSV is more complicated because the overall parser works differently. Hence, CSV is currently not supported.

The old schema is still supported. The only difference is that if custom_relations is set, the other fields like value_location and value_relation, unit_location and unit_relation are disabled.

MBueschelberger commented 2 days ago

> Looks good to me. Do the changes also distinguish whether the object is going to be a literal or a URIRef? For example, the data might have an attribute hasOrganization whose value is the IRI of a kitem.

As mentioned in the docs linked above, you can set the XSD type with the object_data_type field:

...
            {
                "object_location": "lab_no",
                "relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
                "object_data_type": "anyUri",
            },
...
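For readers less familiar with the distinction raised in the question, the sketch below contrasts a typed literal with an IRI node in Turtle syntax. It is a generic RDF illustration, not data2rdf code: `turtle_literal`, `turtle_iri` and the `XSD_TYPES` lookup are hypothetical, the example URL is made up, and how data2rdf itself treats a value typed this way is not stated in this thread.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical, case-insensitive lookup so that "anyUri" resolves to xsd:anyURI.
XSD_TYPES = {"anyuri": "anyURI", "string": "string", "integer": "integer"}


def turtle_literal(value, object_data_type=None):
    """Render a value as a Turtle literal, typed if a data type is given."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    if object_data_type is None:
        return f'"{escaped}"'
    local_name = XSD_TYPES[object_data_type.lower()]
    return f'"{escaped}"^^<{XSD}{local_name}>'


def turtle_iri(value):
    """Render a value as an IRI node, i.e. a resource rather than a literal."""
    return f"<{value}>"


lab = "https://example.org/lab/123"  # made-up value for illustration
print(turtle_literal(lab, "anyUri"))  # a literal whose datatype is xsd:anyURI
print(turtle_iri(lab))                # a node that other triples can describe
```

A literal typed as xsd:anyURI is still a literal in RDF terms, so it cannot be the subject of further triples, whereas an IRI node can.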

yoavnash commented 2 days ago

> > Seems to make sense for JSON, but would that also work for CSV or Excel files? Is the old format still supported?
>
> It is also supported for Excel. However, the wildcard through source does not work there, since JSONPath cannot be applied to Excel files.
>
> Implementing it for CSV is more complicated because the overall parser works differently. Hence, CSV is currently not supported.
>
> The old schema is still supported. The only difference is that if custom_relations is set, the other fields like value_location and value_relation, unit_location and unit_relation are disabled.

Is it then the case that data2rdf throws an error or a warning when a user tries it in a way that is not supported?

MBueschelberger commented 2 days ago

> Is it then the case that data2rdf throws an error or a warning when a user tries it in a way that is not supported?

Yes it does!

https://github.com/MI-FraunhoferIWM/data2rdf/blob/c24164dfa6f5d9da7f097118a448965c8112d17c/data2rdf/models/mapping.py#L116
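The linked check is not reproduced here. As a rough illustration of the kind of validation discussed in this thread, the sketch below warns when custom_relations is combined with the older fields and then disables them. Everything in it is an assumption: `MappingEntry` is a hypothetical stdlib dataclass rather than the project's model, and whether data2rdf raises an error or emits a warning in a given case should be checked against the linked source.

```python
import warnings
from dataclasses import dataclass, field
from typing import Optional

# Fields that the thread says are disabled once custom_relations is set.
LEGACY_FIELDS = ("value_location", "value_relation", "unit_location", "unit_relation")


@dataclass
class MappingEntry:
    """Hypothetical stand-in for one mapping entry."""

    iri: str
    suffix: str
    value_location: Optional[str] = None
    value_relation: Optional[str] = None
    unit_location: Optional[str] = None
    unit_relation: Optional[str] = None
    custom_relations: list = field(default_factory=list)

    def __post_init__(self):
        if not self.custom_relations:
            return  # old schema: nothing to check
        ignored = [name for name in LEGACY_FIELDS if getattr(self, name) is not None]
        if ignored:
            # Assumption: a warning; the real check may raise an error instead.
            warnings.warn(
                f"custom_relations is set; ignoring {', '.join(ignored)}",
                UserWarning,
            )
            for name in ignored:
                setattr(self, name, None)


# An old-style entry passes through unchanged.
legacy_entry = MappingEntry(
    iri="https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
    suffix="Operator1",
    value_location="data[0].name",
    value_relation="http://xmlns.com/foaf/0.1/name",
)
print(legacy_entry.value_location)
```

Constructing an entry with both custom_relations and value_location would emit the warning and reset value_location to None.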

MI-FraunhoferIWM / data2rdf

v2.1.0 #68