MI-FraunhoferIWM / data2rdf

About A generic pipeline that can be used to map raw data to RDF.
BSD 3-Clause "New" or "Revised" License

v2.1.0 #68

Open MBueschelberger opened 2 days ago

MBueschelberger commented 2 days ago

Previously, the mapping schema for individuals with custom relations was inefficient and highly repetitive whenever an individual needed, for example, multiple data properties from a data file.

In order to produce a graph like this...

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns1: <https://w3id.org/steel/ProcessOntology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix chameo: <https://w3id.org/emmo/domain/characterisation-methodology/chameo#> .
@prefix nanoindentation: <https://w3id.org/emmo/domain/domain-nanoindentation/nanoindentation#> .

nanoindentation:John a chameo:Operator ;
    foaf:age 32 ;
    foaf:name "John"^^xsd:string ;
    ns1:hasLaboratory 345 .

nanoindentation:Jane a chameo:Operator ;
    foaf:age 28 ;
    foaf:name "Jane"^^xsd:string ;
    ns1:hasLaboratory 123 .

... a mapping like this had to be applied:

[
      {
          "value_location": "data.name[0]",
          "value_relation": "http://xmlns.com/foaf/0.1/name",
          "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
          "suffix": "Operator1",
      },
      {
          "value_location": "data.age[0]",
          "value_relation": "http://xmlns.com/foaf/0.1/age",
          "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
          "suffix": "Operator1",
      },
      {
          "value_location": "data.lab_no[0]",
          "value_relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
          "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
          "suffix": "Operator1",
      },
      {
          "value_location": "data.name[1]",
          "value_relation": "http://xmlns.com/foaf/0.1/name",
          "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
          "suffix": "Operator2",
      },
      {
          "value_location": "data.age[1]",
          "value_relation": "http://xmlns.com/foaf/0.1/age",
          "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
          "suffix": "Operator2",
      },
      {
          "value_location": "data.lab_no[1]",
          "value_relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
          "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
          "suffix": "Operator2",
      },
  ]

... on a dataset shaped like this:

{
    "data": [
        {
            "name": "Jane",
            "age": 28,
            "lab_no": 123,
        },
        {
            "name": "John",
            "age": 32,
            "lab_no": 345,
        },
    ]
}
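
To make the repetition concrete: the old-style mapping above needs one entry per individual and per property, so it grows with the size of the dataset. A minimal sketch that generates it follows. The helper `legacy_mapping` and the constants are hypothetical names used only for illustration; they are not part of data2rdf.

```python
# Illustration only: builds the old-style mapping shown above with a loop.
# legacy_mapping, OPERATOR_IRI and RELATIONS are hypothetical names.
OPERATOR_IRI = "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator"

# Dataset key -> relation IRI, taken from the mapping above.
RELATIONS = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "age": "http://xmlns.com/foaf/0.1/age",
    "lab_no": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
}


def legacy_mapping(n_individuals: int) -> list[dict]:
    """Return one mapping entry per individual and per property."""
    return [
        {
            "value_location": f"data.{key}[{i}]",
            "value_relation": relation,
            "iri": OPERATOR_IRI,
            "suffix": f"Operator{i + 1}",
        }
        for i in range(n_individuals)
        for key, relation in RELATIONS.items()
    ]


entries = legacy_mapping(2)
print(len(entries))  # number of entries needed for two individuals
```

The number of entries is always the number of individuals times the number of properties, and the number of individuals has to be known when the mapping is written.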

With this PR, however, the schema can be simplified considerably:

[
    {
        "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
        "suffix": "name",
        "source": "data[*]",
        "suffix_from_location": True,
        "custom_relations": [
            {
                "object_location": "name",
                "relation": "http://xmlns.com/foaf/0.1/name",
            },
            {
                "object_location": "age",
                "relation": "http://xmlns.com/foaf/0.1/age",
            },
            {
                "object_location": "lab_no",
                "relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
            },
        ],
    }
]

Note that the dataset can now contain as many individuals as needed, since a wildcard (data[*]) can now be applied. When suffix_from_location is set to True, the suffix of the individual is also retrieved from the dataset. When it is set to False, the value provided under the suffix key is used as-is.

If source is set, object_location is treated as a path relative to each root object iterated over by data[*].

If source is not set, object_location is treated as an absolute path. The same rule applies to suffix when suffix_from_location is set to True.
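
The rules above can be sketched with plain dictionaries. This is not how data2rdf is implemented and it does not call data2rdf; the function `expand` is a hypothetical helper that only handles the literal source "data[*]" and simple key lookups, and it returns (suffix, relation, value) tuples rather than an RDF graph.

```python
# Illustration only: mimics the described semantics without data2rdf or JSONPath.
from typing import Any

dataset = {
    "data": [
        {"name": "Jane", "age": 28, "lab_no": 123},
        {"name": "John", "age": 32, "lab_no": 345},
    ]
}

mapping = {
    "iri": "https://w3id.org/emmo/domain/characterisation-methodology/chameo#Operator",
    "suffix": "name",
    "source": "data[*]",
    "suffix_from_location": True,
    "custom_relations": [
        {"object_location": "name", "relation": "http://xmlns.com/foaf/0.1/name"},
        {"object_location": "age", "relation": "http://xmlns.com/foaf/0.1/age"},
        {
            "object_location": "lab_no",
            "relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
        },
    ],
}


def expand(mapping: dict, dataset: dict) -> list[tuple[str, str, Any]]:
    """Hypothetical helper: resolve one mapping entry against the dataset."""
    if mapping.get("source") != "data[*]":
        raise ValueError("this sketch only handles source == 'data[*]'")
    triples = []
    for item in dataset["data"]:  # one root object per individual
        if mapping["suffix_from_location"]:
            suffix = item[mapping["suffix"]]  # looked up relative to the item
        else:
            suffix = mapping["suffix"]  # used as a fixed string
        for rel in mapping["custom_relations"]:
            # object_location is relative to the item because source is set
            triples.append((suffix, rel["relation"], item[rel["object_location"]]))
    return triples


for triple in expand(mapping, dataset):
    print(triple)
```

Adding a third operator to the dataset yields three more tuples without any change to the mapping.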

See the updated docs here: https://github.com/MI-FraunhoferIWM/data2rdf/blob/enh/mapping-for-multiple-individuals/docs/examples/abox/6_custom_relations.md

github-actions[bot] commented 2 days ago

Coverage

Coverage Report

File                    Stmts   Miss   Cover
data2rdf
   __init__.py              5      0    100%
   config.py               19      0    100%
   utils.py                33      5     85%
   warnings.py              2      0    100%
data2rdf/models
   __init__.py              3      0    100%
   base.py                 47      4     91%
   graph.py               150     35     77%
   mapping.py              40      1     98%
data2rdf/modes
   __init__.py              4      0    100%
data2rdf/parsers
   __init__.py              6      0    100%
   base.py                134     11     92%
   csv.py                 168     20     88%
   excel.py               175     17     90%
   json.py                188     29     85%
   utils.py                79     11     86%
data2rdf/pipelines
   __init__.py              2      0    100%
   main.py                 82      9     89%
data2rdf/qudt
   __init__.py              0      0    100%
   utils.py                42     12     71%
TOTAL                    1179    154     87%

Tests   Skipped   Failures   Errors   Time
114     0         0          0        2m 56s

yoavnash commented 2 days ago

Seems to make sense for JSON, but would that also work for CSV or Excel files? Is the old format still supported?

Kirankumaraswamy commented 2 days ago

Looks good to me. Do the changes also distinguish whether the object is going to be a literal or a URIRef object? For example, the data might have an attribute hasOrganization whose value is the IRI of a kitem.

MBueschelberger commented 2 days ago

> Seems to make sense for JSON, but would that also work for CSV or Excel files? Is the old format still supported?

It is also supported for Excel. However, the wildcard through source does not work there, since JSONPath cannot be applied to Excel files.

Implementing it for CSV is somewhat more complicated because the overall parser works differently. Hence, CSV is currently not supported.

The old schema is still supported. The only difference is that if custom_relations is set, the other fields, such as value_location and value_relation or unit_location and unit_relation, are disabled.
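
A mapping entry can be checked for this conflict before it is handed to the pipeline. The following is a sketch under assumptions, not data2rdf's own validation: `ignored_legacy_fields` is a hypothetical helper, and `LEGACY_FIELDS` contains only the four fields named above, so the real list may be longer.

```python
# Illustration only: a pre-check on a mapping entry, independent of data2rdf.
import warnings

# Assumption: only the four fields named in this thread are listed here.
LEGACY_FIELDS = ("value_location", "value_relation", "unit_location", "unit_relation")


def ignored_legacy_fields(entry: dict) -> list[str]:
    """Return the legacy fields that are present although custom_relations is set."""
    if not entry.get("custom_relations"):
        return []  # old schema: nothing is disabled
    ignored = [field for field in LEGACY_FIELDS if field in entry]
    if ignored:
        warnings.warn(
            "custom_relations is set; these fields are disabled: " + ", ".join(ignored)
        )
    return ignored
```

Calling it on an entry that combines custom_relations with value_location returns ["value_location"] and emits one warning; an old-style entry without custom_relations returns an empty list.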

MBueschelberger commented 2 days ago

> Looks good to me. Do the changes also distinguish whether the object is going to be a literal or a URIRef object? For example, the data might have an attribute hasOrganization whose value is the IRI of a kitem.

As mentioned in the docs linked above, you can set the XSD type with the object_data_type field:

...
            {
                "object_location": "lab_no",
                "relation": "https://w3id.org/steel/ProcessOntology/hasLaboratory",
                "object_data_type": "anyUri",
            },
...

yoavnash commented 2 days ago

Is it then the case that data2rdf throws an error or a warning when a user tries it in a way that is not supported?

MBueschelberger commented 2 days ago

> Is it then the case that data2rdf throws an error or a warning when a user tries it in a way that is not supported?

Yes it does!

https://github.com/MI-FraunhoferIWM/data2rdf/blob/c24164dfa6f5d9da7f097118a448965c8112d17c/data2rdf/models/mapping.py#L116