SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

Question about missing values #67

Closed markwilkinson closed 2 years ago

markwilkinson commented 2 years ago

Question, not a bug report:

Many of our datasets have missing values. I wasn't able to find a description of how this situation should be handled in the RML documentation, so I am wondering what decisions you made in SDM RDFizer. In particular, I am wondering whether it is possible to pass parameters into the transformation that let me choose the desired behavior. In almost all cases, if a value is missing from the data, I would prefer that no triples are created that involve that datapoint. That is, even if the datapoint is only used in the Object of the triple, I would want the RDFizer not to generate ANY portion of the triple: no S and no P either.

Is this possible?

Also, please feel free to add another H2020 project to your homepage! We are using SDM RDFizer in the Virtual Platform for the H2020 European Joint Programme on Rare Disease (https://www.ejprarediseases.org/)

Best wishes all!

Mark

eiglesias34 commented 2 years ago

Hello Mark,

As always, thank you for using the SDM-RDFizer. I hope you are doing well in this difficult time. The main philosophy that the SDM-RDFizer follows is that each RDF resource is generated independently, meaning that a triple is not generated unless the resources corresponding to its subject, predicate, and object are all created. So, if the value associated with an attribute is None/missing, the resource is not generated and, by extension, neither is the triple. I am well aware that tools like RMLMapper generate triples regardless of whether the value is present, but we do not.

Thank you for allowing us to add your project to the list of projects that use SDM-RDFizer.

Cheers,

Enrique

markwilkinson commented 2 years ago

Hi Enrique,

thank you for the rapid response!

Is that also true when the missing CSV value is only a component of the S, P, or O? For example:

http://my.server.org/PID{PID}/data

Here, a missing value in the PID column would result in a perfectly valid URI (http://my.server.org/PID/data), but one that is invalid with respect to my dataset. Would that URI be generated?

Mark

eiglesias34 commented 2 years ago

The SDM-RDFizer wouldn't generate the URL, even though the URL itself is valid, because it would be invalid with respect to the input dataset. The main idea behind the SDM-RDFizer is to generate a KG that represents as closely as possible what is established in the mapping and the raw data. Generating a URL when the value is missing would violate that.
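To make the behavior described above concrete, here is a minimal sketch of template expansion that refuses to produce a URL when a referenced column is missing. It is only an illustration and not SDM-RDFizer's actual code: the function name `expand_template` and the choice to treat `None` and the empty string as "missing" are assumptions made for this example, and IRI-safe encoding of values is left out.

```python
import re

# Matches {column} placeholders in a template string.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def expand_template(template, row):
    """Expand {column} placeholders using values from row.

    Returns None when any referenced column is absent or empty,
    so the caller can skip every triple that needs this term.
    (Hypothetical helper, not part of SDM-RDFizer.)
    """
    for column in PLACEHOLDER.findall(template):
        value = row.get(column)
        if value is None or value == "":
            return None
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)


template = "http://my.server.org/PID{PID}/data"
with_pid = expand_template(template, {"PID": "42"})
without_pid = expand_template(template, {"PID": ""})
```

With a PID present, the template expands normally. With an empty PID, the function returns `None` rather than the syntactically valid but meaningless http://my.server.org/PID/data.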

markwilkinson commented 2 years ago

Perfect! Thank you!

dachafra commented 2 years ago

@markwilkinson Indeed, SDM-RDFizer follows the RML spec. If something is not declared in that specification, my recommendation is usually to look at the R2RML one, because many behaviors are defined there.

For the case you are asking about, the behavior is defined in https://www.w3.org/TR/r2rml/#generated-triples. In more detail, the spec says:

Adding triples to the output dataset is a process that takes the following inputs:
- Subject, an IRI or blank node or empty
- Predicate, an IRI or empty
- Object, an RDF term or empty
- Target graphs, a set of zero or more IRIs

Execute the following steps:
- If Subject, Predicate or Object is empty, then abort these steps.
- ....

Remember that Subject, Predicate, and Object are generated by their corresponding "Maps", which are subclasses of TermMap. Regarding TermMap generation, the spec says in https://www.w3.org/TR/r2rml/#generated-rdf-term:

A term map is a function that generates an RDF term from a logical table row. The result of that function can be:
- Empty – if any of the referenced columns of the term map has a NULL value,
- ....

And we interpret empty cells as NULL values for CSV files.
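The two rules quoted above (a term map is empty when a referenced column is NULL, and a triple is dropped when any of its terms is empty) can be sketched in a few lines. This is an illustrative reading of the spec text, not SDM-RDFizer's implementation: the helper names, the sample CSV, and the predicate IRI `http://example.org/age` are all made up for the example, and `None` stands in for "empty".

```python
import csv
import io


def term_from_column(row, column):
    """Return the cell value, or None (empty term) if the cell is NULL.

    Empty CSV cells are treated as NULL, as described above.
    """
    value = row.get(column)
    return None if value is None or value == "" else value


def add_triple(output, subject, predicate, obj):
    """Append the triple unless subject, predicate, or object is empty."""
    if subject is None or predicate is None or obj is None:
        return  # abort: no part of the triple is emitted
    output.append((subject, predicate, obj))


# Hypothetical input: row 2 has no age, row 3 has no PID.
CSV_TEXT = "PID,age\n1,34\n2,\n,51\n"

triples = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    pid = term_from_column(row, "PID")
    subject = None if pid is None else f"http://my.server.org/PID{pid}/data"
    age = term_from_column(row, "age")
    add_triple(triples, subject, "http://example.org/age", age)
```

Only the first row yields a triple. The second row is skipped because its object is empty, and the third because its subject cannot be built.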

Finally, we would be really happy if you would consider adding your project's use case to the W3C CG on Knowledge Graph Construction. It would be very useful for us in developing the next generation of mapping languages: http://github.com/kg-construct/use-cases

mevs commented 2 years ago

Dear @markwilkinson

Formally, RDF is a semi-structured data model that naturally enables the representation of missing values. That is, if a resource X does not have a value for a property P, no RDF triple associates X with P. Of course, a missing value could be represented as a blank node. Still, a special marker would then be required to differentiate between missing values and the existential variables that blank nodes normally represent.

By respecting the definition of RDF, SDM-RDFizer does not generate properties with objects/subjects that correspond to missing values (or NULL). Thus, it implements precisely the semantics you are expecting, and there is no need to use any other mapping language to specify your transformation process.

Thank you for using the SDM-RDFizer.

Best regards, Maria-Esther