RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

columns with "" names in CSV tables cannot be mapped #179

Closed bblfish closed 1 year ago

bblfish commented 2 years ago

We have a table with the following structure:

,timestamp,data_source,index,modality,count,locationrange,speed,measurement_type,refRoadSegment
2880,2021-09-30 22:00:00+00:00,cropland,schoolstraat,,2865.9627,"POLYGON ((4.472119613057031 51.02073678178503, 4.479745501704901 51.02015981281664, 4.481188984378538 51.01915526700179, 4.481395603399528 51.01491440180675, 4.482127375233162 51.0126494593307, 4.478332541440828 51.00987874451822, 4.471306586318185 51.00931972134209, 4.469785136941451 51.01107192511326, 4.461581573935097 51.01126112376583, 4.460354540525603 51.01217411043578, 4.46032051949892 51.01700860830363, 4.468571374013287 51.01807235138044, 4.472119613057031 51.02073678178503))",,,
2881,2021-09-30 22:15:00+00:00,cropland,schoolstraat,,3788.4589,"POLYGON ((4.472119613057031 51.02073678178503, 4.479745501704901 51.02015981281664, 4.481188984378538 51.01915526700179, 4.481395603399528 51.01491440180675, 4.482127375233162 51.0126494593307, 4.478332541440828 51.00987874451822, 4.471306586318185 51.00931972134209, 4.469785136941451 51.01107192511326, 4.461581573935097 51.01126112376583, 4.460354540525603 51.01217411043578, 4.46032051949892 51.01700860830363, 4.468571374013287 51.01807235138044, 4.472119613057031 51.02073678178503))",,,
2882,2021-09-30 22:30:00+00:00,cropland,schoolstraat,,4004.3362,"POLYGON ((4.472119613057031 51.02073678178503, 4.479745501704901 51.02015981281664, 4.481188984378538 51.01915526700179, 4.481395603399528 51.01491440180675, 4.482127375233162 51.0126494593307, 4.478332541440828 51.00987874451822, 4.471306586318185 51.00931972134209, 4.469785136941451 51.01107192511326, 4.461581573935097 51.01126112376583, 4.460354540525603 51.01217411043578, 4.46032051949892 51.01700860830363, 4.468571374013287 51.01807235138044, 4.472119613057031 51.02073678178503))",,,

The name of the first column is missing, and the following mapping does not work:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix sosa: <http://www.w3.org/ns/sosa/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://example.com/#allDataMap> a rr:TriplesMap;
    rml:logicalSource [
        rml:source "data/all_data.csv" ;
        rml:referenceFormulation ql:CSV
    ];
    rr:subjectMap [
        rr:template "http://data.example.com/all#m_{}";
        rr:class sosa:Observation;
    ];
    rr:predicateObjectMap [
        rr:predicate sosa:resultTime;
        rr:objectMap [
            rml:reference "timestamp";
            rr:datatype xsd:string
        ]
    ] .

It seems to work if one gives the column a name consisting of a single white space and changes the template line to rr:template "http://data.example.com/all#m_{ }"; .

pheyvaer commented 2 years ago

Hi @bblfish

That is expected. In the first case the header is invalid; in the second case it is not.

bblfish commented 2 years ago

An empty header is invalid according to which spec?

I checked, and CSVW also seems to have difficulty with it, but there I can use {_row} as a counter for the row.

I don't really see why one could not just assume the column is named "", which is after all a String too...
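
To illustrate the point, here is a minimal Python sketch (not RMLMapper code) of a generic CSV parser exposing the unnamed column under the key "". The sample is abbreviated from the table above, and the helper name read_rows is made up for this example; it says nothing about how RMLMapper itself resolves references.

```python
import csv
import io

# Abbreviated sample modelled on the table above; the first header cell is empty.
SAMPLE = (
    ",timestamp,data_source,index\n"
    "2880,2021-09-30 22:00:00+00:00,cropland,schoolstraat\n"
    "2881,2021-09-30 22:15:00+00:00,cropland,schoolstraat\n"
)


def read_rows(text):
    """Parse CSV text into dicts keyed by header name, including the "" header."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE)
print(rows[0][""])  # value of the unnamed first column in the first data row
```

Here the empty string is simply one more dictionary key, which is the behaviour I would have expected from "{}" in the template.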

pheyvaer commented 2 years ago

Well, that's why CSVW was added: to deal with these types of cases. So can I assume that you have a solution if you got it working with CSVW?

bblfish commented 2 years ago

I got it to work with CSVW which I used beforehand using only CSVW tools. It was not yet clear to me that I could use that here too...

But csvw does not solve the problem either, as it only allows one to capture the number of the row, not the number in the first column of the first row, which could be useful for creating <#n{id}> urls to reference data later.

This is perhaps not a big deal. I was just wondering if you had a spec I could later refer the folks producing the CSV data to, so that they could change the column name, though we can probably also do that ourselves. (Even so, a justification for why we do that would be useful.)
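
For the "do that ourselves" route, a rough Python sketch of a preprocessing step that gives empty header cells a name before the file is handed to the mapper. The function name name_empty_headers and the placeholder "row_id" are hypothetical choices for this example, not anything defined by RML or CSVW.

```python
import csv
import io


def name_empty_headers(text, placeholder="row_id"):
    """Return CSV text whose empty header cells are replaced by a placeholder.

    `placeholder` is a hypothetical name; a numeric suffix is added from the
    second empty header onward so that the generated names stay unique.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text  # nothing to rename in an empty file

    header = []
    empty_seen = 0
    for cell in rows[0]:
        if cell.strip() == "":
            empty_seen += 1
            suffix = "" if empty_seen == 1 else f"_{empty_seen}"
            header.append(placeholder + suffix)
        else:
            header.append(cell)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])  # data rows are written back unchanged
    return out.getvalue()
```

With such a step in place, the template could presumably reference the renamed column, e.g. rr:template "http://data.example.com/all#m_{row_id}", assuming the placeholder chosen above.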

pheyvaer commented 2 years ago

Normally you should be able to make that work with CSVW. @DylanVanAssche Could you have a look at this?

DylanVanAssche commented 2 years ago

@pheyvaer AFAIK this is not possible: a column name is necessary for referencing. There's no way to refer to a certain column by number. I don't know how CSVW handles that.

bblfish commented 2 years ago

The discussion is going on here: https://github.com/kg-construct/rml-questions/discussions/25

It looks like we established that the restriction does not stem from the CSV RFC at least.

bblfish commented 2 years ago

I found a workaround. If I set csvw:null to the string consisting of only the ␀ character (of course the string "null" would also do), then the empty column can be accessed.

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@base <http://example.org/> .  ## see issue https://github.com/RMLio/rmlmapper-java/issues/178

@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix csvw: <http://www.w3.org/ns/csvw#> .

<#allDataMap>  a rr:TriplesMap;
    rml:logicalSource [
        rml:referenceFormulation ql:CSV ;
        rml:source  <#all_data.very_short.csv>;
    ] .

<#all_data.very_short.csv> a csvw:Table;
    csvw:url "data/all_data.very_short.csv" ;
    csvw:dialect [ a csvw:Dialect;
                   csvw:delimiter ","
                 ];
    csvw:null "␀".

$ rmlmapper -m awv.sources.ttl -s turtle -m column1.ttl | head
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sosa: <http://www.w3.org/ns/sosa/> .

_:b1485 a sosa:Observation .

_:b1486 a sosa:Observation .

_:b2880 a sosa:Observation .

_:b2881 a sosa:Observation .

For more details, see https://github.com/kg-construct/rml-questions/discussions/25#discussioncomment-3282206

DylanVanAssche commented 1 year ago

A workaround for this corner case has been found upstream. Can we close this issue?

bblfish commented 1 year ago

sure!