RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

CSV: add support for csvw:null #138

Closed paulmillar closed 2 years ago

paulmillar commented 2 years ago

Motivation:

In tabular data representations, one problem is how to represent the absence of information; for example, if a field does not apply to all rows, what value should be placed in a cell where the information is either missing or not applicable?

A common solution is to use a place-holder value that represents the absence of information. Examples of such place-holder values include the empty string (""), a dash ("-") or a phrase (or abbreviation thereof) such as "N.A.".

It would be useful if such place-holder values were identified as such and RMLMapper refrained from making any corresponding assertions.

The CSVW namespace [1] provides metadata describing a CSV file. Within RMLMapper, this may be used to configure the CSV parser. One feature of CSVW is its ability to describe how certain values correspond to the null value. This is useful as RMLMapper will not include triples where the object value is null.

Although RMLMapper provides partial support for CSVW, this currently lacks support for csvw:null assertions.

[1] https://www.w3.org/ns/csvw

Modification:

Add limited support for csvw:null assertions.

CSVW potentially allows multiple csvw:null assertions; however, the Commons CSV parser supports only a single null String (see [2]). Therefore, support for csvw:null is only partial.

[2] https://issues.apache.org/jira/browse/CSV-293

Result:

A CSV-backed mapping may now be defined with a specific string identified as a place-holder indicating missing information. If a cell contains the place-holder value then any corresponding assertions are suppressed. This is achieved using the csvw:null assertion; however, please note that current support is limited to a single csvw:null assertion; any subsequent assertions are silently ignored.

DylanVanAssche commented 2 years ago

Hi!

Thanks for your MR! It seems we had similar ideas :) Support for this was merged internally into the development branch yesterday, including support for multiple NULL strings. It will be available in the next release. In the meantime, you can try it from the development branch on GitHub.

Please let us know if that fixes the issue you're having ;)

paulmillar commented 2 years ago

Indeed, it's certainly encouraging that we had similar ideas. I'll definitely need to switch to the development branch!

Also, I noticed that my patch contains a mistake: the code looks for csvw:null under the csvw:Dialect, rather than the csvw:Table :see_no_evil:

My use-case is actually different. I would like to suppress specific assertions associated with the cell containing the null String value.

Here is an example:

"id";"acronym";"status";"nature";"rcn"
"824064";"ESCAPE";"SIGNED";"";"219246"

The first row is the header, the second row is the actual data. In this data, not all rows have a nature field and, for those with an empty string, I would like RMLMapper to suppress the corresponding assertions, rather than making assertions using the empty string value.

However, I do still want to generate an instance for this row, with all the non-empty cells contributing assertions.

Commit d14ac9d4 does something different. It suppresses the entire row if any of the cells contains the csvw:null value.

I can imagine this could be useful under certain circumstances; however, it's a different use-case from mine.

Going back to the definition of csvw:null:

> An atomic property giving the string or strings used for null values within the data. If the string value of the cell is equal to any one of these values, the cell value is null.

Note it says cell value here. It also doesn't say what semantics a null cell value carries. Therefore, I'd say that RMLMapper is free to react to a cell value being null in whichever way it chooses.

So, perhaps this behaviour could be configured?

For example, RMLMapper could react to null cell values by rejecting the specific assertions (my use-case), or by rejecting the entire row (your use-case, I guess).

Would that sound like a reasonable approach?
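
To make the two behaviours discussed above concrete, here is a minimal sketch in plain Java. It is not RMLMapper code: the class and method names (`NullCellSketch`, `skipNullCells`, `skipNullRows`) are hypothetical, and a `column=value` string stands in for a generated assertion. It only illustrates the difference between dropping the assertion for a null cell and dropping the whole row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of RMLMapper.
final class NullCellSketch {

    /** Cell-level semantics: a null cell only suppresses the assertion for that cell. */
    static List<String> skipNullCells(Map<String, String> row, Set<String> nullStrings) {
        List<String> assertions = new ArrayList<>();
        for (Map.Entry<String, String> cell : row.entrySet()) {
            if (nullStrings.contains(cell.getValue())) {
                continue; // the cell value is null: make no assertion for this column
            }
            assertions.add(cell.getKey() + "=" + cell.getValue());
        }
        return assertions;
    }

    /** Row-level semantics: a single null cell suppresses every assertion for the row. */
    static List<String> skipNullRows(Map<String, String> row, Set<String> nullStrings) {
        for (String value : row.values()) {
            if (nullStrings.contains(value)) {
                return List.of(); // the whole row is dropped
            }
        }
        return skipNullCells(row, nullStrings);
    }

    public static void main(String[] args) {
        // The example row from this comment, with the empty string acting as csvw:null.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "824064");
        row.put("acronym", "ESCAPE");
        row.put("status", "SIGNED");
        row.put("nature", "");
        row.put("rcn", "219246");

        System.out.println(skipNullCells(row, Set.of("")));
        System.out.println(skipNullRows(row, Set.of("")));
    }
}
```

With the example row, the cell-level variant keeps the four non-empty columns and omits only `nature`, whereas the row-level variant yields nothing at all.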

DylanVanAssche commented 2 years ago

@paulmillar Oh good catch!

> However, I do still want to generate an instance for this row, with all the non-empty cells contributing assertions.
>
> Commit d14ac9d does something different. It suppresses the entire row if any of the cells contains the csvw:null value.
>
> I can imagine this could be useful under certain circumstances; however, it's a different use-case from mine.
>
> Going back to the definition of csvw:null:
>
> > An atomic property giving the string or strings used for null values within the data. If the string value of the cell is equal to any one of these values, the cell value is null.
>
> Note it says cell value here. It also doesn't say what semantics a null cell value carries. Therefore, I'd say that RMLMapper is free to react to a cell value being null in whichever way it chooses.
>
> So, perhaps this behaviour could be configured?

This doesn't need to be configurable, as the current behavior doesn't match what csvw:null is intended to do :) It should behave as you said: ignore the cell instead of the whole row. Would you like to hack on this, as you have a nice use case? Or shall I create an internal issue so that somebody can have a look?

paulmillar commented 2 years ago

Hmmm...

I don't know if I'm missing something here but, looking at the functional test RMLTC1002a_null-CSVW, it seems like skipping the row is the behaviour @winniederidder intended.

Perhaps we should come up with a consensus view on what effect csvw:null should have, just to make sure everyone's happy.

In terms of hacking on this, yes, I'd be happy to; however, I can only work on this in my spare time, which is (at the moment) very limited and unpredictable. So, I wouldn't want to promise anything!

winniederidder commented 2 years ago

@DylanVanAssche Was the intended behaviour not to ignore the entire row? Otherwise, the column containing a null value should be set to null, which of course isn't too difficult and can easily be added. But ignoring the row is how I understood the original issue, at least.

DylanVanAssche commented 2 years ago

@winniederidder I misread the original issue when checking the MR.

> Else the column containing a null value should be set to null, which of course isn't too difficult and can easily be added.

Let's change the behavior to this :) We can use withNullString() to set the NULL string to the first value of csvw:null and, before processing the CSV file, replace all other NULL values provided by csvw:null with that first value.
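
The normalisation idea above could look roughly like the following sketch. It is plain Java with no Commons CSV dependency, so the single-null-string step that `withNullString()` would perform is imitated by a hypothetical `toNullable` helper. The sketch also works on already-split cell values rather than on the raw file text, which is an assumption made here to keep it short; all names (`NullStringNormalizer`, `normalize`, `toNullable`) are illustrative and not taken from RMLMapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the actual RMLMapper implementation.
final class NullStringNormalizer {

    /** Rewrites every cell that equals any csvw:null value to the first csvw:null value. */
    static List<String> normalize(List<String> cells, List<String> nullStrings) {
        if (nullStrings.isEmpty()) {
            return new ArrayList<>(cells); // nothing is declared as null
        }
        String canonical = nullStrings.get(0);
        List<String> result = new ArrayList<>(cells.size());
        for (String cell : cells) {
            result.add(nullStrings.contains(cell) ? canonical : cell);
        }
        return result;
    }

    /** Stand-in for a parser that knows only one null string: matching cells become null. */
    static List<String> toNullable(List<String> cells, String nullString) {
        List<String> result = new ArrayList<>(cells.size());
        for (String cell : cells) {
            result.add(cell.equals(nullString) ? null : cell);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> nullStrings = List.of("", "-", "N.A.");
        List<String> cells = List.of("ESCAPE", "-", "N.A.", "");

        List<String> normalized = normalize(cells, nullStrings);
        System.out.println(toNullable(normalized, nullStrings.get(0)));
    }
}
```

One design note: comparing whole cell values, as done here, avoids accidentally rewriting a place-holder such as "-" where it occurs inside a longer value, which a plain text replacement over the file would risk.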

DylanVanAssche commented 2 years ago

@paulmillar Don't worry about it ;) We will fix this. I just wanted to avoid that we both do the same work.

paulmillar commented 2 years ago

OK, thanks.

DylanVanAssche commented 2 years ago

@paulmillar We pushed some new commits to development on GitHub. Feel free to check them out and let us know if your issue is resolved :)

paulmillar commented 2 years ago

Hi @DylanVanAssche ,

I've checked the development branch and it works perfectly for me.

Thanks again!

DylanVanAssche commented 2 years ago

Awesome! Will be available in the next release then :)