Unreproducible results for type values from CSV vocabularies with empty first column

schivmeister commented 8 months ago

Background

RMLMapper is used as part of another library by Meaningfy (essentially a wrapper around rmlmapper and other tools) to map OP TED notices from XML to RDF. However, the results of a transformation run by the mapping team at Meaningfy in July 2023 can no longer be reproduced in November 2023, despite using the same rmlmapper version, v6.1.3.

One potential regression is the appearance, in some of the data, of the properties epo:hasBuyerLegalType and epo:hasMainActivityType, whose values come from the corresponding object/value data vocabularies buyer_legal_type.csv and main_activity.csv, respectively. We are seeking help to determine the root cause of this behaviour.

Problem

Expected

No occurrence of epo:hasBuyerLegalType or epo:hasMainActivityType in the resulting RDF data, wherever there is no XML element mapping in the object/value reference data vocabulary (empty first column).

Actual

Occurrences of epo:hasBuyerLegalType and epo:hasMainActivityType in the resulting RDF data with unexpected values, wherever there is no XML element mapping in the object/value reference data vocabulary (empty first column).

Observations

It was later found that the issue occurs where the CSV vocabulary files cited above have an empty cell in the first column (no XML element, and therefore no mapping to be expected). Placing a hyphen - or a white space in the empty first cells appears to fix this. However, the need for a workaround is unexpected, as the previous transformation in July 2023 did not exhibit this behaviour, and there were no such occurrences. It is uncertain whether this relates in any way to #140.
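To make the workaround above concrete, here is a minimal Python sketch that finds rows with an empty first cell and replaces that cell with a hyphen. The column names and sample rows are hypothetical, invented for illustration; the actual layout of buyer_legal_type.csv and main_activity.csv may differ.

```python
import csv
import io

PLACEHOLDER = "-"  # workaround value mentioned in the observations


def find_empty_first_cells(rows):
    """Return 0-based indices of data rows whose first cell is empty."""
    return [i for i, row in enumerate(rows) if row and row[0] == ""]


def apply_placeholder(rows, placeholder=PLACEHOLDER):
    """Return a copy of rows with empty first cells replaced by the placeholder."""
    return [
        [placeholder] + row[1:] if row and row[0] == "" else list(row)
        for row in rows
    ]


# Hypothetical sample: the header names and values are assumptions,
# not the real content of the vocabulary files.
SAMPLE = """xml_value,uri
body-pl,http://example.org/buyer-legal-type/body-pl
,http://example.org/buyer-legal-type/unmapped
"""

header, *rows = list(csv.reader(io.StringIO(SAMPLE)))
empty = find_empty_first_cells(rows)
patched = apply_placeholder(rows)

print(empty)          # indices of rows that would need the workaround
print(patched[1][0])  # first cell of the patched row
```

For a real file, the same functions can be fed from `csv.reader(open(path, newline=""))`. This only reproduces the workaround; it does not explain why empty cells are no longer ignored.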

MWE

As the transformation involves multiple RML files/modules, and it is not useful to prepare a very minimal example without all the contextual data, a reproduction test suite (a mostly-minimal working example) is attached to this ticket. It also contains the MWE for another potential regression, #226, identified alongside this one.

mfy-rml-mwe.zip

DylanVanAssche commented 8 months ago

Hi @schivmeister,

Thanks for the detailed issue. You mention that both executions used the same version of the RMLMapper, so I'm not sure this is a bug in the RMLMapper; the same goes for #226. If the input data differed between the two executions, the results would indeed not be the same. Empty values should be ignored by RMLMapper.

schivmeister commented 8 months ago

Hi @DylanVanAssche thanks for looking! That's the thing - the input data is the same, the version is the same, but we are seeing different results! The attached MWEs show exactly this.

The expected result was the one we last generated. The MWE will produce new output that is different. So, we were wondering if you might have any clue as to what else it could be. It requires a bit of time investment in following the MWE.

DylanVanAssche commented 8 months ago

I already had a look at the MWE, but I fail to understand which data is the 'old' data and which is the 'new' data. I would expect the MWE to have 2 versions then, one from July and one from November? Maybe I missed it :)

schivmeister commented 8 months ago

@DylanVanAssche sorry about the confusion. The file expected.ttl is the "old" output from July. The rest of the files (the XML, RMLs, CSVs and JSONs) are all the original files used to generate that output TTL.

The purpose of the MWE is to generate the "new" output actual.ttl from exactly these old resources, so that the tester can follow and compare how it comes about, both with and without applying the discovered workarounds.

Given this context, let me know if the MWE is still hard to follow. We'll attempt to minimize whatever complexity still remains.
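As a rough aid for comparing expected.ttl with actual.ttl, the following Python sketch counts textual occurrences of the two properties in each output and reports where the counts differ. It is a plain string count, not an RDF-aware comparison, and it assumes the properties are serialized with the `epo:` prefix; the Turtle snippets are invented placeholders, not content from the MWE.

```python
PROPERTIES = ("epo:hasBuyerLegalType", "epo:hasMainActivityType")


def count_properties(turtle_text, properties=PROPERTIES):
    """Count textual occurrences of each prefixed property name."""
    return {p: turtle_text.count(p) for p in properties}


def compare_counts(expected_text, actual_text, properties=PROPERTIES):
    """Return {property: (expected_count, actual_count)} where the counts differ."""
    exp = count_properties(expected_text, properties)
    act = count_properties(actual_text, properties)
    return {p: (exp[p], act[p]) for p in properties if exp[p] != act[p]}


# Invented snippets standing in for the contents of expected.ttl and actual.ttl.
expected_text = "ex:org1 a org:Organization .\n"
actual_text = (
    "ex:org1 a org:Organization ;\n"
    "    epo:hasBuyerLegalType ex:unexpected .\n"
)

diff = compare_counts(expected_text, actual_text)
print(diff)
```

With the real files, the two strings could be read via `pathlib.Path("expected.ttl").read_text()` and `pathlib.Path("actual.ttl").read_text()`. A graph-level comparison with an RDF library would be more robust, since a textual count misses full IRIs and differing prefixes.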

schivmeister commented 8 months ago

We are in the process of preparing simpler MWEs for reporting the discovered causes as new tickets. Perhaps that will allow us to better comprehend these issues, and lead the way to finding the root cause of the behaviour described here (inability to reproduce specific prior results). We will start with the potential cause identified in #226, as that is more pressing at this time.

RMLio / rmlmapper-java