RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

Function `decide` checks only the first element of a list #152

Open mikechernev opened 2 years ago

mikechernev commented 2 years ago

Disclaimer

Hey, I am not sure if this is really a bug or me not understanding how the function works. I am also relatively new to RML, so please be gentle :)

Problem

When using idlab-fn:decide on list elements it only checks the first element and completely ignores the rest.

Steps to reproduce

  1. Iterate over a list of objects that have a property which is also a list. Example:

    {
      "people": [
        {
          "id": "UniqueID",
          "name": "Mike",
          "contacts": [
            {
              "type": "phone",
              "value": "123456790"
            },
            {
              "type": "email",
              "value": "mike@is.cool"
            }
          ]
        }
      ]
    }
  2. When iterating over the people, try to get the phone and the email based on the value of contacts[*].type. Example:

@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
@prefix fnml:   <http://semweb.mmlab.be/ns/fnml#> .
@prefix fno:    <https://w3id.org/function/ontology#> .
@prefix idlab-fn: <http://example.com/idlab/function/> .
@prefix mike: <http://example.com/ontology/mike/> .
@base <http://example.com/resource/> .

<Person>
  a rr:TriplesMap;
  rml:logicalSource [
    rml:source "./mike.json";
    rml:referenceFormulation ql:JSONPath;
    rml:iterator "$.people[*]"
  ];

  rr:subjectMap [
    rr:template "http://example.com/resource/entity/{.id}";
    rr:graphMap [ rr:constant "http://example/resource/person"]
  ];

  rr:predicateObjectMap[
    rr:predicate mike:name;
    rr:objectMap [ rml:reference ".name" ]
  ];

  rr:predicateObjectMap [
    rr:predicate mike:telephone;
    rr:objectMap [
      fnml:functionValue [
        rr:predicateObjectMap [
          rr:predicate fno:executes ;
          rr:objectMap [ rr:constant idlab-fn:decide ]
        ];
        rr:predicateObjectMap [
          rr:predicate idlab-fn:str ;
          rr:objectMap [ rml:reference ".contacts[*].type" ]
        ];
        rr:predicateObjectMap [
          rr:predicate idlab-fn:expectedStr ;
          rr:objectMap [ rr:constant "phone" ]
        ];
        rr:predicateObjectMap [
          rr:predicate idlab-fn:result ;
          rr:objectMap [ rml:reference ".contacts[*].value"  ]
        ];
      ] ;
    ]
  ];

  rr:predicateObjectMap [
    rr:predicate mike:email;
    rr:objectMap [
      fnml:functionValue [
        rr:predicateObjectMap [
          rr:predicate fno:executes ;
          rr:objectMap [ rr:constant idlab-fn:decide ]
        ];
        rr:predicateObjectMap [
          rr:predicate idlab-fn:str ;
          rr:objectMap [ rml:reference ".contacts[*].type" ]
        ];
        rr:predicateObjectMap [
          rr:predicate idlab-fn:expectedStr ;
          rr:objectMap [ rr:constant "email" ]
        ];
        rr:predicateObjectMap [
          rr:predicate idlab-fn:result ;
          rr:objectMap [ rml:reference ".contacts[*].value"  ]
        ];
      ] ;
    ]
  ];
.
  3. Execute the RML file using the Java mapper.

Expected result

<http://example.com/resource/entity/UniqueID> <http://example.com/ontology/mike/name> "Mike" <http://example/resource/person>.
<http://example.com/resource/entity/UniqueID> <http://example.com/ontology/mike/telephone> "123456790" <http://example/resource/person>.
<http://example.com/resource/entity/UniqueID> <http://example.com/ontology/mike/email> "mike@is.cool" <http://example/resource/person>.

Actual result

<http://example.com/resource/entity/UniqueID> <http://example.com/ontology/mike/name> "Mike" <http://example/resource/person>.
<http://example.com/resource/entity/UniqueID> <http://example.com/ontology/mike/telephone> "123456790" <http://example/resource/person>.

Conclusion

While debugging this I changed the order of the elements of the .people.contacts[] list and realised that if the email is the first element, it is matched but the phone is missed. To validate this further, I added an element with a completely different type as the first element of the list, and then neither the phone nor the email was matched. This leads me to believe that even though I am passing a list, decide only checks the first element and then stops.

End disclaimer

Maybe I am using decide wrong, or maybe I am not using the right function for the task I am trying to achieve. Any help will be greatly appreciated.

mikechernev commented 2 years ago

I did some digging in the code and it seems like this is the implemented behaviour https://github.com/RMLio/rmlmapper-java/blob/master/src%2Fmain%2Fjava%2Fbe%2Fugent%2Frml%2Ffunctions%2FFunctionModel.java#L93-L101 - any function that gets a list of elements will only use the first element of the provided list.

Would it make sense to change this so that the function is executed against every element instead?
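To make the observation above concrete, here is a minimal Python sketch of the behaviour as described in this thread. It is not the RMLMapper's Java code: `first_element` and `decide_first_only` are hypothetical names, and the only assumption is the one stated above, namely that a list argument is reduced to its first element before the function runs.

```python
# Illustrative model only: this is NOT the RMLMapper's Java code.
# It mimics the behaviour described above, where a function parameter
# that receives a list is reduced to its first element.

def first_element(values):
    """Collapse a list argument to its first element (None if empty)."""
    return values[0] if values else None


def decide(s, expected, result):
    """Return `result` when `s` equals `expected`, otherwise None."""
    return result if s == expected else None


def decide_first_only(types, expected, values):
    """Hypothetical helper: call decide after collapsing both list arguments."""
    return decide(first_element(types), expected, first_element(values))


# The contacts of the single person in mike.json.
contacts = [
    {"type": "phone", "value": "123456790"},
    {"type": "email", "value": "mike@is.cool"},
]
types = [c["type"] for c in contacts]
values = [c["value"] for c in contacts]

telephone = decide_first_only(types, "phone", values)  # first element matches
email = decide_first_only(types, "email", values)      # first element is "phone"
print(telephone)
print(email)
```

Under this model only the telephone triple can be produced, which matches the actual result reported above; swapping the two contacts would flip which one is found.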

bjdmeest commented 2 years ago

There are actually two things:

First, the decide function expects an rdf:string, not an array, so it would be better to create a new function that's a combination of the decide function and listContainsElement, or nest the listContainsElement in an if function

However, I'm not sure this would solve the actual issue: it's currently underspecified what to do with .contacts[*].type vs .contacts[*].value. I think it will not pairwise process type and value (which I assume is the expected behavior), but instead will process the type list and value list. This is, in fact, an open mapping challenge that can be handled by accessing the 'uniqueID' field in the $.people[*].contacts[*] iteration: https://github.com/kg-construct/mapping-challenges/issues/20
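The difference between list-wise and pairwise processing can be sketched in Python. This is only an illustration: `list_contains_element` and `if_then` are hypothetical stand-ins for the functions mentioned above, and it assumes each reference is handed to the function as a whole, independent list, which this thread suggests but does not confirm.

```python
# Illustrative sketch; the function names are hypothetical stand-ins and
# the list-wise hand-over of arguments is an assumption, not confirmed behaviour.

def list_contains_element(items, element):
    """True when `element` occurs anywhere in `items`."""
    return element in items


def if_then(condition, result):
    """Pass `result` through unchanged when `condition` holds, else None."""
    return result if condition else None


# What the two references would yield independently of each other.
types = ["phone", "email"]              # .contacts[*].type
values = ["123456790", "mike@is.cool"]  # .contacts[*].value

# List-wise: the condition is about the whole type list, so the whole
# value list comes back and the link between a type and its value is lost.
listwise = if_then(list_contains_element(types, "email"), values)

# Pairwise: each type is compared with the value at the same position.
pairwise = [v for t, v in zip(types, values) if t == "email"]

print(listwise)
print(pairwise)
```

The list-wise variant returns both contact values for the email predicate, while the pairwise variant returns only the email address, which is the expected behaviour in this issue.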

AFAICT, there is currently no way to solve your specific use case without preprocessing the input data. However, there's a slim chance that my first suggestion works out of the box as expected.

mikechernev commented 2 years ago

Thanks for the detailed explanation @bjdmeest. You are correct in your assumption about the mapping between the type and the value.

My assumption was that if a function accepts a string and I pass a list of strings, it would iterate over the list and execute the function for each element, similar to the way the mapping works when a reference is passed. Looking at the code, that would require changing the way functions are executed, and I'm not sure that's even possible.

That's why I already did what you initially suggested and created a new function which takes two lists (one to validate against and one to use for the result) and a string to match. That works perfectly for my use case, so I'll probably stick with it. (Please let me know if it makes sense to contribute this with a PR, since it's a very niche scenario that might not be valuable for anyone else.) Thanks again for all the help and the explanations :)

Cheers, Mike
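The custom function described in the last comment was not posted in this thread. As a rough idea of its shape, here is a minimal Python sketch of a function that takes two lists and a string to match. The name `decide_over_lists` is hypothetical, and the sketch assumes both lists are aligned by position, which only holds if every contact has both a type and a value.

```python
# Hypothetical sketch of a "decide over two lists" function; the actual
# implementation mentioned in the thread is not shown there.

def decide_over_lists(types, expected, values):
    """Return every value whose type at the same position equals `expected`."""
    if len(types) != len(values):
        # Misaligned lists would silently pair the wrong type and value.
        raise ValueError("types and values must have the same length")
    return [v for t, v in zip(types, values) if t == expected]


types = ["phone", "email"]
values = ["123456790", "mike@is.cool"]

print(decide_over_lists(types, "phone", values))
print(decide_over_lists(types, "email", values))
```

Returning a list rather than a single value keeps people with several phone numbers or email addresses from losing all but the first match.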