SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

Joins crash the SDM-RDFizer with multiple relations #95

Closed DylanVanAssche closed 1 year ago

DylanVanAssche commented 1 year ago

Describe the bug

Joins work in 4.6.3.2 for 1-1 relations, but they fail when multiple relations are involved (1-N or N-1). I observed this with MySQL, but I suggest also testing the fix with PostgreSQL. Thanks! I attached the data: data.zip

ERROR

Traceback (most recent call last):
File "//sdm-rdfizer/rdfizer/run_rdfizer.py", line 3, in <module>
Semantifying out...
TM: http://ex.com/#TriplesMap1
semantify(str(sys.argv[1]))
File "/sdm-rdfizer/rdfizer/rdfizer/semantify.py", line 4514, in semantify
number_triple += executor.submit(semantify_postgres, row, row_headers, triples_map, triples_map_list, output_file_descriptor,config[dataset_i]["user"], config[dataset_i]["password"], config[dataset_i]["db"], config[dataset_i]["host"]).result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/sdm-rdfizer/rdfizer/rdfizer/semantify.py", line 3771, in semantify_postgres
hash_maker_array(cursor, triples_map_element, predicate_object_map.object_map)
File "/sdm-rdfizer/rdfizer/rdfizer/semantify.py", line 510, in hash_maker_array
element =row[row_headers.index(child_object.parent[0])]
ValueError: 'p1' is not in list

MAPPING

@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix ex: <http://example.com/> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<http://ex.com/#TriplesMap1> a rr:TriplesMap ;
    rml:logicalSource [ a rml:LogicalSource ;
            rml:source [ a d2rq:Database ;
                    d2rq:jdbcDSN "jdbc:mysql://MySQL:3306/db" ;
                    d2rq:jdbcDriver "jdbc:mysql" ;
                    d2rq:password "root" ;
                    d2rq:username "root" ] ;
            rr:sqlVersion rr:SQL2008 ;
            rr:tableName "data1" ] ;
    rr:predicateObjectMap [ a rr:PredicateObjectMap ;
            rr:objectMap [ a rr:ReferenceObjectMap ;
                    rr:joinCondition [ a rr:JoinCondition ;
                            rr:child "p1" ;
                            rr:parent "p1" ] ;
                    rr:parentTriplesMap <http://ex.com/#TriplesMap2> ] ;
            rr:predicateMap [ a rr:PredicateMap ;
                    rr:constant ex:j1 ] ] ;
    rr:subjectMap [ rr:template "http://ex.com/table1/{id}" ] .

<http://ex.com/#TriplesMap2> a rr:TriplesMap ;
    rml:logicalSource [ a rml:LogicalSource ;
            rml:source [ a d2rq:Database ;
                    d2rq:jdbcDSN "jdbc:mysql://MySQL:3306/db" ;
                    d2rq:jdbcDriver "jdbc:mysql" ;
                    d2rq:password "root" ;
                    d2rq:username "root" ] ;
            rr:sqlVersion rr:SQL2008 ;
            rr:tableName "data2" ] ;
    rr:subjectMap [ rr:template "http://ex.com/table2/{id}" ] .

To Reproduce

Steps to reproduce the behavior (and resources):

  1. Run SDM-RDFizer on the attached data with the mapping above
  2. The join fails because the engine cannot find the column to join on

Expected behavior

Joins always work ;)
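As a rough illustration of the expected result, the join condition in the mapping above should yield one `ex:j1` triple for every pair of `data1` and `data2` rows whose `p1` values match, so a 1-N relation produces several triples for the same subject. The sketch below is not SDM-RDFizer code: the sample rows and the `join_triples` helper are hypothetical (the contents of data.zip are not shown here), and only the templates, the predicate, and the `p1` join column come from the mapping.

```python
from collections import defaultdict

# Hypothetical rows; the real contents of data.zip are not shown in this issue.
data1 = [{"id": 1, "p1": "a"}, {"id": 2, "p1": "b"}]
data2 = [{"id": 10, "p1": "a"}, {"id": 11, "p1": "a"}, {"id": 12, "p1": "b"}]


def join_triples(child_rows, parent_rows, child_col, parent_col):
    """Hash join: index parent rows by join value, then probe with each child row."""
    parents_by_value = defaultdict(list)
    for parent in parent_rows:
        parents_by_value[parent[parent_col]].append(parent)

    triples = []
    for child in child_rows:
        # A child row may match zero, one, or many parent rows (1-N).
        for parent in parents_by_value.get(child[child_col], []):
            triples.append((
                f"<http://ex.com/table1/{child['id']}>",   # subject template of TriplesMap1
                "<http://example.com/j1>",                  # ex:j1
                f"<http://ex.com/table2/{parent['id']}>",  # subject template of TriplesMap2
            ))
    return triples


triples = join_triples(data1, data2, child_col="p1", parent_col="p1")
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

With these invented rows, the first `data1` row matches two `data2` rows, which is the 1-N case that triggered the crash.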

eiglesias34 commented 1 year ago

Hi @DylanVanAssche,

I found the problem and fixed it. When a triples map has no query of its own, I generate a SQL query from it that extracts only the data the triples map needs. For the parent triples map in the example, that generated query lacked the column used in the join condition, so when the code that executes the join received the data, it could not find the corresponding value.

Thank you again,
Enrique
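The failure mode described above can be sketched roughly as follows. This is a minimal illustration, not the actual SDM-RDFizer implementation: `template_columns` and `build_query` are hypothetical helpers, the `data2` rows are invented, and SQLite stands in for MySQL so the example runs without a server. It shows how a query that projects only the template columns drops `p1`, and how adding the `rr:parent` column to the projection avoids the `ValueError`.

```python
import re
import sqlite3


def template_columns(template):
    """Return the column names referenced as {name} in an R2RML template."""
    return re.findall(r"\{([^}]+)\}", template)


def build_query(table, template, join_columns=()):
    """Build a SELECT that projects only the columns the triples map needs.

    join_columns lists columns referenced by rr:parent in join conditions.
    """
    columns = list(dict.fromkeys(template_columns(template) + list(join_columns)))
    return f"SELECT {', '.join(columns)} FROM {table}"


# Hypothetical stand-in for the data2 table; the real data is in data.zip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data2 (id INTEGER, p1 TEXT)")
conn.executemany("INSERT INTO data2 VALUES (?, ?)", [(10, "a"), (11, "a"), (12, "b")])

template = "http://ex.com/table2/{id}"

# Projection without the join column: the header lookup fails.
buggy_query = build_query("data2", template)
cursor = conn.execute(buggy_query)
buggy_headers = [col[0] for col in cursor.description]
try:
    buggy_headers.index("p1")
    lookup_failed = False
except ValueError:
    lookup_failed = True

# Projection that also includes the rr:parent column: the lookup succeeds.
fixed_query = build_query("data2", template, join_columns=["p1"])
cursor = conn.execute(fixed_query)
fixed_headers = [col[0] for col in cursor.description]
join_values = [row[fixed_headers.index("p1")] for row in cursor]
```

The header lookup mirrors the `row_headers.index(...)` call in the traceback; everything else is an assumption made for illustration.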

DylanVanAssche commented 1 year ago

Hi @eiglesias34!

Thanks for getting back to me :) I tried 4.6.3.3 but I still have the issue with the same dataset and mapping:

Semantifying out...
TM: http://ex.com/#TriplesMap1
Traceback (most recent call last):
File "//sdm-rdfizer/rdfizer/run_rdfizer.py", line 3, in <module>
semantify(str(sys.argv[1]))
File "/sdm-rdfizer/rdfizer/rdfizer/semantify.py", line 4498, in semantify
number_triple += executor.submit(semantify_mysql, row, row_headers, triples_map, triples_map_list, output_file_descriptor, config[dataset_i]["host"], int(config[dataset_i]["port"]), config[dataset_i]["user"], config[dataset_i]["password"],config[dataset_i]["db"]).result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/sdm-rdfizer/rdfizer/rdfizer/semantify.py", line 3104, in semantify_mysql
hash_maker_array(cursor, triples_map_element, predicate_object_map.object_map)
File "/sdm-rdfizer/rdfizer/rdfizer/semantify.py", line 510, in hash_maker_array
element =row[row_headers.index(child_object.parent[0])]
ValueError: 'p1' is not in list

DylanVanAssche commented 1 year ago

4.6.3.4 works :) Thanks!

eiglesias34 commented 1 year ago

No problem.