SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

not all TripleMaps are processed for each row #24

Closed VladimirAlexiev closed 4 years ago

VladimirAlexiev commented 4 years ago

I have a moderately sized table of 1.344M rows and 23 fields. The fields are mapped to 15 nodes and 33 triples. The mapping looks like this:

<person/(customer_id)!map>
        a                      rr:TriplesMap ;
        rml:logicalSource      [ rr:tableName  "_final.customer" ] ;
        rr:predicateObjectMap  <person/(customer_id)!hasAddress!address> , <person/(customer_id)!hasEvent!birth> , <person/(customer_id)!firstName!(first_name)> , <person/(customer_id)!religion!(religion)> , <person/(customer_id)!lastName!(last_name)> , ## total 23 ;
        rr:subjectMap          <person/(customer_id)!subj> .

<person/(customer_id)/birth!map>
        a                      rr:TriplesMap ;
        rml:logicalSource      [ rr:tableName  "_final.customer" ] ;
        rr:predicateObjectMap  <person/(customer_id)/birth!hasDate!(date_of_birth)> ;
        rr:subjectMap          <person/(customer_id)/birth!subj> .

<person/(customer_id)/email!map>
        a                      rr:TriplesMap ;
        rml:logicalSource      [ rr:tableName  "_final.customer" ] ;
        rr:predicateObjectMap  <person/(customer_id)/email!value!(email_address)> ;
        rr:subjectMap          <person/(customer_id)/email!subj> .
... ## total 14 TriplesMap, 358 rml triples

The mapping is generated from a semantic model using my tool rdf2rml, which is why it uses several rml:logicalSource blank nodes rather than a single shared one.

Your tool generates nearly all triples from birth!map, then 8 triples from email!map, and then quits. This is on Postgres (I'll run it again to check).

Even if it "rewound" the database (reran the select * query) to process all maps in sequence, that's considerably slower than iterating each row once and processing all maps on that row (which would cause all triples for one customer to be emitted together).

I'll try replacing the blank nodes with a single rml:logicalSource and see if that helps.

VladimirAlexiev commented 4 years ago

Using a single logicalSource like this doesn't help:

<customer!source> a rml:BaseSource;
  rr:tableName  "_final.customer".

<person/(customer_id)!map>
   rml:logicalSource <customer!source>.

The abort after 8 email triples is caused by #25, so I am hopeful that once that is fixed it will process all triples. Still, processing all maps per row would be faster.

eiglesias34 commented 4 years ago

Hopefully, solving issue #25 will also solve this problem. Thank you for the suggestion regarding execution per row. The RDFizer works under the assumption that each individual triples map has a different logical source. What you are recommending could be useful if we instead assume that the triples maps share the same source.

VladimirAlexiev commented 4 years ago

@eiglesias34 I think you don't need to assume:

In either case, you can process such sources once. Here's some pseudo-code:

  1. Group sources by equivalence
  2. Group TripleMaps by the use of equivalent sources
  3. Iterate over TripleMap Groups (tmg)
    1. Evaluate the shared source of tmg
    2. Iterate over each source row
      1. Iterate over TripleMaps (tm) in the current tmg
        1. Produce all triples for tm and the current row
        2. Output all these triples
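
The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not how the RDFizer is implemented: `TriplesMap`, `source_key`, `group_by_source`, `generate` and `read_rows` are hypothetical names, a logical source is reduced to a plain dict of its properties, and two sources count as equivalent when all their properties match (so separate blank nodes with the same rr:tableName fall into one group).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, NamedTuple, Tuple

Row = Dict[str, object]
Triple = Tuple[str, str, str]
SourceKey = Tuple[Tuple[str, str], ...]


def source_key(props: Dict[str, str]) -> SourceKey:
    """Step 1: canonical key for a logical source (assumption: equal properties = same source)."""
    return tuple(sorted(props.items()))


class TriplesMap(NamedTuple):
    """Hypothetical, heavily simplified stand-in for an rr:TriplesMap."""
    name: str
    source: SourceKey                       # key of its logical source
    produce: Callable[[Row], List[Triple]]  # all triples this map yields for one row


def group_by_source(maps: Iterable[TriplesMap]) -> Dict[SourceKey, List[TriplesMap]]:
    """Step 2: group triples maps that use equivalent sources."""
    groups: Dict[SourceKey, List[TriplesMap]] = defaultdict(list)
    for tm in maps:
        groups[tm.source].append(tm)
    return groups


def generate(maps: Iterable[TriplesMap],
             read_rows: Callable[[SourceKey], Iterable[Row]]) -> Iterator[Triple]:
    """Step 3: evaluate each shared source once and apply every map in the group to each row."""
    for key, tmg in group_by_source(maps).items():
        for row in read_rows(key):           # one scan per group, not one per map
            for tm in tmg:
                yield from tm.produce(row)   # triples for one row are emitted together


# Demo with an in-memory stand-in for the "_final.customer" table (made-up rows).
TABLES = {"_final.customer": [
    {"customer_id": 1, "first_name": "Ann", "date_of_birth": "1990-01-01"},
    {"customer_id": 2, "first_name": "Bob", "date_of_birth": "1985-05-05"},
]}
scans: List[str] = []  # records how many times a table is read


def read_rows(key: SourceKey) -> Iterable[Row]:
    table = dict(key)["rr:tableName"]
    scans.append(table)
    return iter(TABLES[table])


customer = source_key({"rr:tableName": "_final.customer"})
person_map = TriplesMap("person", customer, lambda r: [
    (f"person/{r['customer_id']}", "firstName", str(r["first_name"]))])
birth_map = TriplesMap("birth", customer, lambda r: [
    (f"person/{r['customer_id']}/birth", "hasDate", str(r["date_of_birth"]))])

triples = list(generate([person_map, birth_map], read_rows))
```

In this sketch both maps share one source key, so the table is read once and the person and birth triples for each customer come out next to each other. Maps with genuinely different sources simply end up in groups of size one, so no assumption about shared or distinct sources is needed.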

Cheers!

eiglesias34 commented 4 years ago

Thank you very much for the suggestion. I will take this into consideration for the next release of the RDFizer.

dachafra commented 4 years ago

It seems that this issue is solved, closing...