SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

multi level join not working in xml conversion #111

Closed manikyab closed 3 months ago

manikyab commented 7 months ago

When running an RML conversion for an XML file with a multi-level join condition, a KeyError is raised.

TTL file `
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix ex: <http://example.com/ns#>.
@base <http://example.com/ns#>.

<#TransportMapping> a rr:TriplesMap;
  rml:logicalSource [
    rml:source "sample.xml" ;
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/transport/bus"
  ];
  rr:subjectMap [
    rr:template "http://trans.example.com/bus/{@id}";
    rr:class ex:Transport ;
  ];

  rr:predicateObjectMap [
    rr:predicate ex:type ;
    rr:objectMap [ rr:template "http://trans.example.com/vehicle/{@type}"; ]
  ];

  rr:predicateObjectMap [
    rr:predicate ex:stop;
    rr:objectMap [
      rr:parentTriplesMap <#StopMapping> ;
      rr:joinCondition [ rr:child "@id"; rr:parent "../../@id"; ]
    ]
  ].

<#StopMapping> a rr:TriplesMap;
  rml:logicalSource [
    rml:source "sample.xml" ;
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/transport/bus/route/stop"
  ];
  rr:subjectMap [
    rr:template "http://trans.example.com/stop/{@id}";
    rr:class ex:Stop
  ];
  rr:predicateObjectMap [
    rr:predicate ex:stop;
    rr:objectMap [ rml:reference "@id"; rr:datatype xsd:int ]
  ];
  rr:predicateObjectMap [
    rr:predicate ex:stopLabel;
    rr:objectMap [ rml:reference "."; ]
  ].
`

XML File `

Airport Conference center false Central Park Conference center true

`

Expected behavior

Output:

<http://trans.example.com/bus/25> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Transport>.
<http://trans.example.com/bus/25> <http://example.com/ns#type> <http://trans.example.com/vehicle/SingleDecker>.
<http://trans.example.com/bus/25> <http://example.com/ns#stop> <http://trans.example.com/stop/645>.
<http://trans.example.com/bus/25> <http://example.com/ns#stop> <http://trans.example.com/stop/651>.
<http://trans.example.com/bus/47> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Transport>.
<http://trans.example.com/bus/47> <http://example.com/ns#type> <http://trans.example.com/vehicle/DoubleDecker>.
<http://trans.example.com/bus/47> <http://example.com/ns#stop> <http://trans.example.com/stop/873>.
<http://trans.example.com/bus/47> <http://example.com/ns#stop> <http://trans.example.com/stop/651>.
<http://trans.example.com/stop/645> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Stop>.
<http://trans.example.com/stop/645> <http://example.com/ns#stop> "645"^^<http://www.w3.org/2001/XMLSchema#int>.
<http://trans.example.com/stop/645> <http://example.com/ns#stopLabel> "Airport".
<http://trans.example.com/stop/651> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Stop>.
<http://trans.example.com/stop/651> <http://example.com/ns#stop> "651"^^<http://www.w3.org/2001/XMLSchema#int>.
<http://trans.example.com/stop/651> <http://example.com/ns#stopLabel> "Conference center".
<http://trans.example.com/stop/873> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Stop>.
<http://trans.example.com/stop/873> <http://example.com/ns#stop> "873"^^<http://www.w3.org/2001/XMLSchema#int>.
<http://trans.example.com/stop/873> <http://example.com/ns#stopLabel> "Central Park".
<http://trans.example.com/stop/651> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Stop>.
<http://trans.example.com/stop/651> <http://example.com/ns#stop> "651"^^<http://www.w3.org/2001/XMLSchema#int>.
<http://trans.example.com/stop/651> <http://example.com/ns#stopLabel> "Conference center".
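The `ex:stop` links in the expected output follow from the join condition: each bus (child key `@id`) is linked to every stop whose parent key `../../@id` has the same value. The traceback mentions a `hash_table`, so a hash join seems a reasonable way to picture this. The following is only an illustrative sketch, not SDM-RDFizer's code; the row values are taken from the expected output above, and the function and variable names are hypothetical.

```python
from collections import defaultdict

BUS = "http://trans.example.com/bus/{}"
STOP = "http://trans.example.com/stop/{}"
EX_STOP = "http://example.com/ns#stop"

# Child rows (iterator /transport/bus): the value of rr:child "@id".
bus_rows = ["25", "47"]
# Parent rows (iterator /transport/bus/route/stop):
# (stop @id, value of rr:parent "../../@id"), as implied by the expected output.
stop_rows = [("645", "25"), ("651", "25"), ("873", "47"), ("651", "47")]


def join_triples(bus_rows, stop_rows):
    """Hash join: index parent subjects by join key, then probe with child keys."""
    table = defaultdict(list)
    for stop_id, bus_key in stop_rows:
        table[bus_key].append(STOP.format(stop_id))

    triples = []
    for bus_id in bus_rows:
        for stop_iri in table.get(bus_id, []):
            triples.append(f"<{BUS.format(bus_id)}> <{EX_STOP}> <{stop_iri}>.")
    return triples


links = join_triples(bus_rows, stop_rows)
for line in links:
    print(line)
```

If the parent key cannot be evaluated, the table stays empty and no `ex:stop` triples are produced, while the class and `ex:type` triples are unaffected.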

Error in rdfizer `Exception has occurred: KeyError
'@'
KeyError: ('../../@id',)

During handling of the above exception, another exception occurred:

File "D:\rdfizer\semantify.py", line 542, in hash_maker_xml
    if row.find(child_object.parent[0]).text in hash_table:
File "D:\rdfizer\semantify.py", line 1891, in semantify_xml
    hash_maker_xml(child_root, triples_map_element,
File "D:\rdfizer\semantify.py", line 6012, in semantify
    output_file_descriptor).result()`
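The `KeyError: '@'` appears consistent with the limited XPath subset of Python's `xml.etree.ElementTree`: `find` accepts attributes in predicates such as `[@id]` but not a final `@id` step, and `..` cannot reach above the element `find` was called on. Below is a minimal sketch of resolving a reference like `../../@id` with an explicit parent map. It is not the fix applied in SDM-RDFizer; the XML sample and the `resolve_reference` helper are assumptions made for illustration, since the original XML did not survive in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical data, shaped to match the expected output in the report.
XML = """
<transport>
  <bus id="25" type="SingleDecker">
    <route>
      <stop id="645">Airport</stop>
      <stop id="651">Conference center</stop>
    </route>
  </bus>
</transport>
"""


def resolve_reference(elem, reference, parent_of):
    """Resolve a relative reference such as '../../@id' starting at elem."""
    node = elem
    for step in reference.split("/"):
        if node is None:
            return None
        if step == "..":
            node = parent_of.get(node)      # walk up via the parent map
        elif step == ".":
            continue                        # stay on the current node
        elif step.startswith("@"):
            return node.get(step[1:])       # attribute value ends the path
        else:
            node = node.find(step)          # descend to a child element
    return None if node is None else node.text


root = ET.fromstring(XML)
# ElementTree elements do not know their parent, so build a lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}

stop = root.find("bus/route/stop")
try:
    stop.find("../../@id")
    failure = None
except Exception as exc:  # the exact exception type may vary by Python version
    failure = type(exc).__name__

bus_id = resolve_reference(stop, "../../@id", parent_of)
print(failure, bus_id)
```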

eiglesias34 commented 7 months ago

Hello @manikyab,

Thank you for using SDM-RDFizer. I found the issues and fixed them. I ran your example and got the expected result. Please run it on your side, so we can close this issue.

Sincerely, Enrique

manikyab commented 7 months ago

Hello @eiglesias34, thanks for the prompt fix :) I am able to run the sample, but when I use a big XML file the code gets stuck in the while loop at line 1553 of semantify.py. The code has been running on my system for 10 hours and is still stuck on the same 9 lines of code.

If you want, I can share the file and mapping with you by email.

eiglesias34 commented 7 months ago

Hello again, Please send me the data and mapping. My e-mail is eiglesias34@gmail.com. Sincerely, Enrique

manikyab commented 6 months ago

Hi @eiglesias34, I have shared the files by email. Regards, Manikya

eiglesias34 commented 6 months ago

Hello again,

Sorry for the delay. I was able to find the problem. SDM-RDFizer fell into an infinite loop; it sometimes happens in Python when multiple levels have the same name. I noticed a couple of things in the example you sent me. First, many triples maps that use the "Unit" level generate nothing, since the data doesn't have a "Unit" level. Second, given its iterator, the triples map #ProductValuesMapping_L4 doesn't generate anything, since no values exist at that level (keep that in mind). Anyway, please test it out and tell me if everything is solved.

Sincerely, Enrique

manikyab commented 6 months ago

Hi @eiglesias34, thanks a lot for all your help so far; it has helped me a lot. #ProductValuesMapping_L4 is a container for data: it can contain data, hold another container for L5, or be empty. The same nesting can go on for up to 10 levels. With the new update in the mapping file, we are getting an error with the None mapping for the same input, which I am sharing with you by email so you can have a look. We are also looking to work with files larger than 2 GB or so. What would be the way to handle them with SDM-RDFizer?

Regards Manikya

eiglesias34 commented 6 months ago

Hello @manikyab,

I fixed the problem. I ran it, and everything was good. Please test it out on your side. Thank you for using SDM-RDFizer. Your test cases have helped a lot to find problems in the transformation process for XML files.

Regarding handling large data sources: a large data source will generate a KG several times larger than the source itself. Regardless of the tool used, creating a large KG will consume at least as much memory as the size of the KG. Even if the tool is very efficient, the characteristics of the environment where the creation process runs are an essential factor, so not just any environment will be enough. With that said, when using SDM-RDFizer you have two choices: run with duplicate removal or without. With duplicate removal, the creation process takes less time and generates a duplicate-free KG, but it consumes much more memory. Without duplicate removal, it consumes a lot less memory but takes longer, since more triples are generated. These scenarios apply if there are duplicate records in the data source. If there are no duplicates in the data source, I would recommend running SDM-RDFizer without duplicate removal: both modes generate the same number of triples and take roughly the same amount of time, but running without duplicate removal consumes less memory.
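The memory side of this trade-off can be pictured with a small sketch. This is not SDM-RDFizer's implementation, only an illustration under the assumption that duplicate removal keeps every emitted triple in an in-memory set; the function name and the sample triples are hypothetical.

```python
def emit_triples(triples, remove_duplicates):
    """Return (written triples, number of triples held in memory for dedup)."""
    seen = set()      # grows with the number of distinct triples
    written = []
    for triple in triples:
        if remove_duplicates:
            if triple in seen:
                continue          # skip a triple that was already written
            seen.add(triple)
        written.append(triple)
    return written, len(seen)


# Stop 651 appears under two buses, so its triples are generated twice.
generated = [
    '<http://trans.example.com/stop/645> <http://example.com/ns#stopLabel> "Airport".',
    '<http://trans.example.com/stop/651> <http://example.com/ns#stopLabel> "Conference center".',
    '<http://trans.example.com/stop/873> <http://example.com/ns#stopLabel> "Central Park".',
    '<http://trans.example.com/stop/651> <http://example.com/ns#stopLabel> "Conference center".',
]

with_dedup, held_with = emit_triples(generated, remove_duplicates=True)
without_dedup, held_without = emit_triples(generated, remove_duplicates=False)
print(len(with_dedup), held_with)        # fewer triples written, set kept in memory
print(len(without_dedup), held_without)  # more triples written, nothing retained
```

When the source has no duplicates, both modes write the same triples, and the set is pure overhead.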

I hope this helps to answer your question.

Sincerely, Enrique Iglesias

manikyab commented 6 months ago

Hi @eiglesias34, we ran SDM-RDFizer and compared its output with the output of the Java RMLMapper, and found that many triples are missing from the SDM-RDFizer output and some triples are incorrectly formed. I am sharing both output files with you by email. Please have a look, and if you have any questions, please email me back.

In semantify.py, line 746, it should be parent_parent_map, not parent_parent_parent_map, as the latter has no reference anywhere in the code.

Regards Manikya Bansal

eiglesias34 commented 6 months ago

Hello @manikyab,

I hope you are doing well. I'm sorry I haven't written earlier; I've had other work commitments. Here is the progress report. First of all, most of the difference in triples is because you ran SDM-RDFizer with duplicate removal while you ran RMLMapper without duplicate removal. Secondly, among the triples maps that were generating triples with RMLMapper but not with SDM-RDFizer was #ProductValuesMapping_L1. I ran SDM-RDFizer on my side and could not generate those triples. So, I ran RMLMapper and got the same result as SDM-RDFizer. I'm sending you the resulting KGs by email so you can see what I'm talking about. I ran both with duplicate removal. Afterward, I ran an XPath generator to check whether the iterators of the triples maps were correct, and I saw that the iterator for #ProductValuesMapping_L1 did not generate any data (I'm sending you by email all the XPaths that could be generated from the data you sent me). Therefore, I think the problem is that when you ran RMLMapper, you must have used a different data source. Just in case, I'm sending you back the mapping and data you sent me so you can check. I hope this answers your question. I also wasn't able to reproduce the error you mentioned with the mapping and data you sent me.

Sincerely, Enrique Iglesias

manikyab commented 6 months ago

Hi @eiglesias34, thank you for all your help. I was out of town, so I was not able to reply promptly. I found 2 issues with data handling: 1) The double quote character (") is not handled properly, so it causes an error while loading the data (line 1192 in text_xml.nt). 2) The URL is written in encoded form in the .nt file, which can sometimes cause issues with querying (line 1240 in text_xml.nt).

If you need any help, please let me know and we can connect over a Google Meet call.
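For the first point, N-Triples requires certain characters inside a literal to be backslash-escaped, including the double quote, the backslash itself, and line breaks. A minimal sketch of such escaping follows; the helper names are hypothetical and this is not the code used in SDM-RDFizer.

```python
# Characters that must be escaped inside an N-Triples string literal.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r"}


def escape_literal(value):
    """Escape a Python string for use between the quotes of an N-Triples literal."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def literal_triple(subject, predicate, value):
    """Format one triple with a plain literal object."""
    return f'<{subject}> <{predicate}> "{escape_literal(value)}" .'


print(literal_triple(
    "http://trans.example.com/stop/651",
    "http://example.com/ns#stopLabel",
    'Conference "center"',
))
```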

Regards Manikya Bansal

eiglesias34 commented 5 months ago

Hi @manikyab, I'm sorry it took me so long to respond. I had to take care of a lot of work-related commitments. I was able to fix the issues you mentioned. I uploaded the output to GraphDB to ensure everything was correct, and it was uploaded with no problem. Please test it out on your side so that we can close this issue.

Sincerely, Enrique Iglesias

manikyab commented 5 months ago

Hi @eiglesias34, I found and fixed some issues in the XML parsing code and created a PR for them. It encodes the data based on RFC 3987, as per the specification defined here

Please have a look, and thank you so much for all the help you have given me.
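As a rough illustration of RFC 3987-based encoding: R2RML defines an "IRI-safe" form of a template value by percent-encoding every character outside the `iunreserved` production of RFC 3987 (ASCII letters, digits, `-`, `.`, `_`, `~`, and the `ucschar` ranges). The sketch below follows that definition; it is an assumption about what such encoding could look like, not the content of the PR, and the function names are hypothetical.

```python
import string

_UNRESERVED_ASCII = set(string.ascii_letters + string.digits + "-._~")

# ucschar ranges from RFC 3987.
_UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)


def is_iunreserved(ch):
    """True if ch may appear unencoded in an IRI-safe template value."""
    if ch in _UNRESERVED_ASCII:
        return True
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in _UCSCHAR_RANGES)


def iri_safe(value):
    """Percent-encode (UTF-8) every character that is not iunreserved."""
    return "".join(
        ch if is_iunreserved(ch)
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in value
    )


print("http://trans.example.com/stop/" + iri_safe("Conference center"))
```

Under this rule non-ASCII letters such as `é` stay readable, while spaces and reserved characters like `/` are encoded.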

Regards Manikya Bansal

eiglesias34 commented 4 months ago

Hello @manikyab,

I know it has been quite a while since I last wrote. I have been working on incorporating the new RML formulation into SDM-RDFizer, and it has taken a lot of my time. So, I incorporated your changes myself and closed the pull request. I'll be releasing the new version of SDM-RDFizer quite soon.

I appreciate your help. Sincerely, Enrique Iglesias

eiglesias34 commented 3 months ago

Hello @manikyab,

Given the inactivity of this issue and the fact that the problem was solved, I am going to close this issue. If need be, it can be opened again.

Sincerely, Enrique Iglesias

rppala3 commented 1 month ago

Hi @eiglesias34, the issue seems to persist. Installed rdfizer version: 4.7.4.9

I copied and pasted the code from the very first box, but rdfizer still doesn't work:

Expected result:

<http://trans.example.com/bus/25> <http://example.com/ns#stop> <http://trans.example.com/stop/645> .
<http://trans.example.com/bus/25> <http://example.com/ns#stop> <http://trans.example.com/stop/651> .
<http://trans.example.com/bus/25> <http://example.com/ns#type> <http://trans.example.com/vehicle/SingleDecker> .
<http://trans.example.com/bus/25> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Transport> .
<http://trans.example.com/bus/47> <http://example.com/ns#stop> <http://trans.example.com/stop/651> .
<http://trans.example.com/bus/47> <http://example.com/ns#stop> <http://trans.example.com/stop/873> .
<http://trans.example.com/bus/47> <http://example.com/ns#type> <http://trans.example.com/vehicle/DoubleDecker> .
<http://trans.example.com/bus/47> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Transport> .
...

The actual result:

<http://trans.example.com/bus/25> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Transport>.
<http://trans.example.com/bus/25> <http://example.com/ns#type> <http://trans.example.com/vehicle/SingleDecker>.
<http://trans.example.com/bus/47> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/ns#Transport>.
<http://trans.example.com/bus/47> <http://example.com/ns#type> <http://trans.example.com/vehicle/DoubleDecker>.
...

The issue: the links between the buses and the stops are missing.

manikyab commented 1 month ago

I have tried a few fixes here and there, and they solve the majority of the cases. I will share the code if you want.