SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

Wrong number of triples obtained with large_file in yes #77

Closed jatoledo closed 2 years ago

jatoledo commented 2 years ago

Hi!

We are running SDM-RDFizer v4.0.5 with GTFS-Madrid-Bench. For the CSV case with scaling factor 1, we obtained a different number of triples than with rmlmapper and Morph-KGC (we have verified that those two produce the same number of triples). In addition, SDM-RDFizer produces a different number of triples depending on the execution (e.g. 103038 triples in some runs and 337121 in others). The config file that we are using is:

[default]
main_directory: .

[datasets]
number_of_datasets: 1
output_folder: ${default:main_directory}/output_rdfizer
remove_duplicate: yes
all_in_one_file: yes
name: gtfs
enrichment: yes
large_file: yes
ordered: yes

[dataset1]
name: gtfs
mapping: ${default:main_directory}/mapping.csv.ttl

Our mapping is attached below:

mapping.csv.ttl.txt
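
The ${default:main_directory} references in the config above match the syntax of extended interpolation in Python's configparser. As a minimal sketch, assuming the file is read that way (this thread does not show how SDM-RDFizer itself parses it), the paths and flags would resolve as follows. CONFIG_TEXT and load_config are illustrative names, not part of the engine.

```python
import configparser

# Copy of the config from the report, inlined only for illustration.
CONFIG_TEXT = """
[default]
main_directory: .

[datasets]
number_of_datasets: 1
output_folder: ${default:main_directory}/output_rdfizer
remove_duplicate: yes
all_in_one_file: yes
name: gtfs
enrichment: yes
large_file: yes
ordered: yes

[dataset1]
name: gtfs
mapping: ${default:main_directory}/mapping.csv.ttl
"""


def load_config(text):
    """Parse the config text with ${section:option} interpolation enabled."""
    parser = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation()
    )
    parser.read_string(text)
    return parser


config = load_config(CONFIG_TEXT)

# Paths are expanded relative to main_directory.
print(config["datasets"]["output_folder"])
print(config["dataset1"]["mapping"])

# getboolean accepts yes/no, true/false, on/off and 1/0.
print(config["datasets"].getboolean("large_file"))
```

Whether the engine normalizes yes/no and true/false in the same way as getboolean is an assumption here.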

Below is the log with the number of triples obtained by rmlmapper:

11:16:19.501 [main] INFO  be.ugent.rml.cli.Main               .writeOutputTargets(415) - Target: <rmlmapper://default.store> has 395953 results

Thanks in advance

eiglesias34 commented 2 years ago

Hello @jatoledo,

I have run this dataset before and the difference in triples comes from the fact that the SDM-RDFizer does not generate triples in which the original value is None or empty, while the rmlmapper does generate these triples.

I hope this helps

dachafra commented 2 years ago

@eiglesias34 the GTFS-Madrid-Bench does not have any empty values in the original data, so that cannot be the reason for the difference among the engines. We ran RMLMapper and SDM-RDFizer, sorted the results, and ran a diff on them. Please see below that there are triples from this TriplesMap, and also from this one, that are not being generated.

Could you take a look? I just remembered that this problem appeared in the past as well! results_and_diff.zip
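
The sort-and-diff comparison described above can also be sketched in Python. This is a rough sketch, assuming both engines write N-Triples with one triple per line and serialize identical triples identically (no blank nodes or differing literal escapes). The function names and any file names passed to them are hypothetical.

```python
def load_triples(path):
    """Return the set of non-empty, stripped lines of an N-Triples file."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def missing_triples(reference_path, candidate_path):
    """Return, sorted, the triples found in the reference but not in the candidate."""
    reference = load_triples(reference_path)
    candidate = load_triples(candidate_path)
    return sorted(reference - candidate)
```

Calling missing_triples with the rmlmapper output as the reference and the SDM-RDFizer output as the candidate would list the triples that were not generated. Grouping that list by predicate is one way to trace them back to a TriplesMap.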

jatoledo commented 2 years ago

Hi @eiglesias34, the dataset is attached here:

dataset

eiglesias34 commented 2 years ago

Hello @jatoledo,

I just ran the data that you sent me and got this result: kgc.zip

I got the same number of triples. I am running the most recent version of the SDM-RDFizer.

This is the config file I'm running.

[default]
main_directory: /home/enrique/Documents

[datasets]
number_of_datasets: 1
output_folder: ./issue/graph
all_in_one_file: no
remove_duplicate: yes
enrichment: yes
name: output
ordered: yes
large_file: false

[dataset1]
name: kgc
mapping: ${default:main_directory}/mapping.ttl
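
The config above and the one in the original report differ in a few options of the [datasets] section. A small sketch like the following lists them, which helps narrow down which setting changes the output. The two strings copy only the [datasets] sections from this thread; read_section and differing_options are hypothetical helper names.

```python
import configparser

# [datasets] section from the original report.
REPORTED = """
[datasets]
number_of_datasets: 1
output_folder: ${default:main_directory}/output_rdfizer
remove_duplicate: yes
all_in_one_file: yes
name: gtfs
enrichment: yes
large_file: yes
ordered: yes
"""

# [datasets] section from the run that produced the expected count.
WORKING = """
[datasets]
number_of_datasets: 1
output_folder: ./issue/graph
all_in_one_file: no
remove_duplicate: yes
enrichment: yes
name: output
ordered: yes
large_file: false
"""


def read_section(text, section="datasets"):
    """Return the raw option values of one section (no interpolation)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[section])


def differing_options(first, second):
    """Map each option whose value differs to its (first, second) pair."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }


differences = differing_options(read_section(REPORTED), read_section(WORKING))
for option, (reported, working) in differences.items():
    print(f"{option}: {reported!r} vs {working!r}")
```

Besides the output location and name, the two runs differ in all_in_one_file and large_file.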

I hope this helps.

dachafra commented 2 years ago

Problem finally found: the number of results is not the expected one when large_file: true.

eiglesias34 commented 2 years ago

I was able to find the problem and fix it.

dachafra commented 2 years ago

Now, with ordered set to true and large_file set to either yes or no, it works.