SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

The time-consuming problem of converting csv data to RDF #59

Open nullgogo opened 3 years ago

nullgogo commented 3 years ago

Problem Description:

Converting 8 CSV files (about 600M of data in total) to RDF took more than a day. We also tested converting just two of the CSV files on their own, and that alone took several hours.

Data source:

The data comes from a CMDB: 8 CSV files in total, including host (18M), vm (18M), software (160M), and other data. There are one-to-many and many-to-many semantic relationships among these data.

[Image 1]
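
As a rough illustration of why the one-to-many and many-to-many relationships described above matter for run time, here is a minimal Python sketch of a hash join between two CSV sources. It is not SDM-RDFizer's implementation. The column names (`host_id`, `vm_id`), the sample rows, the `runsOn` predicate, and the `http://example.org/` namespace are all hypothetical, since the real CMDB schema is not shown in this issue. The idea is to index the parent source by its join key once, then stream the child source and look up matches, instead of rescanning the parent for every child row.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real CMDB columns are not shown in this issue.
HOST_CSV = "host_id,hostname\nh1,alpha\nh2,beta\n"
VM_CSV = "vm_id,host_id\nv1,h1\nv2,h1\nv3,h2\nv4,h9\n"

BASE = "http://example.org/"  # placeholder namespace (assumption)


def index_by_key(rows, key):
    """Build a hash index: join-key value -> list of rows with that value."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def join_triples(child_rows, parent_index, child_key, child_id, parent_id):
    """Yield one N-Triples line per matching (child, parent) pair."""
    for child in child_rows:
        # .get() avoids inserting empty entries for unmatched keys.
        for parent in parent_index.get(child[child_key], []):
            yield (
                f"<{BASE}vm/{child[child_id]}> "
                f"<{BASE}runsOn> "
                f"<{BASE}host/{parent[parent_id]}> ."
            )


hosts = csv.DictReader(io.StringIO(HOST_CSV))
vms = csv.DictReader(io.StringIO(VM_CSV))

# One pass over the parent file to build the index, one pass over the child file.
host_index = index_by_key(hosts, "host_id")
triples = list(join_triples(vms, host_index, "host_id", "vm_id", "host_id"))

for line in triples:
    print(line)
```

With an index like this, each child row costs one dictionary lookup, so the work grows roughly with the sum of the two file sizes plus the number of matches, rather than with their product as in a nested scan. The row for `v4` produces no triple because its `host_id` has no match in the host data.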

Config.ini and mapping.ttl Configuration:

[Image 2] [Image 3]

Execute:

[Image 4]

Environment: OS: CentOS 7; CPU cores: 64; memory: 96G

tangyong commented 3 years ago

@eiglesias34 We would like to ask your team to help us look into the performance problem above.

1. [Problem domain] Our AIOps team is building our infrastructure operations KG using SDM-RDFizer.
2. Please give us some suggestions or directions for deeper investigation.
3. If you need any other information, please tell me.

Thanks!

mevs commented 3 years ago

Dear @tangyong

Many thanks for sharing this use case. We have implemented new optimization techniques to speed up the execution of the joins in the mappings. Please let us arrange a meeting so that we can share the new version with you; it is still in the development stage. Please contact me at maria.vidal@tib.eu

Best regards, Maria-Esther Vidal

tangyong commented 3 years ago

Thanks very much, @mevs! I will arrange a meeting and contact you.

tangyong commented 3 years ago

Dear @mevs ,

I have discussed this with my team, and we would first like to obtain the new optimized version so that we can compare the performance improvement and report back to you. I will send my request to your email.

Thanks!

tangyong commented 3 years ago

Dear @mevs @dachafra @eiglesias34 ,

We have prepared a dataset that reproduces the problem, and we would like to send it to you to help with the investigation and fix. If you have time to help us, please tell me how to share the dataset (~800M), and we will upload it to shared storage.

Thanks!
Best regards, Tang.