SPARQL transformer is too slow

UnifiedViews / Core

SPARQL transformer is too slow #19

Closed martinnec closed 10 years ago

martinnec commented 10 years ago

I know I said that optimization of ODCS is not a priority now. However, I am very surprised by how the SPARQL transformer performs: it is very slow.

Example: see the pipeline "[CZSO] Public DB - Demography", execution from 17.3.2014 23:38:16. The SPARQL Transformer "District demography Data Cube Vocabulary RDF representation" ran for 9 minutes. However, when I go to Browse/Query, choose the DPU and its input, and run the same query that the transformer executes, the query is evaluated in a few seconds.

This suggests that the SPARQL transformer is on the order of 100 times slower than running the same SPARQL query directly in the triplestore. I do not want to believe it, but it is probably true :-/.

jakubklimek commented 10 years ago

https://github.com/mff-uk/ODCS/issues/887

tomas-knap commented 10 years ago

The goal of this task is to have roughly the same processing time as when the query is executed directly in the browse/query or via "/sparql" page of Virtuoso.

Please take into account that it could have been slow because other pipelines were running at the same time. So @marcoony, please first check on a separate run whether that was the cause.

tomas-knap commented 10 years ago

@marcoony Adjust the code as needed, but please do not change the DPUConfig class. If you do change it, please ensure that the transformer stays compatible with the current version of the SPARQL transformer DPU on http://odcs.xrg.cz:8080/odcleanstore

tomas-knap commented 10 years ago

@marcoony You can also use the JUnit tests associated with the SPARQL transformer. Also check whether the problem is in the query execution itself or in the merging of the data (the outputs of the previous DPU are merged into the inputs of the SPARQL transformer), which may take some time.
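One way to separate the two suspects (merging the inputs versus executing the query) is to time each phase on its own. The following is only a minimal sketch: `PhaseTimer` and the phase names are hypothetical and not part of ODCS, and the `pretendWork` calls are placeholders for the real merge and query steps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ODCS: accumulates wall-clock time per named phase.
final class PhaseTimer {
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    /** Runs the given work and adds its elapsed time to the named phase. */
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        work.run();
        nanos.merge(phase, System.nanoTime() - start, Long::sum);
    }

    /** Elapsed time per phase in milliseconds, in the order phases were first timed. */
    Map<String, Long> millis() {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : nanos.entrySet()) {
            out.put(e.getKey(), e.getValue() / 1_000_000);
        }
        return out;
    }

    // Placeholder for real work (assumption: stands in for the merge or query step).
    static void pretendWork(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        timer.time("merge inputs", () -> pretendWork(30));
        timer.time("execute query", () -> pretendWork(10));
        System.out.println(timer.millis());
    }
}
```

If the time reported for merging dominates, the query itself is probably not the bottleneck.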

Jan-Marcek commented 10 years ago

The problem was how data was added to the destination in BaseRDFRepo. The original version:

    get all statements from graph
    for statement in statements:
        target.add(statement)

The fixed version:

    target.add(graph)

I tested this with a JUnit test, where the difference between the original and the fixed version was significant: the original version took 40 seconds and the fixed version took 4 seconds. I didn't make any changes to the SPARQL Transformer.
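The difference between the two versions can be illustrated with a toy model. This is only a sketch under stated assumptions: `CountingTarget` is a hypothetical stand-in, not the BaseRDFRepo or repository API, and it merely counts how many separate add calls the target receives. The thread does not say why the bulk call is faster; per-call overhead is one plausible reason.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names come from ODCS.
final class BulkAddSketch {

    /** Toy target that counts how many separate add calls it receives. */
    static final class CountingTarget {
        private final List<String> statements = new ArrayList<>();
        private int calls = 0;

        void add(String statement) {
            calls++;
            statements.add(statement);
        }

        void addAll(List<String> batch) {
            calls++;
            statements.addAll(batch);
        }

        int calls() { return calls; }
        int size() { return statements.size(); }
    }

    /** Shape of the original version: one call per statement. */
    static void copyOneByOne(List<String> graph, CountingTarget target) {
        for (String statement : graph) {
            target.add(statement);
        }
    }

    /** Shape of the fixed version: a single call for the whole graph. */
    static void copyInBulk(List<String> graph, CountingTarget target) {
        target.addAll(graph);
    }

    public static void main(String[] args) {
        List<String> graph = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            graph.add("<s" + i + "> <p> <o> .");
        }
        CountingTarget slow = new CountingTarget();
        CountingTarget fast = new CountingTarget();
        copyOneByOne(graph, slow);
        copyInBulk(graph, fast);
        System.out.println("one by one: " + slow.calls() + " calls for " + slow.size() + " statements");
        System.out.println("bulk: " + fast.calls() + " call for " + fast.size() + " statements");
    }
}
```

Both targets end up with the same statements; only the number of calls differs, which matters when every call carries a fixed cost.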

I suggest trying your case again; you need to rebuild the backend first. I'm going to find and correct all usages of the addStatement method, which is inefficient.

feb0701e845541586fb554799f110ed59a86b82c

Jan-Marcek commented 10 years ago

In addition, the name of the method addTriplesFromGraph in BaseRDFRepo doesn't make sense. I suggest renaming it to addGraph.

ghost commented 10 years ago

marcoony: according to our discussion, that method will be removed in the future.