UnifiedViews / Core

T-metadata and T-sparql very slow - Core/Storage performance problem - Problem with too many graphs in the query #226

Closed jakubklimek closed 9 years ago

jakubklimek commented 9 years ago

I have suspected this for a long time, and today I confirmed it. There may be a problem in t-metadata and t-sparql themselves, but it looks more like a Core or even a Storage issue, because judging from the code there is not much that could be improved in t-metadata. At https://github.com/UnifiedViews/Plugins/blob/master/t-metadata/src/main/java/eu/unifiedviews/plugins/transformer/metadata/Metadata.java#L250, t-metadata issues 6 basic SPARQL queries to compute statistics. These queries take a VERY long time, depending on the size of the data. For the RUIAN dataset (600M triples), the first query (or maybe the first two) has been running for 2 days and counting, with the Sesame storage taking approx. 125% of CPU the whole time.

I issued the same 6 queries in our Virtuoso instance of this dataset, running on a 16GB RAM machine, and all 6 together were computed within an hour.

This means that the whole RUIAN pipeline is doable (outside of UnifiedViews) in approx. 15 hours.

I suggest loading the RUIAN data to a standalone Sesame and issuing these queries to rule out Core problems. Then, I suggest switching to Virtuoso (which means dealing with the storage-per-UV problem and possible Virtuoso bugs).

ghost commented 9 years ago

As I told you and Petr in Bratislava, you can test this particular pipeline with Virtuoso even without solving the t-sparql graph issues. Looking forward to seeing any results from it.

jakubklimek commented 9 years ago

Well, I am running the RUIAN pipeline manually with Virtuoso, outside of UV, and these are the results (15 hours). The only thing left is to configure UV to do it in Virtuoso. But OK, we will do that and let you know.

ghost commented 9 years ago

That should be a matter of minutes.

jakubklimek commented 9 years ago

Tried, but just switching to the Virtuoso RDF storage does not work, because the symbolic names implementation does not support Virtuoso correctly.

Execution failed because: Error when downloading. Symbolic name /D32-datasets.csv from location http://opendata.cz/kuba/D32-datasets.csv could not be saved to file:///var/lib/uv/working/exec_203890560292568054301/D32dataset8896520579453326676
eu.unifiedviews.dpu.DPUException: Error when downloading. Symbolic name /D32-datasets.csv from location http://opendata.cz/kuba/D32-datasets.csv could not be saved to file:///var/lib/uv/working/exec_203890560292568054301/D32dataset8896520579453326676
    at eu.unifiedviews.plugins.extractor.httpdownload.HttpDownload.execute(HttpDownload.java:82)
    at cz.cuni.mff.xrg.odcs.backend.execution.dpu.DPUExecutor.executeInstance(DPUExecutor.java:231)
    at cz.cuni.mff.xrg.odcs.backend.execution.dpu.DPUExecutor.execute(DPUExecutor.java:369)
    at cz.cuni.mff.xrg.odcs.backend.execution.dpu.DPUExecutor.run(DPUExecutor.java:451)
    at java.lang.Thread.run(Thread.java:745)
Caused by: eu.unifiedviews.dataunit.DataUnitException: org.openrdf.query.UpdateExecutionException: : SPARQL execute failed:[DELETE {?s ?predicate ?o} INSERT {?s ?predicate ?object} WHERE { ?s ?symbolicName. OPTIONAL {?s ?predicate ?o} } ] Exception:virtuoso.jdbc4.VirtuosoException: SQ074: Line 2: SP031: SPARQL compiler: No default graph specified in the preamble, but it is needed for triple constructor in DELETE {...} without GRAPH {...}
    at eu.unifiedviews.helpers.dataunit.internal.metadata.MetadataHelpers$WritableMetadataHelperImpl.set(MetadataHelpers.java:322)
    at eu.unifiedviews.helpers.dataunit.virtualpathhelper.VirtualPathHelpers$VirtualPathHelperImpl.setVirtualPath(VirtualPathHelpers.java:130)
    at eu.unifiedviews.plugins.extractor.httpdownload.HttpDownload.execute(HttpDownload.java:77)
    ... 4 more
Caused by: org.openrdf.query.UpdateExecutionException: : SPARQL execute failed:[DELETE {?s ?predicate ?o} INSERT {?s ?predicate ?object} WHERE { ?s ?symbolicName. OPTIONAL {?s ?predicate ?o} } ] Exception:virtuoso.jdbc4.VirtuosoException: SQ074: Line 2: SP031: SPARQL compiler: No default graph specified in the preamble, but it is needed for triple constructor in DELETE {...} without GRAPH {...}
    at virtuoso.sesame2.driver.VirtuosoRepositoryConnection.executeSPARUL(Unknown Source)
    at virtuoso.sesame2.driver.VirtuosoRepositoryConnection$4.execute(Unknown Source)
    at eu.unifiedviews.helpers.dataunit.internal.metadata.MetadataHelpers$WritableMetadataHelperImpl.set(MetadataHelpers.java:320)
    ... 6 more
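
The Virtuoso message says the DELETE template has no graph to apply to. One possible way to satisfy it, sketched here under the assumption that the metadata triples live in a single known named graph, is to scope the whole update with a SPARQL 1.1 `WITH <graph>` clause. The function names, the graph IRI and the predicate IRIs are hypothetical; the real update string built in MetadataHelpers is not reproduced here, and whether this alone is enough for the Virtuoso driver has not been verified.

```python
def escape_literal(value):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def build_scoped_update(graph_iri, name_predicate, symbolic_name,
                        predicate, new_value):
    """Build a DELETE/INSERT update bound to one named graph via WITH.

    WITH makes graph_iri the target of the DELETE and INSERT templates and
    the graph matched by WHERE, so the update does not rely on a default graph.
    All IRIs passed in are placeholders chosen by the caller (hypothetical).
    """
    return (
        f'WITH <{graph_iri}>\n'
        f'DELETE {{ ?s <{predicate}> ?o }}\n'
        f'INSERT {{ ?s <{predicate}> "{escape_literal(new_value)}" }}\n'
        f'WHERE {{\n'
        f'  ?s <{name_predicate}> "{escape_literal(symbolic_name)}" .\n'
        f'  OPTIONAL {{ ?s <{predicate}> ?o }}\n'
        f'}}'
    )


if __name__ == "__main__":
    # Example values only; none of these IRIs come from UnifiedViews.
    print(build_scoped_update(
        "http://example.org/metadata-graph",
        "http://example.org/symbolicName",
        "/D32-datasets.csv",
        "http://example.org/virtualPath",
        "D32-datasets.csv",
    ))
```

An equivalent form wraps each template and the WHERE pattern in `GRAPH <...> { ... }`; `WITH` is just the shorter way to express the same scoping.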

skrchnavy commented 9 years ago

@pshoda / @jakubklimek: please confirm that this is solved in the virtuoso feature branch.

jakubklimek commented 9 years ago

Not really. There is a workaround for specific cases ("per-graph" execution in t-sparqlConstruct and t-sparqlUpdate, followed by a graph merger and then t-metadata), but the general problem is that when there is a larger number of graphs (hundreds or more), the triplestore (both Sesame and Virtuoso) is unable to process a query with so many FROM clauses or graphs set in the Dataset.
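
To illustrate the "per-graph" idea, here is a minimal sketch that splits a long list of graphs into small batches and emits one CONSTRUCT query per batch, so that no single query carries hundreds of FROM clauses. It is not the DPU code: the function names and the batch size are assumptions, and the approach only gives the same result as the single big query when the pattern does not need to join triples across graphs in different batches.

```python
def batches(graphs, size):
    """Yield consecutive slices of `graphs` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(graphs), size):
        yield list(graphs[start:start + size])


def construct_queries(graphs, template, pattern, size=1):
    """Yield one CONSTRUCT query per batch of graphs.

    With size=1 this is strict per-graph execution; a larger size trades
    fewer queries for more FROM clauses in each of them.
    """
    for batch in batches(graphs, size):
        from_clauses = "\n".join(f"FROM <{g}>" for g in batch)
        yield (
            f"CONSTRUCT {{ {template} }}\n"
            f"{from_clauses}\n"
            f"WHERE {{ {pattern} }}"
        )


if __name__ == "__main__":
    # Hypothetical graph IRIs, for illustration only.
    input_graphs = [f"http://example.org/graph/{i}" for i in range(5)]
    for query in construct_queries(input_graphs, "?s ?p ?o", "?s ?p ?o", size=2):
        print(query)
        print("---")
```

The outputs of the individual queries would then be merged into one graph before t-metadata runs, as described above.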

tomas-knap commented 9 years ago

Closed, as it duplicates: https://github.com/UnifiedViews/Core/issues/188