Letractively / rdflib

Automatically exported from code.google.com/p/rdflib

querying a tiny Sleepycat db takes 30 seconds #100

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

I'm running this script to query a pre-populated RDF store. The store only
has two small RDFa files indexed in it...

#!/usr/bin/env python

# queries an RDF quadstore

from rdflib.graph import ConjunctiveGraph

g = ConjunctiveGraph("Sleepycat")
g.open("store", create=True)

q1="""PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?src1 ?src2 ?x 
    WHERE {
        GRAPH ?src1 { ?gr1 foaf:member [ foaf:openid ?x ] }
        GRAPH ?src2 { ?gr2 foaf:member [ foaf:openid ?x ] }
        FILTER ( ?src1 != ?src2 )
    }"""

for src1, src2, x in  g.query(q1):
    print src1, src2, x

TellyClub:wordpress danbri$ time ./querytest.py 
/Users/danbri/working/rdflib/rdflib-read-only/rdflib/store/MySQL.py:10: UserWarning: MySQLdb is not installed
  warnings.warn("MySQLdb is not installed")
/Users/danbri/working/rdflib/rdflib-read-only/rdflib/store/AbstractSQLStore.py:5: DeprecationWarning: the sha module is deprecated; use the hashlib module instead
  import sha,sys, weakref
/Users/danbri/working/rdflib/rdflib-read-only/rdflib/sparql/Query.py:1: DeprecationWarning: the sets module is deprecated
  import types, sets, sys
http://danbri.org/words/network http://inkdroid.org/journal/network http://danbri.org/
http://inkdroid.org/journal/network http://danbri.org/words/network http://danbri.org/

real    0m33.301s
user    0m32.178s
sys 0m0.376s

Here's the script that loads the db:

#!/usr/bin/env python

from rdflib.graph import ConjunctiveGraph

# Test basic SPARQL aggregation of the RDFa data
# see also http://identi.ca/notice/17728953 http://identi.ca/notice/17729227

g = ConjunctiveGraph("Sleepycat")
g.open("store", create=True)
g.parse("http://inkdroid.org/journal/network", format='rdfa', lax=True)
g.parse("http://danbri.org/words/network", format='rdfa', lax=True)

q1="""PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?src1 ?src2 ?x 
    WHERE {
        GRAPH ?src1 { ?gr1 foaf:member [ foaf:openid ?x ] }
        GRAPH ?src2 { ?gr2 foaf:member [ foaf:openid ?x ] }
        FILTER ( ?src1 != ?src2 )
    }"""

for src1, src2, x in  g.query(q1):
    print src1, src2, x

Original issue reported on code.google.com by danbrick...@gmail.com on 29 Dec 2009 at 6:30

GoogleCodeExporter commented 9 years ago
P.S. I've just tried adding g.close() to the crawler and rebuilding the db... same result.

Original comment by danbrick...@gmail.com on 29 Dec 2009 at 6:33

GoogleCodeExporter commented 9 years ago
I see the same behavior once the store has been populated with some triples (attached). I wonder if there is some sort of Cartesian join going on with this query?

ed@curry:~$ time ./foo.py 
/home/ed/Projects/rdflib/rdflib/store/MySQL.py:10: UserWarning: MySQLdb is not installed
  warnings.warn("MySQLdb is not installed")
/home/ed/Projects/rdflib/rdflib/store/AbstractSQLStore.py:5: DeprecationWarning: the sha module is deprecated; use the hashlib module instead
  import sha,sys, weakref
/home/ed/Projects/rdflib/rdflib/sparql/Query.py:1: DeprecationWarning: the sets module is deprecated
  import types, sets, sys
http://danbri.org/words/network http://inkdroid.org/journal/network http://danbri.org/
http://inkdroid.org/journal/network http://danbri.org/words/network http://danbri.org/

real    0m23.067s
user    0m22.610s
sys 0m0.150s
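
To make the Cartesian-join suspicion concrete, here is a minimal sketch in plain Python. It is not rdflib's actual evaluator: the rows are made-up stand-ins for what each GRAPH pattern might match, and `nested_loop_join` and `hash_join` are hypothetical names. It only shows how comparing every row of one pattern against every row of the other grows quadratically, while joining on the shared `?x` value stays linear.

```python
from collections import defaultdict

# Hypothetical stand-in for the rows matched by one GRAPH pattern:
# (source graph, openid) pairs. This is not real data from the store.
rows = [("g%d" % (i % 2), "http://example.org/id/%d" % i) for i in range(1000)]
rows.append(("g0", "http://example.org/shared"))
rows.append(("g1", "http://example.org/shared"))


def nested_loop_join(left, right):
    """Compare every left row with every right row (Cartesian product)."""
    touched = 0
    results = []
    for src1, x1 in left:
        for src2, x2 in right:
            touched += 1
            if x1 == x2 and src1 != src2:
                results.append((src1, src2, x1))
    return results, touched


def hash_join(left, right):
    """Index the right side by openid, then probe it once per left row."""
    index = defaultdict(list)
    touched = 0
    for src2, x2 in right:
        touched += 1
        index[x2].append(src2)
    results = []
    for src1, x1 in left:
        touched += 1
        for src2 in index[x1]:
            if src1 != src2:
                results.append((src1, src2, x1))
    return results, touched


slow_results, slow_touched = nested_loop_join(rows, rows)
fast_results, fast_touched = hash_join(rows, rows)
print(sorted(slow_results) == sorted(fast_results))
print(slow_touched, fast_touched)
```

Both strategies return the same pairs; only the number of rows touched differs, which is the kind of gap a cursor-call count in a profile would expose.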

Original comment by ed.summers on 29 Dec 2009 at 7:00

GoogleCodeExporter commented 9 years ago
Some rudimentary profiling with the Python profile module shows that it is spending most of its time calling DBCursor.next. That query generated 25,610,678 calls to next()!

I wonder if other SPARQL-enabled triplestores optimize that query any better... I've attached the profile dump if you are interested in taking a look:

import pstats
# Load the attached dump and print its entries sorted by internal time.
stats = pstats.Stats('sparql.prof')
stats.sort_stats('time')
stats.reverse_order()
stats.print_stats()
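
The snippet above only reads an existing dump. For anyone wanting to reproduce one, here is a minimal sketch using the standard-library `cProfile` module. `run_query` is a hypothetical stand-in for the loop over `g.query(q1)`, and writing to a temporary directory is my own choice; only the file name `sparql.prof` comes from the snippet above.

```python
import cProfile
import os
import pstats
import tempfile


def run_query():
    """Hypothetical stand-in for iterating over g.query(q1)."""
    return sum(i * i for i in range(100000))


# Write the dump to a temporary directory rather than the working directory.
prof_path = os.path.join(tempfile.mkdtemp(), "sparql.prof")

# Profile only the call we care about, then save the raw statistics.
profiler = cProfile.Profile()
profiler.enable()
run_query()
profiler.disable()
profiler.dump_stats(prof_path)

# Load the dump and show the ten entries with the most internal time.
stats = pstats.Stats(prof_path)
stats.sort_stats("time").print_stats(10)
```

In the real script, the body of `run_query` would be the `for src1, src2, x in g.query(q1)` loop, so that the cursor calls show up in the dump.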

Original comment by ed.summers on 29 Dec 2009 at 7:39

GoogleCodeExporter commented 9 years ago
For a point of reference, I tried the same query over the same data with arc2. arc2 doesn't return any results at all for your query (PHP attached). I wonder if this isn't so much a bug as a whacked query...

Original comment by ed.summers on 29 Dec 2009 at 8:29

GoogleCodeExporter commented 9 years ago
Just tried this in Jena / Ruby. I certainly get results, and much quicker than this.

Original comment by pl...@mac.com on 30 Dec 2009 at 6:33

GoogleCodeExporter commented 9 years ago
From MacTed in IRC, re whether the query is legit or not

<MacTed> E: query looks OK, based on quick test on [http://uriburner.com/sparql?default-graph-uri=&should-sponge=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0A+SELECT+%3Fsrc1+%3Fsrc2+%3Fx+%0D%0A+WHERE+{%0D%0A+GRAPH+%3Fsrc1+{+%3Fgr1+foaf%3Amember+[+foaf%3Aopenid+%3Fx+]+}%0D%0A+GRAPH+%3Fsrc2+{+%3Fgr2+foaf%3Amember+[+foaf%3Aopenid+%3Fx+]+}%0D%0A+FILTER+%28+%3Fsrc1+!%3D+%3Fsrc2+%29%0D%0A+}&format=text%2Fhtml&debug=on&timeout=|URIburner] (different data)

http://chatlogs.planetrdf.com/swig/2009-12-29.html#T20-58-23

The query seems to run (quickly) in Virtuoso if you go here:

http://uriburner.com/sparql?default-graph-uri=&should-sponge=&query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0A+SELECT+%3Fsrc1+%3Fsrc2+%3Fx+%0D%0A+WHERE+{%0D%0A+GRAPH+%3Fsrc1+{+%3Fgr1+foaf%3Amember+[+foaf%3Aopenid+%3Fx+]+}%0D%0A+GRAPH+%3Fsrc2+{+%3Fgr2+foaf%3Amember+[+foaf%3Aopenid+%3Fx+]+}%0D%0A+FILTER+%28+%3Fsrc1+!%3D+%3Fsrc2+%29%0D%0A+}&format=text%2Fhtml&debug=on&timeout=

though over different data...
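
For readers who would rather not read percent-encoding, here is a small sketch that decodes the `query` parameter of the URIburner link above with the standard library, to confirm it carries the same SPARQL as the script. The URL string is the link from this comment joined back into one piece; the variable names are mine.

```python
from urllib.parse import parse_qs, urlsplit

# The URIburner link from the comment above, joined back into one string.
url = (
    "http://uriburner.com/sparql?default-graph-uri=&should-sponge=&query="
    "PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0A+"
    "SELECT+%3Fsrc1+%3Fsrc2+%3Fx+%0D%0A+WHERE+{%0D%0A+"
    "GRAPH+%3Fsrc1+{+%3Fgr1+foaf%3Amember+[+foaf%3Aopenid+%3Fx+]+}%0D%0A+"
    "GRAPH+%3Fsrc2+{+%3Fgr2+foaf%3Amember+[+foaf%3Aopenid+%3Fx+]+}%0D%0A+"
    "FILTER+%28+%3Fsrc1+!%3D+%3Fsrc2+%29%0D%0A+}"
    "&format=text%2Fhtml&debug=on&timeout="
)

# keep_blank_values=True keeps the empty parameters (e.g. timeout=) as well.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
query = params["query"][0]
print(query)
```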

Original comment by danbrick...@gmail.com on 30 Dec 2009 at 7:36

GoogleCodeExporter commented 9 years ago
Filed this under 'problems' in F2F wiki,
http://wiki.foaf-project.org/w/F2FPlugin#Problems 

Original comment by danbrick...@gmail.com on 30 Dec 2009 at 8:41

GoogleCodeExporter commented 9 years ago
re ARC, http://twitter.com/bengee/status/7362577915 

Q: @bengee is this query ok in arc2?
A: The query is fine and used to run in ARC, too. It'll work again in the next revision.

Original comment by danbrick...@gmail.com on 4 Jan 2010 at 3:11

GoogleCodeExporter commented 9 years ago
Thanks for the SPARQL sanity :-)

I guess the next step is to check whether the query ran any faster with the old Bison-based SPARQL implementation, to see if the slowdown is the result of the switchover to PyParsing.

Original comment by ed.summers on 4 Jan 2010 at 3:20

GoogleCodeExporter commented 9 years ago
OK, so the ARC issue was fixed today: 
'''
A new ARC2 revision is now available [1]:

* ARC2 (fix): splitURI was too greedy on special namespaces 
  like xhtml or atom (thx to Michael Panzer)
* SelectQueryHandler (fix): new block parentheses-ing broke 
  regex in FILTER rewriter (thx to Dan Brickley)

Cheers,
Benji

[1] http://arc.semsol.org/download/notes
'''

Original comment by danbrick...@gmail.com on 4 Jan 2010 at 7:41

GoogleCodeExporter commented 9 years ago

Original comment by eik...@gmail.com on 1 Feb 2010 at 7:21

GoogleCodeExporter commented 9 years ago

Original comment by eik...@gmail.com on 1 Feb 2010 at 8:01

GoogleCodeExporter commented 9 years ago

Original comment by eik...@gmail.com on 11 Feb 2010 at 4:16

GoogleCodeExporter commented 9 years ago
These issues involve bits that have been moved out of rdflib proper for now. We will re-open them or move them to rdfextras as appropriate.

Original comment by eik...@gmail.com on 11 Feb 2010 at 6:06