Feature: Query builder.

ghost commented 7 years ago

Hey,

I'd be very interested to know if people have already considered creating a query builder API?

I think it would help adoption if there was a convenient way to programmatically build queries from within Python in a LINQ-like way.

Having such a DSL would not only allow for quicker generation of non-prepared queries, as the entire parsing step would be omitted; it could also provide better tool support, as IDEs like PyCharm would be capable of autocompleting and type-checking queries.

I'd imagine something in this style:


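# Proposed API sketch; v, l and r are assumed shorthand helpers
# for variables, literals and URI references.
graph = rdflib.Graph()

query = SELECT(
    v.s.AS(v.middle_aged_friend)
).WHERE(
    (v.person, FOAF.knows, v.s),
    (v.person, FOAF.age, v.age),
    FILTER(l(35) < v.age),
    FILTER(v.age < l(65)),
)

tim = r("http://www.w3.org/People/Berners-Lee/card#i")

# Bind ?person at query time.
result = graph.query(query, initBindings={'person': tim})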

This could be simplified further to the following, because queries and their components are first-class objects:


graph = rdflib.Graph()

tim = r("http://www.w3.org/People/Berners-Lee/card#i")

# Same query, with tim used directly in the patterns instead of initBindings.
query = SELECT(
    v.s.AS(v.middle_aged_friend)
).WHERE(
    (tim, FOAF.knows, v.s),
    (tim, FOAF.age, v.age),
    FILTER(l(35) < v.age),
    FILTER(v.age < l(65)),
)

result = graph.query(query)

I'd be more than happy to start working on this 😄.

Thoughts, suggestions, comments?

joernhees commented 7 years ago

I think @gromgull had something similar he once showed to me...

I think overall there have already been several attempts to write SPARQL queries in a more pythonic way, but none was really followed through to "feature complete", so I'd say contribs are always welcome...

Some thoughts:

I'd probably still use longer names (users can shorten them via import ... as ... if they really want to, but it will reduce the amount of "WTFs" for newcomers a lot) and realize this with a QueryBuilder class that supports chaining:

from rdflib import *
from rdflib.query_builder import QueryBuilder, FILTER, SUM
Var = Variable

tim = URIRef("http://www.w3.org/People/Berners-Lee/card#i")
qb = QueryBuilder().SELECT(
    middle_aged_friend=Var('s'),
    some_sum=SUM(Var('age')),
).WHERE(
    (tim, FOAF.knows, Var('s')),
    (tim, FOAF.age, Var('age')),
    FILTER(Literal(35) < Var('age')),  # this will probably be the most challenging
    FILTER(Var('age') < Literal(65)),
).LIMIT(10)

I'd recommend not using v.X but Var['X'] or Var('X'), as the . notation at some point always causes people to run into collisions with method names of Variable. I know that PyCharm will probably auto-complete the . notation though, so I'm not entirely sure what's best there...
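To make the collision concrete, here is a minimal sketch. Var and DotNamespace are hypothetical stand-ins written for this example (Var is not rdflib.Variable), and describe is just an arbitrary helper method; the point is only that any real attribute on the object wins over the dynamic variable lookup.

```python
class Var(str):
    """Stand-in for a query variable (NOT rdflib.Variable)."""

    def to_sparql(self):
        return "?" + str(self)


class DotNamespace:
    """Hypothetical `v` object: v.age is shorthand for Var('age')."""

    def __getattr__(self, name):
        # Python only calls this when normal attribute lookup fails.
        return Var(name)

    def describe(self):
        # Any real method on the namespace shadows a variable of that name.
        return "namespace for query variables"


v = DotNamespace()
age = v.age                 # falls through to __getattr__, gives Var('age')
shadowed = v.describe       # the bound method, not a variable called "describe"
explicit = Var("describe")  # the call form has no such collision

print(age.to_sparql())
print(isinstance(shadowed, Var))
print(explicit.to_sparql())
```

With the call form, every name is usable as a variable, at the cost of losing attribute-style autocompletion.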

Also, filter is a Python builtin; one could hackishly use FILTER, but it might confuse people :-/

And last but not least, implementing all FILTER operators on rdflib terms like Literal, BNode, URIRef and Variable will for sure cause some huge headaches as some of the ops are already defined (in different ways). Maybe one could go for separate implementations of the terms for the query builder, but that also has a huge potential of confusing people.

Thinking about especially the last point, the SPARQL parser, algebra and operators come to mind... maybe parts of that can be re-used...
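For illustration, the "separate implementations" idea could look roughly like the sketch below. Everything in it (Expr, Var, Lit, Compare, wrap, FILTER) is a hypothetical stand-in invented for this example and not rdflib API. The comparison operators return expression nodes instead of booleans, so the existing operators on rdflib terms stay untouched.

```python
class Expr:
    """Base class for hypothetical builder-only expression nodes."""

    def to_sparql(self):
        raise NotImplementedError

    # Comparisons build a tree node instead of returning a bool.
    def __lt__(self, other):
        return Compare("<", self, wrap(other))

    def __gt__(self, other):
        return Compare(">", self, wrap(other))


class Var(Expr):
    def __init__(self, name):
        self.name = name

    def to_sparql(self):
        return "?" + self.name


class Lit(Expr):
    def __init__(self, value):
        self.value = value

    def to_sparql(self):
        # Sketch only: integers and plain strings, no escaping or datatypes.
        if isinstance(self.value, int):
            return str(self.value)
        return '"%s"' % self.value


class Compare(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def to_sparql(self):
        return "%s %s %s" % (self.left.to_sparql(), self.op, self.right.to_sparql())


def wrap(value):
    """Turn plain Python values into literal nodes."""
    return value if isinstance(value, Expr) else Lit(value)


def FILTER(expr):
    return "FILTER(%s)" % expr.to_sparql()


lower = FILTER(Lit(35) < Var("age"))
upper = FILTER(Var("age") < 65)
print(lower)
print(upper)
```

Keeping these classes in their own module avoids redefining operators on Literal or URIRef, but it does mean users see two kinds of "terms", which is exactly the confusion risk mentioned above.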

ghost commented 7 years ago

> I'd probably still use longer names (users can shorten them via import ... as ... if they really want to, but it will reduce the amount of "WTFs" for newcomers a lot) and realize this with a QueryBuilder class that supports chaining.

Agreed on both points 😄.

> I'd recommend not using v.X but Var['X'] or Var('X'), as the . notation at some point always causes people to run into collisions with method names of Variable. I know that PyCharm will probably auto-complete the . notation though, so I'm not entirely sure what's best there...

Yeah, I took the example from telescope, but I am not a fan of such obscure metaprogramming either 😅. I'd also go with Var('X'), as it is the most straightforward.

> Also, filter is a Python builtin; one could hackishly use FILTER, but it might confuse people :-/

Yeah, but the tradeoff seems fair, also the UPPERCASE notation makes the query stand out somewhat nicely from the rest of the code.

> And last but not least, implementing all FILTER operators on rdflib terms like Literal, BNode, URIRef and Variable will for sure cause some huge headaches as some of the ops are already defined (in different ways). Maybe one could go for separate implementations of the terms for the query builder, but that also has a huge potential of confusing people.

I would probably banish everything related to the query builder into its own namespace, as this causes the least confusion and makes autocomplete a lot cleaner. The problem goes both ways: not only would operators collide with existing methods, but exposing the internal machinery to IDE autocompletion also seems like an unclean solution. Users should be able to focus on building their queries, and every suggested completion should produce at least a valid query part (of course the whole thing might still be syntactically inconsistent, but that doesn't seem solvable).

> Thinking about especially the last point, the SPARQL parser, algebra and operators come to mind... maybe parts of that can be re-used...

Yeah, I was hoping to just generate some intermediate representation between the parser and the query->algebra compiler; as of now that seems to require generating pyparsing parse trees. Any pointers to things I might have missed there would be greatly appreciated 😁

Thanks a lot for the feedback and time! 😊

ghost commented 7 years ago

OK, I thought about this a bit more. I think the best way to go is to have a QueryBuilder module with pluggable output formats; for now it would just output SPARQL strings. The next step would be an optimisation that outputs pyparsing trees directly. Finally, one could refactor the algebra module to accept a more generic intermediate representation that is shared with the query builder and is a bit cleaner than the pyparsing IR. This last change would, however, add a pyparsing-to-clean-IR translation step, which might cost some performance because the tree is walked one more time.
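Roughly, the first step could be structured like the sketch below. All names here (SelectQuery, SparqlStringSerializer, DictSerializer, build) are made up for this example and are not existing rdflib API; terms are plain strings to keep it short, and the foaf: prefix is assumed to be declared elsewhere. DictSerializer only stands in for a later back end such as parse trees or a cleaner IR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SelectQuery:
    """Format-neutral description of a SELECT query (hypothetical IR)."""
    variables: List[str]
    patterns: List[Tuple[str, str, str]] = field(default_factory=list)
    limit: Optional[int] = None


class SparqlStringSerializer:
    """First back end: render the query as SPARQL text."""

    def serialize(self, query):
        head = "SELECT " + " ".join("?" + name for name in query.variables)
        body = " ".join("%s %s %s ." % triple for triple in query.patterns)
        text = "%s WHERE { %s }" % (head, body)
        if query.limit is not None:
            text += " LIMIT %d" % query.limit
        return text


class DictSerializer:
    """Placeholder for a later back end (e.g. parse trees or a cleaner IR)."""

    def serialize(self, query):
        return {"select": query.variables, "where": query.patterns, "limit": query.limit}


class QueryBuilder:
    def __init__(self):
        self._query = None

    def SELECT(self, *variables):
        self._query = SelectQuery(list(variables))
        return self

    def WHERE(self, *patterns):
        self._query.patterns.extend(patterns)
        return self

    def LIMIT(self, count):
        self._query.limit = count
        return self

    def build(self, serializer=None):
        # The output format is chosen at the very end, not while chaining.
        serializer = serializer or SparqlStringSerializer()
        return serializer.serialize(self._query)


qb = QueryBuilder().SELECT("s").WHERE(
    ("?person", "foaf:knows", "?s"),
    ("?person", "foaf:age", "?age"),
).LIMIT(10)

sparql_text = qb.build()
as_dict = qb.build(DictSerializer())
print(sparql_text)
```

Because the builder only fills in SelectQuery, adding a new output format means adding one serializer class and nothing else.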

Thoughts?

nemanjavuk commented 7 years ago

Did you have in mind something similar to this: https://github.com/sbg/sparqb ? The team at Seven Bridges (http://sevenbridges.com) that I'm a part of has been working on this library for a while now (still in alpha stage though) that we use internally for our projects. Among some additional features, that aren't pushed to this repo but that are developed, is to use RDFLib Terms when constructing queries. Currently, we only serialize query objects to SPARQL but additional serialization options shouldn't be hard to implement. We also use RDFLib extensively so maybe we can combine our efforts?

gromgull commented 7 years ago

I use this in several projects : https://gist.github.com/gromgull/70e2d0500fbcb820dd99

it's not really complete enough to go into RDFLib yet - but could easily be

ghost commented 7 years ago

I think I would try to build something that is maybe a bit more complex but models the syntax of SPARQL more closely. It would be really nice if the user was guided by IDE autocompletion to write queries incrementally and correctly. It would be interesting to compare both existing libs in that regard.

My main use case is didactic: it's somewhat hard to get people outside of the SemWeb community to learn SPARQL, especially when it comes to embedded stuff, so making it as easy as possible would be amazing.

gromgull commented 7 years ago

My main purpose is avoiding stupid typos: since you can construct the query from RDFLib terms, you can use a ClosedNamespace, and it will flag stupid errors early.

Of course it's also nice that it's now Python code, so it indents nicely etc. And it's a bit more compact than SPARQL strings (but not much).

anubhavj99 commented 4 years ago

I have started working on this issue along with @Arshdeep25 and @saksham16085. We are developing a query builder, taking guidance from the examples provided above.

Functions added:

SELECT
WHERE
OPTIONAL
UNION
ORDER BY, GROUP BY
AGGREGATES - (functions)
FILTER - (functions), (operations)
LIMIT, OFFSET
FOR_GRAPH - setting graph for a query using
MOVE, ADD
INSERT, DELETE
NESTED QUERY (example)

Can you please help us by clarifying other expectations for the same.

FlorianLudwig commented 4 years ago

Hi @anubhavj99 ,

I was looking for exactly this - awesome! I am playing around with it and a question came to my mind: why is there a single QueryBuilder class that builds all types of queries instead of one class per query type? It seems more straightforward, both for a user of the API and from the implementation side, to have one base class and then subclasses for Select, Insert etc. It would also reduce the number of nonsensical queries you can build, like:

query = query_builder.QueryBuilder().SELECT(...).INSERT(...)
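A minimal sketch of what I mean, with made-up class names (Query, Select, Insert are not taken from the PR) and plain strings instead of RDFLib terms to keep it short:

```python
class Query:
    """Shared base: collects WHERE patterns given as ready-made strings."""

    def __init__(self):
        self._patterns = []

    def WHERE(self, *patterns):
        self._patterns.extend(patterns)
        return self

    def _where_clause(self):
        return "WHERE { %s }" % " ".join(self._patterns)


class Select(Query):
    def __init__(self, *variables):
        super().__init__()
        self._variables = variables

    def to_sparql(self):
        return "SELECT %s %s" % (" ".join(self._variables), self._where_clause())


class Insert(Query):
    def __init__(self, *triples):
        super().__init__()
        self._triples = triples

    def to_sparql(self):
        return "INSERT { %s } %s" % (" ".join(self._triples), self._where_clause())


select = Select("?s").WHERE("?s foaf:knows ?o .")
select_text = select.to_sparql()
print(select_text)

# Select simply has no INSERT method, so the nonsensical chain cannot be written.
can_chain_insert = hasattr(select, "INSERT")
print(can_chain_insert)
```

An IDE would then only offer the methods that make sense for the query type at hand.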

magdasalatka commented 4 years ago

Wow! Exactly what I was looking for.

@anubhavj99 Do you know what's the status? Your PR has been open for half a year now...

RDFLib / rdflib

Feature: Query builder. #790