TinyDB query api / Should we change the query API ?

angru commented 8 years ago

Hi, first of all, please exuce my poor English.

I was looking at TinyDB query api and I think it looks pretty nice. So i trying to reproduce their api in terms of lifter. WIP you can find here. It is Python 3.5 only yet but it is not so hard to port to older versions. It is also lazy(almost). I understand it is very different from lifter queries and not perfect in some cases(brackets in complex queries, q/p/a objects) but it will be great to know your thoughts about this. Or maybe you can find some inspration in this.

In my mind it looks something like this:

usage

import lifter
from lifter.tiny import TinyQuerySet, q, p, a, Order

from tests.fake_data import fake

manager = lifter.load(fake, queryset_class=TinyQuerySet)

Please note that q/p/a are important entities that using in different methods

q = Query()  # filter, exclude, get
p = Path()  # order_by, values, values_list
a = Aggregation()  # aggregate

filter

# Query object should bee used
manager.filter(q.age < 30)
manager.filter((q.name == 'Manny') & (q.has_beard == True))
manager.filter((q.name == 'Manny') | (q.name == 'Fran'))
manager.filter(q.name == 'Manny').filter(q.has_beard == True)

get

# will return None if value does not exists
# or fist object if multiple objects returned
manager.get((q.name == 'Fran') & (q.gender == 'female'))
manager.filter(q.has_beard == False).get(q.gender == 'male')

exclude

manager.exclude((q.gender == 'male') & (q.age > 21))

order_by

manager.order_by(p.age, Order.DECS)

count

manager.filter(q.gender == 'female').count()
manager.filter(q.gender == 'male').count()

exists

# I have some doubts about this method

first

manager.filter(q.company.name != 'blackbooks').first()

last

manager.filter((q.age > 30) & (q.age < 50)).last()

values

# will return a list of dictionaries as follow:
# [
#     {'name': 'Bernard', 'email': 'bernard@blackbooks.com', {'company.name': 'house of congress'},
#     {'name': 'Manny', 'email': 'manny@blackbooks.com', {'company.name': 'house of congress'},
# ]
manager.values(p.name, p.email, p.company.name)

values_list

# will return a list of tuples as follow:
# [
#     ('Bernard', 'bernard@blackbooks.com')
#     ('Manny', 'manny@blackbooks.com')
# ]
manager.values_list(p.name, p.company.name)

# will return a list as follow:
# ['Bernard', 'Manny']
manager.all().values_list(p.name, flat=True)

distinct

# will return ['blue', 'brown', 'green', 'purple']
manager.order_by(p.eye_color).values_list(p.eye_color, flat=True).distinct()

spanning lookups

# will filter users with a company whose name is "blackbooks"
manager.filter(q.company.name == 'blackbooks')

# return a list of all companies names, without duplicates
manager.values_list(p.company.name, flat=True).distinct()

complex lookups

# return all users older than 37
manager.filter(q.age > 37)

# exclude all users under 43
manager.exclude(q.age < 43)

# return all users between 21 and 27 years old
manager.exclude(q.age.test(lambda age: 21 <= age <= 27))

# return users with brown or green eyes
manager.filter(q.eye_color.test(lambda c: c in ['brown', 'green']))

# leave only users whose age is odd
manager.exclude(q.age.test(lambda v: v % 2 == 0))

aggregations

# return the total number of children of all users combined, like this:
# {'number_of_children': 215}
manager.aggregate(a.number_of_children(sum))

# {'avg_age': 44.26229508196721, 'children': 215}
from statistics import mean
manager.aggregate(avg_age=a.age(mean), children=a.number_of_children(sum))

# [215]
manager.aggregate(children=a.number_of_children(sum), flat=True)

So what do you think?

agateblue commented 8 years ago

First of all, thank you for your comprehensive issue. Your English is not bad at all (not that I am the best judge for that ;).

I've never heard about TinyDB before, but the query language really reminds me what is done within SQLAlchemy.

This is a different way to express queries, with its pros and cons comparing to Django's way to do it (which, by the way, I've never seen elsewhere).

I really appreciate the readability of this:

manager.filter(q.employee.age > 37)

Versus this:

manager.filter(employee__age=lifter.gt(37))

Using explicit query objects everywhere also prevent some issues. Using lifter, I don't know if's even possible to query properly an attribute that has an underscore in it like this (and I don't wanna know):

# 3 underscores. Ugly!
manager.filter(employee___private_attribute=42)

However, this is a little more verbose:

import lifter
lifter.load(data).filter(lifter.q.field == value)

Than:

import lifter
lifter.load(data).filter(field=value)

There is a decision to make here:

Should we build multiple query APIs in the core?
If not, which API do we choose?

To be honest, I don't think we should offer more than one way to make queries in lifter. Maintaining multiple ways to do the same things means we have to test everything twice, hard more issues, more work, more of everything. This is a point I'm ready to discuss, but at least you have my current point of view (it can evolve).

However, since the project is still really young and in alpha-state, it's the perfect time to make hard, backward-compatibility breaking decisions. If there is a moment when we should choose to switch from one way to do queries to another, it's now.

SQLAlchemy is also a popular ORM out there, so is TinyDB and maybe a lot of other implementions that use a similar query API.

There is no reason to stick do Django's way to do things if a technically better option is available and if more developpers have already used a similar API.

I'm glad you shared your working example here, because it raises those two interesting questions, for which I have absolutely no fixed answer.

I'd really appreciate if other people could share there thoughts about this.

Meanwhile, @angru, would you consider continue the work on this API, and especially unittesting your proposal? I just want to be sure that besides the change to the way we express queries, their effect is absolutely the same. This would mean forking lifter, replacing lifter query module with your own, and updating existing unittest accordingly.

There is no urge though, and I would totally understand if you prefer to wait until a decision is made before continuing the work on this.

Anyway thank you for your contribution!

agateblue commented 8 years ago

I tried your code and everything works as expected. The only part that you do not handle is querying against iterables (see #7 for more information). I must admit I'm really tempted to bundle the whole thing under a lifter.contrib.tiny right now and see if people adopt these over the built-in one. It would give it more visibility than just calling for feedback here.

Do you think it is a good idea ?

agateblue commented 8 years ago

Okay, @angru don't bother writing unittests for it, I've made a whole new branch using your code and updated the whole test suite to use this query API: https://github.com/EliotBerriot/lifter/tree/feature/tiny.

I had to make little changes, for example to make the queryset store its data after it has been evaluated for the first time. Tests pass on Python 3.4 and 3.5 and fail on previous versions, I need to look why.

Feel free to send pull requests to the feature/tiny branch.

agateblue commented 8 years ago

I've pushed a few commits that implements some changes to the way queries are made using the API.

In your code, almost everything was inheriting from Path such as Aggregation and Query.

I don't think it was entirely relevant since a Query is run against a Path but is not a path by itself. The same thing goes for aggregation.

I also didn't like having a single instance of Query, Path and Aggregation used everywhere (the p/a/q objects).

By thinking about these two points, I've found another way to achieve the same thing, which seems more elegant to me, and similar to what is done in SQLAlchemy:

from lifter import tiny

# We create a Model class for handling our data
User = tiny.Model('User')
manager = User.load(values)

# Filtering
manager.filter(User.name == 'Gengis')

# Ordering (I got rid of the Order.ASC and Order.DESC here), that's still one less import
manager.order_by(User.age, reverse=True)

# Aggregating
manager.aggregate((User.age, sum), (User.age, mean))

What I like with this approach is that we keep imports to a bare minimum (no need to import q, p and a objects). It's also easier to know what kind of data we are querying thanks to the model name.

If we want shorter code, we can still do:

from lifter import tiny

M = tiny.Model('User')
manager = M.load(values)

manager.filter(M.name == 'Gengis')

Having an explicitely declared model also enable some advanced feature such as default ordering or annotating:

M = tiny.Model('User', ordering='-age')
manager = M.load(values)

# Because we use an intermediate model, we can store additional data on model instances
# without touching the data we load in the manager
manager.annotate(name_length=lambda user: len(user.name)).order_by(User.name_length)

What do you think?

agateblue commented 8 years ago

(Sorry about the spam).

I've continued the work on the API, and I've decided that for now, we will keep both API available. It's not that hard to maintain both, since internally all django-like lookups are converted to the new one.

All tests pass (at least under Python3), which means that the change is fully backward compatible.

EDIT: okay, now everything pass under all versions of Python

angru commented 8 years ago

Hi, sorry about the silence, I was in three-day hike.

I was going to continue work on this issue(and unittests of course) but I didn't expect you will be so intersted and I'm especially didn't expect that you will accept these backward-compatibility breaking changes. It's great news and I agree with your corrections. Though there is some thing not fully clear to me. Do you agree with pretty ugly syntax of complex queries?

# OR query(brackets requred)
young_and_old_users = manager.filter((User.age < 14) | (User.age > 60))
# of course we can use chaining but it's not perfect too
young_and_old_users = manager.filter(User.age < 14).filter(User.age >60)

agateblue commented 8 years ago

At the moment, the change is not a backward-compatibility breaker. Internally, all keyword-queries are converted to these new, explicit queries. See http://lifter.readthedocs.org/en/latest/overview.html#supported-queries for more informations.

Regarding complex queries, I made the problem disapear for AND queries, since get, filter and exclude now accept an arbitrary number of *args and merge them using AND:

# Internaly, the engine cast this to  (User.age < 14) & (User.is_active == True)
young_and_active_users = manager.filter(User.age < 14, User.is_active == True)

For OR queries though, the brackets will remains until we find a more elegant solution. Be careful, your example that use chaining will return no user, since manager.filter(User.age < 14).filter(User.age >60) means "Filter use who are under 14 AND over 69).

Also, the fact is, using the keyword engine, there was no way to perform these OR queries, and doing these kind of things using django's ORM would result in pretty much the same thing: Q(field=value1) | Q(field=value2).

I hope you had a good hike! I'm sorry I didn't wait for your answer, but I had time to work on this, and since it was blocking pretty much anything else, I decided it was a top priority.

At least now, we have a fully working engine we can start building more complex and efficient queries on.

agateblue / lifter