SearchFilter too optimistic `.distinct()` causes performance issues.

maciej-gol commented 8 years ago

I've been recently debugging weird slowdowns in my app, and I've found that when using SearchFilter the queryset gets always duplicated filtered out (that's a change from 3.1.x). That's cool, but I think this is too optimistic, and it should be the users that should apply that particular filter.

Why? The PR #2535 has fixed duplicate results of M2M fields, but it doesn't check if any of these fields are used! This results in SELECT DISTINCT queries in situations when the user knows the queryset can't produce duplicates, and this leads to huge performance loss when paginating - there are two queries ran:

When getting the total objects count from the queryset

SELECT COUNT('*') FROM (SELECT DISTINCT ...)

When obtaining a page

SELECT DISTINCT ... LIMIT 10

What these queries do (on PostgreSQL) is getting all items, removing duplicates and then applying either COUNT or LIMIT. If the data we are working with is huge, this leads to deduplicating lots of data leading to huge performance loss.

I think that the best solution would be for users to manually apply such filter when needed, or detect whether any M2M field is being filtered on (and mention about it in the docs as, for me, this is pretty important that SearchFilter deduplicates the queryset).

tomchristie commented 8 years ago

We're not going to win here - either way around we'll have issues filed. Either:

"It's incorrect"

Or

"It's unnecessarily slow"

On balance I think staying with the current behavior is going to be the right trade off for the majority of users. Anyone bumping into performance issues can drop down to more explicit filtering or similar.

maciej-gol commented 8 years ago

"It's incorrect"

I'd argue whether this is incorrect behaviour or lack of knowledge of relational algebra and how RDBMS work. I'd rather have people encounter this issue, learn the lesson with duplicates when joining via M2M fields, and save further issues originating from the very same fact, but in different places.

Anyone bumping into performance issues can drop down to more explicit filtering or similar.

I think that it'd be way easier for users that experience duplicates to just throw additional .distinct() at their queries as opposed to users that don't need that distinct having to create a whole new SearchFilter that's basically a copy-paste of the original one, but one line (it's impossible to easily disable the distinct filter).

I dare to think that there are actually less users filtering via M2M fields than those that suffer from unnecessarily slower queries. Anyway, can we atleast have it mentioned in the docs that SearchResult removes duplicates thus can perform sub-optimal? Or even add a second search filter that does not apply the distinct filter so that when users go through the docs they are explicitly told that if they don't need duplication, they get huge performance gains by disabling deduplication?

tomchristie commented 8 years ago

I dare to think that there are actually less users filtering via M2M fields than those that suffer from unnecessarily slower queries.

Agree. But the effects of on are worse than the effects of the other. I'd be happy to consider simple enhancements to code or docs, but we've eliminated a clear behavioral bug for users here so I don't think the basic behavior should vary from that.

maciej-gol commented 8 years ago

What about the solution with a second search filter? This should make people that are not familiar with this issue stop for a while and think about it the reasons people have added two filters that do basically the same but a distinct filter. This really saddens me that there is no way to disable the filter explicitly and have people suffer about O(N logN) slow down.

carltongibson commented 8 years ago

This comes up on Django Filter all the time.

I've basically come round to allowing an attribute on the Filter that says whether to call distinct in filter or not. (This defaults to true because correct is strictly greater than fast.)

It's not much of an addition... — but half of me wonders whether anyone doing search with the ORM can be that worried about speed... — i.e. I'm half inclined to say it's not DRF's business working around this.

carltongibson commented 8 years ago

A simple (genuine) search option uses Django Watson wrapped with a Django Filter MethodFilter.

maciej-gol commented 8 years ago

It's not much of an addition... — but half of me wonders whether anyone doing search with the ORM can be that worried about speed... — i.e. I'm half inclined to say it's not DRF's business working around this.

Because when you've got a base result set of 300 thousands rows and searching on it is blazing fast, yet deduplicating 300 thousands rows (of no duplicates) with 10 columns raises the query time from 10ms to 4 seconds and in this case the deduplication brings nothing good.

carltongibson commented 8 years ago

@maciej-gol Sure.

First thing is to add a non deduping SearchFilter to your own project. (Initial problem solved).

Next look at Django Filter and see how the distinct attribute of Filter controls the distinct call on the QuerySet. (Or guess, as I say, "it's not much of an addition".)

Finally open a PR doing that here, with tests and a note in the docs.

maciej-gol commented 8 years ago

If so, then I think that OrderingFilter has to be fixed too, it doesn't work as expected either.

carltongibson commented 8 years ago

That makes sense.

encode / django-rest-framework

SearchFilter too optimistic `.distinct()` causes performance issues. #3480