[Performance] DISTINCT with a LIMIT

forcedotcom / phoenix

BSD 3-Clause "New" or "Revised" License

558 stars 227 forks source link

[Performance] DISTINCT with a LIMIT #679

Open haridsv opened 10 years ago

haridsv commented 10 years ago

I noticed that a DISTINCT query (without an ORDER BY) takes the same amount of time whether there is a LIMIT or not. If we push the LIMIT clause to the server instead of handling it on the client side, I would think we could gain a big performance improvement. This is probably the case with GROUP BY as well (untested).

Even when there is an ORDER BY, there could still be a slight benefit in pushing the LIMIT down, as long as the ORDER BY itself is also pushed down.

jtaylor-sfdc commented 10 years ago

Yes, same issue with GROUP BY, as DISTINCT is implemented by doing a GROUP BY.

You can't push the LIMIT down when there's an ORDER BY, though. The ORDER BY is pushed down, but a merge happens on the client, as that's where we get the results back from all the regions.

haridsv commented 10 years ago

When ORDER BY is used, won't there be a slight benefit in network data and the subsequent merge performance, if the LIMIT is pushed down?

jtaylor-sfdc commented 10 years ago

Good question, @haridsv. In the case of ORDER BY with GROUP BY, the GROUP BY is done first, per parallel scan (a region's worth of data at most) to prune down the data size. Then a sort in this case is done on the client (again, per parallel scan), followed by a merge sort on the now sorted data. The sort is done on the client in this case to prevent a corner case in which a split occurs while the scan is running. In that case, if we sorted on the server, we could get duplicate, out of order rows. Sorting on the client prevents that possibility.