cznic/ql

github.com/cznic/ql has moved to modernc.org/ql
https://godoc.org/modernc.org/ql
BSD 3-Clause "New" or "Revised" License

Doing a query for a range of a float64 field is unacceptably slow #70

Open twitchyliquid64 opened 10 years ago

twitchyliquid64 commented 10 years ago

This is my attempt at storing and using OpenStreetMap data for my city, by the way. There are over a million nodes. I want the following query to be as fast as possible (<150 ms when the db is in RAM) because there are other queries necessary to render a tile.

If I have a table:

CREATE TABLE IF NOT EXISTS Nodes (NodeID int64, Lat float64, Lon float64, Changeset string);
CREATE INDEX IF NOT EXISTS NodesID ON Nodes (NodeID);
CREATE INDEX IF NOT EXISTS NodesLat ON Nodes (Lat);
CREATE INDEX IF NOT EXISTS NodesLon ON Nodes (Lon);

And I run the query:

SELECT * FROM Nodes WHERE Lon > 151 && Lon < 151.00001

(I will add a Latitude range conditional too when I get this working well).

It takes over a minute to complete, only returning 40 or so rows (out of 1 million). My guess is that ql is generating two sets of data for each WHERE clause and trying to union them, which is taking a while.

Is there a better way I can write my query? Or perhaps I am missing an index of some kind? Or maybe something can be done in the backend to speed it up for this kind of thing (like a RANGE keyword/operator)?

Thanks, Tom

cznic commented 10 years ago

The query optimizer doesn't yet step in for this case so the query makes a full table scan. Please try (untested):

SELECT * FROM
        (SELECT * FROM Nodes WHERE Lon > 151)
WHERE Lon < 151.0001

This may help, but only a little.

I'd suggest instead doing (untested):

rs, _, err := db.Run(nil, "SELECT * FROM Nodes WHERE Lon > 151")
if err != nil {
        ...
}

if err := rs[0].Do(false, func(data []interface{}) (bool, error) {
        if data[LonColNum].(float64) >= 151.0001 {
                return false, nil // stop: past the upper bound
        }

        // process data, Lon is in (151, 151.0001)
        return true, nil
}); err != nil {
        ...
}

Using indexes, this should materialize only the rows of the interval and I expect it to be much faster than the inner-select hack, as the outer select will still have to do a linear scan.

TODO(optimizer)

twitchyliquid64 commented 10 years ago

The problem with doing the last code snippet is that in the average case it generates a temporary result set of 500K nodes (half the db), which still has to be iterated through.

C:\coding\projects\go-osm-parse>ql.exe -db="osmdump.db" "SELECT count() FROM Nodes WHERE Lon > 151"
626562

Just counting the rows returned from the first part of that query took over 30 seconds. That's to say nothing of the Do().

cznic commented 10 years ago

The select count() will produce that many rows, but the code snippet will not. Try it ;-)

I promise it will create exactly those rows which have Lon in (151, 151.0001), assuming Lon has an existing index.

The first rs, _, err := db.Run() produces zero rows. Yes, creating a recordset does not produce any rows. Rows are generated lazily only when asked for, one by one using the .Do method of the recordset. The count aggregate function must count all rows, so in that case every row WHERE Lon > 151 will be produced.

But in the discussed code, the closure passed to the .Do method returns false as soon as Lon >= 151.0001. As the physical index is ascending, no more rows are ever read from the DB, only those in the requested interval.
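As a concrete illustration, counting only the interval's rows can reuse the same early-exit pattern (a sketch building on the snippet above; rs and LonColNum are as in that snippet):

n := 0
if err := rs[0].Do(false, func(data []interface{}) (bool, error) {
        if data[LonColNum].(float64) >= 151.0001 {
                return false, nil // ascending index: no later row can match
        }
        n++ // count rows with Lon in (151, 151.0001)
        return true, nil
}); err != nil {
        ...
}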

twitchyliquid64 commented 10 years ago

Impressive :)

Is there a way to use Do() from the sql/driver? I'm using that ATM as a compatibility layer.

cznic commented 10 years ago

Unfortunately no, there's no way to do this using database/sql unless the query optimizer detects the ranging WHERE (or BETWEEN) and then does it for you automagically.
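For reference, a minimal sketch of what the same range query looks like through the compatibility layer (assuming the ql driver subpackage is imported; until the optimizer learns the ranging WHERE pattern, this still degenerates to a scan):

package main

import (
        "database/sql"
        "fmt"

        _ "github.com/cznic/ql/driver" // registers the "ql" driver name
)

func main() {
        db, err := sql.Open("ql", "osmdump.db")
        if err != nil {
                panic(err)
        }
        defer db.Close()

        // The range predicate is written normally; there is no way to
        // reach Recordset.Do through database/sql.
        rows, err := db.Query(
                "SELECT NodeID, Lat, Lon FROM Nodes WHERE Lon > $1 && Lon < $2",
                151.0, 151.00001)
        if err != nil {
                panic(err)
        }
        defer rows.Close()

        for rows.Next() {
                var id int64
                var lat, lon float64
                if err := rows.Scan(&id, &lat, &lon); err != nil {
                        panic(err)
                }
                fmt.Println(id, lat, lon)
        }
}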

twitchyliquid64 commented 10 years ago

Dayum. I guess I'll either have to wait for the optimizer or use the normal interface to ql for the server application.

Can you recommend any sneaky code for doing the interval select across two fields?

i.e.: SELECT * FROM Nodes WHERE Lon > 151 && Lon < 151.00001 && Lat < 33 && Lat > 32.8

cznic commented 10 years ago

I have no good idea how to do this as efficiently as the correct use of the indexes can. It really needs the query optimizer to do this work automatically. If you can afford to wait a few days or a week or so, then maybe I can add this feature in that time. But I have virtually no free time at the moment.

twitchyliquid64 commented 10 years ago

I have no problem waiting :) Thanks

twitchyliquid64 commented 10 years ago

Hey man, I was going to try and tackle this as my uni break just started. But then I tried to have a look through your code and was like welp haha

Perhaps you can give me some pointers?

cznic commented 10 years ago

Well, where to start from?

Perhaps please tell me which specific place/functionality/... you want a pointer to and I'll attempt to give you some ;-)

twitchyliquid64 commented 10 years ago

Okies, well perhaps a good place to start would be detecting the latitude < 4, latitude > 3 kinda condition so it can be optimized.

From the looks of things in stmt.go, all the SQL is compiled into an AST. So what I was thinking of doing was adding more code to selectstmt.exec0 to look into each WHERE clause and detect when someone is doing a > or < on the same field twice?

What do you think?

cznic commented 10 years ago

From the looks of things in stmt.go, all the SQL is compiled into an AST. So what I was thinking of doing was adding more code to selectstmt.exec0 to look into each WHERE clause and detect when someone is doing a > or < on the same field twice?

That place handles a certain cross-join optimization (SELECT ... FROM A, B, ... WHERE) when the number of tables in the cross join matches the number of terms of the WHERE clause, joined by &&, and every field used belongs to a different table and all of them are indexed. A very special case, not much like what I think you're after.

BETWEEN is parsed here and NOT BETWEEN here. In those places there's a chance to rewrite the predicate using relational operators. That's better, as then the optimizer will look only at the relational operators and doesn't have to care about BETWEEN as a special/additional case.

Actually BETWEEN is rewritten, but only at evaluation time. In essence, that code can, suitably modified, be executed at parse time.
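A sketch of what such a parse-time constructor might look like (hypothetical: the name newBetween and the token constants follow this discussion and the newBinaryOperation helper mentioned later in the thread, not the actual ql sources):

// Hypothetical: rewrite "lhs BETWEEN lo AND hi" at parse time into
// "lhs >= lo && lhs <= hi" (and the NOT variant into "lhs < lo || lhs > hi"),
// so the optimizer only ever has to deal with plain relational operators.
func newBetween(lhs, lo, hi expression, not bool) (expression, error) {
        if not {
                l, err := newBinaryOperation('<', lhs, lo)
                if err != nil {
                        return nil, err
                }
                h, err := newBinaryOperation('>', lhs, hi)
                if err != nil {
                        return nil, err
                }
                return newBinaryOperation(oror, l, h) // '||' token
        }
        l, err := newBinaryOperation(ge, lhs, lo) // '>=' token
        if err != nil {
                return nil, err
        }
        h, err := newBinaryOperation(le, lhs, hi) // '<=' token
        if err != nil {
                return nil, err
        }
        return newBinaryOperation(andand, l, h) // '&&' token
}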

Top level "optimizer" is here. That's the place where multiple relations on the same indexed field should probably be detected and dispatched to an optimize-attempt function (something like this). Currently the optimizer looks only for single columns with indices and a single relational operation on that (indexed) column. The addition would be in

I think the above might get you started. If not, please feel free to suggest/ask more.

twitchyliquid64 commented 10 years ago

Awesome, that should be enough for me to play around and start looking at how things are operating. I'll fork and start playing when I wake up in the morning (today!)

What do you think of rewriting the other way round - changing an expression like long > 5 && long < 6 into a BETWEEN operation? I think that might be easier to do, because then all I need to do is make the BETWEEN code in ql.go use indices.

twitchyliquid64 commented 10 years ago

Ah yes, BETWEEN is rewritten to < and >, that's why.

cznic commented 10 years ago

Ah yes, BETWEEN is rewritten to < and >, that's why.

Specifically, BETWEEN is evaluated that way.

twitchyliquid64 commented 10 years ago

What is lhs? Is it the table/column name? And where is the mysterious expand1 function?

What do you keep in ctx and arg?

cznic commented 10 years ago

expand1 joins blob chunks into a full, non-chunked runtime instance. Applies to values of type string, blob ([]byte) and bigint, bigrat, ... if they are above a certain size (IIRC 256 bytes). Basically, expand1 will check whether the data argument is a ql.chunk (that's what a file table row holds) and, if so, it loads and returns the true type instance. Memory DBs do not do that (they do not store chunks of instances, only full instances for all ql types).

lhs stands for whatever foo is in WHERE foo BETWEEN expr1 AND expr2. In the optimizing case lhs must be a field name.

The current pBetween.eval() is poor. There's a missing func newBetween, which should be called at parse time and which should do the rewrites. At evaluation time it's too late. The top level "optimizer" checks the possibilities before evaluation.

Evaluation ctx (context) maps names to values. A name is a field name and the value is the value of that field for the currently evaluated row. Aggregating functions use the context for keeping cumulative computed values across rows.

arg is a slice of ql parameters, i.e. what was passed to db.Run or db.Exec in place of $1, $2, ...
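For example (using the db.Run shape already shown above; $1 and $2 arrive in the evaluation machinery as arg[0] and arg[1]):

rs, _, err := db.Run(nil,
        "SELECT * FROM Nodes WHERE Lon > $1 && Lon < $2",
        151.0, 151.00001) // arg[0] == 151.0, arg[1] == 151.00001
if err != nil {
        ...
}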

twitchyliquid64 commented 10 years ago

Okay, starting to get my head around it.

Are the temporary result sets being returned via those passed-in callback parameters?

Plan is to copy-paste the (r *whereRset) tryBinOp() code and get it working for BETWEENs. As the index is ordered, I should be able to start the loop at the lower value and abort when it reaches the higher value?

cznic commented 10 years ago

Are the temporary result sets being returned via those passed-in callback parameters?

The rows produced are returned by the callback (f here: https://github.com/cznic/ql/blob/b2e4645311b0713db07658ffa7a587e707664656/ql.go#L87). Also, field names are returned through the same mechanism.

I'll now create the newBetween to normalize a BETWEEN b AND c into a >= b && a <= c. I think I'll have it ready in an hour or two.

Plan is to copy-paste the (r *whereRset) tryBinOp() code and get it working for BETWEENs.

Maybe the new code will go inside this fn. I cannot quickly find where it's checked that the bin op is a simple 'field op expr'. There we need to divert for the case 'field op1 expr1 && field op2 expr2', check that the field has an index and that op1 is in {>, >=} and op2 is in {<, <=} (or the other way around).

twitchyliquid64 commented 10 years ago

Awesome, I'll start playing around with the code in that fn so I understand it better. Any idea what tryBinOpID() does vs tryBinOp()? My guess is it has something to do with id().

cznic commented 10 years ago

Awesome, I'll start playing around with the code in that fn so I understand it better. Any idea what tryBinOpID() does vs tryBinOp()? My guess is it has something to do with id().

Correct. tryBinOpID looks for a special case: joins on id() - in contrast to a field name, which is handled by tryBinOp.

twitchyliquid64 commented 10 years ago

Okay, so to start off I'm going to work on the test case:

CREATE TABLE test (lat int64, long int64);
CREATE INDEX testLat ON test (lat);
CREATE INDEX testLong ON test (long);

When the query SELECT * FROM test WHERE lat > 0 (i.e. an existing optimized case) is called, the call stack is

whereRset.do() -> whereRset.tryUseIndex() -> tryBinOp() before returning to do() and returning from there.

Whereas when the query SELECT * FROM test WHERE lat > 0 && lat < 1 (the optimization we are trying to implement) is called, the call stack is

whereRset.do() -> tryUseIndex() (returns err) and back to do(), where a linear scan is done.

So now I am going to look at detecting the case we need to optimize in tryUseIndex().

cznic commented 10 years ago

Yep. tryUseIndex sees the binary op && and does not (yet) know what to do with it. We need to teach it to look for a bin op in {&&, ||} where both operands are expressions of the form field op2 expr or expr op2 field, op2 is in {>, >=, <, <=}, the field names are the same, that field has an index, and the cross join is a single table.
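A hypothetical shape of that check (illustrative only; asFieldRel and hasIndex are invented helpers standing in for ql's real AST inspection, and the token constants are the same hypothetical names as in the newBetween sketch above):

// Sketch: decide whether "lhs OP rhs" with OP in {&&, ||} is a range
// over a single indexed field.
func isIndexableRange(op int, lhs, rhs expression) bool {
        f1, op1, ok1 := asFieldRel(lhs) // matches "field op expr" or "expr op field"
        f2, op2, ok2 := asFieldRel(rhs)
        if !ok1 || !ok2 {
                return false
        }
        if f1 != f2 || !hasIndex(f1) {
                return false
        }
        rel := func(o int) bool { return o == '<' || o == le || o == '>' || o == ge }
        return (op == andand || op == oror) && rel(op1) && rel(op2)
}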

BTW: In your example, tryUseIndex should not return err != nil, AFAICT.

cznic commented 10 years ago

BETWEEN rewriting is ready, please rebase your fork.

twitchyliquid64 commented 10 years ago

Ah yes, you are right.

if ok, err := r.tryUseIndex(ctx, f); ok || err != nil {

I just spent the last half hour trying to come up with a recursive function to scan the whole parse tree and rearrange and detect all the things; my train of thought has died, so I think that's a bit too ambitious. I'm going to try a spaghetti of switches now.

Fork rebased.

twitchyliquid64 commented 10 years ago

What does expr.isStatic() do? If it does what I think it does, it might be a handy shortcut for me.

cznic commented 10 years ago

isStatic returns whether its argument, a ql.expression, can be evaluated statically, i.e. at parse time. Another helper is staticExpr. It calls e.isStatic and, if that returns true, it evaluates e, returning a flat ql.value instead.
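From that description, staticExpr presumably behaves roughly like this (a sketch only; the value{val: ...} shape follows the later remark in this thread that a value-only expression is of type value with field val):

// Sketch: fold an expression to a constant at parse time when possible.
func staticExpr(e expression) (expression, error) {
        if !e.isStatic() {
                return e, nil // depends on a row or on $n args; leave as-is
        }
        v, err := e.eval(nil, nil) // no row context, no parameters needed
        if err != nil {
                return nil, err
        }
        return value{val: v}, nil
}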

cznic commented 10 years ago

Added some comments on your commit.

twitchyliquid64 commented 10 years ago

Awesome, that simplifies things.

I think I have written code that detects the case in my latest commit. Please have a look.

operand1 is the minimum number, operand2 is the max. op1 and op2 are stored so we know whether it is ge/le or >/<.

cznic commented 10 years ago

I think I have written code that detects the case in my latest commit. Please have a look.

Nice work. Small issue found:

jnml@4670:~$ rm -f ql.db
jnml@4670:~$ ql 'create table t (a string, i int, z string)'
jnml@4670:~$ ql 'select * from t where i > 0 && i < 9'
jnml@4670:~$ ql 'create index ti on t (i)'
jnml@4670:~$ ql 'select * from t where i > 0 && i < 9'
i > 0
i < 9
jnml@4670:~$ ql 'select * from t where i < 0 && i > 9' // <- should not qualify
i > 9
i < 0
jnml@4670:~$ ql 'select * from t where i < 0 || i > 9' // <- should qualify
jnml@4670:~$

cznic commented 10 years ago

It's more complicated than I wrote above. select * from t where i < 0 && i > 9 should qualify* at this stage. In the next stage the concrete values (0 and 9) should/could be considered and statically empty intervals should return empty record sets. Alternatively, it could be left unhandled and the index filtering will produce nothing. And the first seek is only O(log n).

*: Because it's equal to `select * from t where i > 9 && i < 0`.
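To summarize the cases from the transcript above (which shapes the detection should accept, and what the index filtering then yields):

i > 0 && i < 9   -- interval (0, 9): one ascending index range scan
i < 0 && i > 9   -- statically empty; qualifies anyway, the scan just yields no rows
i < 0 || i > 9   -- two disjoint ranges: both halves can be served by the index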

twitchyliquid64 commented 10 years ago

Not to worry, I've almost finished writing that conditional. How do I compare one ql.value to see if it is larger than the other?

Dodgy, but:

ex := &binaryOperation{'<', min, max}
oath, err := ex.eval(nil, nil)
if err != nil {
        return false, err
}
if !oath.(bool) {
        fmt.Println("Cannot opt - Operations are not for a range")
        return false, nil
}

cznic commented 10 years ago

collate1 does that. Note that it panics when it cannot collate because of incompatible types. Before calling collate1 we need to coerce the static value to the proper type of the other side and check whether, after the coercing, both operands have the same type. Use coerce1. Then we must check if the operation is defined on that type. That's easy: call the original expression's eval method, with the newly coerced literals, and look for err != nil. Normally binaryOperation.eval does all that (coercing, type checking, checking the op is defined) at runtime for every row. Here it is enough to do it once before performing the real index-aided filtering.

An expression consisting of a value only is of type value and the value field is val. These fields will be handed to coerce1 and collate1.

cznic commented 10 years ago

Ehm, collate1 seems to do coercing for us already.

twitchyliquid64 commented 10 years ago

I haven't used collate1, but have a look at my latest commit. Moar bugfixes and features. Haven't tested parameters but they should work.

cznic commented 10 years ago

Actually, the easiest way to collate two values is probably like this:

// op is e.g. '<', ge, ...
x, err := newBinaryOperation(op, val1, val2) // does ALL necessary static checks
if err != nil {
        // fail
}

v, err := x.eval(ctx, arg) // does ALL necessary dynamic checks
if err != nil {
        // fail
}

if v.val.(bool) {
        // val1 op val2 == true
}

Normally it would be too costly, but here we are doing it only once or twice as a preparation step.

I haven't used collate1, but have a look at my latest commit. Moar bugfixes and features. Haven't tested parameters but they should work.

Will do.

twitchyliquid64 commented 10 years ago

Passing nil, nil here means the eval method cannot evaluate a field name (table column), nor can it access $n params. I suggest giving up on statically rejecting the range (my fault, I suggested that before). Let's just index-filter the range. Type checks will fail as discovered. Impossible limits will produce nothing. And that's the right answer ;-)

If I pass nil, ctx.arg, will that fix it, or am I missing something?

cznic commented 10 years ago

If I pass nil, ctx.arg, will that fix it, or am I missing something?

Yes, that should be okay.

Otherwise I think I've finished reviewing your last commit, please see the inline comments. IMO you're getting pretty close to the "real" thing!

twitchyliquid64 commented 10 years ago

Awesome.

I'm going to get started on the method which will do the heavy lifting. Any preferences for how things are done? I was just going to follow the design of tryBinOpID() but without the ID crap.

cznic commented 10 years ago

I'm going to get started on the method which will do the heavy lifting. Any preferences for how things are done?

I suggest getting inspiration from tryBinOp instead. The ID version is derived from it, and the ID version handles only one integer type and does not handle nulls (no row id() can be NULL).

twitchyliquid64 commented 10 years ago

Hectic. I'm gonna go to bed and hopefully get on to it tomorrow. Feel free to change whatever you want in the meantime.

twitchyliquid64 commented 10 years ago

I was going to work on this, except I met my ex randomly and we got along way too well and ended up out for dinner.

TL;DR my brain is frazzled, can't work on this tonight :/ sorry

cznic commented 10 years ago

No need to say sorry, you're your only boss in charge of this task ;-)