antonmks / Alenka

GPU database engine
Other
1.17k stars 120 forks source link

Faster joins #38

Closed antonmks closed 11 years ago

antonmks commented 11 years ago

I added an option to sort database segments locally on a join key. The best performance can be reached by sorting the data file on a filter key and then sorting segments locally on a join key. This way the expensive sorting of the datasets can be avoided in join operations. So, for example, the line STORE A INTO 'lineitem' BINARY; now becomes STORE A INTO 'lineitem' BINARY SORT SEGMENTS BY orderkey;

Running time of Q3 changed from 15 to 7 seconds, Q5 from 68 to 17 seconds.

AlexeyAB commented 11 years ago

Good news! It sorts the data in every segment seperately, and can be used an additional to "sort -t "|" -k11 lineitem.tbl > lineitems.tbl" to sort all data by field "shipdate" which reduces time for FILTER?

And by using thrust::is_sorted() before join we can know, do we need to sort data or no, is it correct?

antonmks commented 11 years ago

Yes. It creates a file *.sort where it stores the indexes of sorted columns. So when doing a join it checks if a segment is already sorted by looking into the file. Using thrust::is_sorted probably would be more expensive, it would have to check all data in a column.

Anton

Подольск ?

On Thu, Jul 4, 2013 at 3:44 PM, AlexeyAB notifications@github.com wrote:

Good news! It sorts the data in every segment seperately, and can be used an additional to "sort -t "|" -k11 lineitem.tbl > lineitems.tbl" to sort all data by field "shipdate" which reduces time for FILTER?

And by using thrust::is_sorted() before join we can know, do we need to sort data or no, is it correct?

— Reply to this email directly or view it on GitHubhttps://github.com/antonmks/Alenka/issues/38#issuecomment-20475062 .

AlexeyAB commented 11 years ago

Stored in the "*.sort-file" indexes, which uses with "thrust::gather/scatter" to produce sorted segment?

Yes, and you, Moscow?

antonmks commented 11 years ago

The sort file simply has one or more numbers like 0 - meaning that a segment is already presorted on first field. 03 - presorted on first and third fields.

Yes.

AlexeyAB commented 11 years ago

.sort is a metadata-file. Why not to use bitmask, an example 5(101) mean that on 1st and 3d fields is already sorted?

А как догадались, что с Подольска? :)

antonmks commented 11 years ago

Yes, it is possible to use a bitmask, although the effect on performance would be nil. It is just a list of sorted fields.

Посмотрел Google analytics, там есть города посетителей. А почему Подольск ? С такими интересами в Сан Франциско было бы интереснее и лучше :-)

AlexeyAB commented 11 years ago

I can't read it :)

antonmks commented 11 years ago

Updated.

AlexeyAB commented 11 years ago

Вариант. Но не так хорош разговорный английский, и сейчас не так далеко от Москвы, пока смотрю найдется ли здесь достойное применение. А сами не думали о Сан Франциско? :)

antonmks commented 11 years ago

Английский - ерунда, учится быстро. А мне никак по семейным обстоятельствам :-(