antonmks / Alenka

GPU database engine

Should there be a short description of the architecture and implementation? #31

Closed AlexeyAB closed 11 years ago

AlexeyAB commented 11 years ago

Hi Anton! Could you add more comments in the code for the major functions?

Do you think that, for other developers to work on the project, a short description of the architecture is needed? Or could you cover the following topics here:

  1. What are the structure and characteristics of the Alenka store, and how does it differ from classic columnar storage? References describing data storage: http://dbmsmusings.blogspot.ru/2009/09/tour-through-hybrid-columnrow-oriented.html Presentation by Daniel Abadi: http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
  2. Which libraries and algorithms are used for each of these operations:
    • SORT for numbers/strings: thrust::stable_sort_by_key() ...
    • JOIN for numbers: mgpu::RelationalJoin<MgpuJoinKindInner / MgpuJoinKindLeft / MgpuJoinKindRight / MgpuJoinKindOuter>() ...
    • JOIN for strings: thrust::gather_if() ...
    • FILTER for numbers/strings: thrust::copy_if() ...
    • GROUP BY for numbers/strings: ...
  3. How are collisions resolved when using MurmurHash2_64?
  4. What are the features of Alenka compression, and how does it differ from classic implementations of PFOR, DELTA-PFOR and Dictionary compression?

Maybe create a wiki in the future, or describe it here for now?

antonmks commented 11 years ago

Yes, I can comment the major functions, and I'll write a short description of the algorithms used for the major operations. By the way, the join for strings is not done by the gather_if function - strings are hashed and then matched by the CUDPP or mgpu libraries; gather_if is needed to gather the results from other columns. Also, do you still need writing the results to binary files and NULL processing?

Anton

AlexeyAB commented 11 years ago

Yes, I still need both. NULL handling is important for getting correct results, and writing the results to binary files is needed for good performance.

And how are hash collisions resolved when hashing with MurmurHash2_64 and then matching with the CUDPP or mgpu libraries?

Alexey

antonmks commented 11 years ago

Well, about hash collisions - I hash each of the groupby columns and then hash the resulting hashes. It seems to work well.
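Roughly like this (an illustrative sketch, not the exact code; the canonical MurmurHash2 64-bit signature is assumed):

```cpp
#include <cstdint>
#include <vector>

// assumed available in the project (canonical MurmurHash2 64-bit signature)
uint64_t MurmurHash64A(const void* key, int len, unsigned int seed);

// each groupby column is hashed per row elsewhere; the per-column hashes of
// one row are then combined into a single 64-bit group key by hashing again
uint64_t combined_group_hash(const std::vector<uint64_t>& column_hashes)
{
    return MurmurHash64A(column_hashes.data(),
                         (int)(column_hashes.size() * sizeof(uint64_t)), 0);
}
```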

Anton

AlexeyAB commented 11 years ago

OK, it's not a priority, since this is likely to work correctly with high probability. But in general, in the future, rows with equal hashes will have to be checked for equality of the values themselves.

Different hashes say with 100% certainty that the values were different, but equal hashes say the values are equal only with some probability.

Alexey

antonmks commented 11 years ago

I implemented writing the results to binary files.

Regards, Anton

AlexeyAB commented 11 years ago

Thanks!

Regards, Alexey

AlexeyAB commented 11 years ago

A little question about the hash join: matching and gathering execute on the GPU, so why does the hashing execute on the CPU? It should be ideal for parallel execution on the GPU, using a functor that calls MurmurHash64A() with thrust::transform.
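Something like this rough, untested sketch (it assumes MurmurHash64A is compiled as a __host__ __device__ function and that the keys are fixed-width):

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// assumed available in the project, built for host and device
__host__ __device__ unsigned long long MurmurHash64A(const void* key, int len, unsigned int seed);

struct murmur_hasher
{
    __host__ __device__
    unsigned long long operator()(long long key) const
    {
        return MurmurHash64A(&key, sizeof(key), 0);  // 0 is an arbitrary seed
    }
};

void hash_on_gpu(const thrust::device_vector<long long>& keys,
                 thrust::device_vector<unsigned long long>& hashes)
{
    // one hash per key, entirely on the GPU
    thrust::transform(keys.begin(), keys.end(), hashes.begin(), murmur_hasher());
}
```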

Regards, Alexey

antonmks commented 11 years ago

Only the hashing of strings is executed on the CPU in a join operation. Basically this is because it is faster than transferring long strings to the GPU and hashing them there. I'll try to do some benchmarking and see if it makes sense to do the hashing on the GPU.

P.S. I did the benchmarking; it seems like both the host and device versions of MurmurHash are equally very fast, so there won't be much of a speed-up from moving the hash function to the GPU. It would definitely make sense to do it if the groupby result hashes could fit in GPU memory (i.e. number_of_hashes*8bytes < gpu_memory). If your project fits this requirement then let me know, and I will try to implement this feature.

AlexeyAB commented 11 years ago

I.e., right now groupby and join use hashing on the CPU:

  • for groupby, an unordered_map is used on the CPU, and all processing of the aggregation functions happens on the CPU
  • for join, MurmurHash is used on the CPU, and then the sort and sort-merge join run on the GPU using the mgpu library. Is this true for all types: strings and decimal/integers?

And what is used for a JOIN on strings and on decimal/integers? Do both use a sort and sort-merge join (from mgpu) instead of CUDPP?

antonmks commented 11 years ago

For groupby, the current segment is aggregated on the GPU and then the results are merged with the previous results on the CPU using MurmurHash (if the data consists of just 1 segment then hashing is not used). For a join on integer/decimal keys we use a sort-merge or hash join (both on the GPU). If the join keys are strings, then the strings are hashed on the CPU using MurmurHash and the resulting hashes are joined on the GPU.

The default join type is sort/merge. If the join fits the star join conditions then we use a hash join (see Abadi, Column-Stores vs. Row-Stores).

Regards, Anton


AlexeyAB commented 11 years ago

Yes, I see - you use thrust::reduce_by_key() for groupby on the GPU, and thrust::make_discard_iterator() to optimize memory use.

In groupby, is MurmurHash used only to reduce the size of the (many) fields being grouped by, with the hashes then sorted on the GPU and reduced with reduce_by_key on the GPU? I.e., as I understand it (a rough sketch follows the list):

  1. all the fields being grouped by are always hashed with MurmurHash
  2. these hashes are sorted on the GPU
  3. all processing of the aggregation functions happens on the GPU via thrust::reduce_by_key()
  4. if the data consists of many segments, the results are merged with the previous results on the CPU using the hashes from (1) produced by MurmurHash, inserting these hashes into an unordered_map<>.
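A rough sketch (untested, hypothetical names, SUM aggregate assumed) of steps 1-3 for a single segment:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/discard_iterator.h>

void groupby_segment(thrust::device_vector<unsigned long long>& key_hashes, // (1) MurmurHash of the groupby fields
                     thrust::device_vector<double>& values,
                     thrust::device_vector<double>& reduced_values)
{
    // (2) sort the values by key hash on the GPU
    thrust::sort_by_key(key_hashes.begin(), key_hashes.end(), values.begin());

    // (3) aggregate on the GPU; the discard_iterator drops the output keys to save memory
    thrust::reduce_by_key(key_hashes.begin(), key_hashes.end(), values.begin(),
                          thrust::make_discard_iterator(), reduced_values.begin());
}
```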

And about strings - can a join on strings use both sort-merge and hash join (both on the GPU)?

antonmks commented 11 years ago

  1. Yes.
  2. There is no need to sort the hashes - we copy them to the host and do the update/insert on an unordered_map for every hash value.
  3. Yes.
  4. Correct.
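For point 2, roughly like this (an illustrative sketch; std::unordered_map stands in for the Boost one, and a SUM aggregate is assumed):

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// merge one segment's reduced results into the running host-side results
void merge_segment(std::unordered_map<unsigned long long, double>& totals,
                   const std::vector<unsigned long long>& seg_hashes, // key hashes copied from the GPU
                   const std::vector<double>& seg_sums)               // matching reduced sums
{
    for (size_t i = 0; i < seg_hashes.size(); ++i)
        totals[seg_hashes[i]] += seg_sums[i];  // insert-or-update per hash value
}
```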

A join on string keys is done by sort-merge join. A star join uses only integer keys and basically can only be done with a hash join.


AlexeyAB commented 11 years ago
  1. OK
  2. But if we use reduce_by_key(), then the keys must be sorted. If they are not, then only neighboring elements with equal hashes will be grouped, which happens very rarely, and the whole load of the groupby and the processing of the aggregation functions will fall on the CPU during the merging of the results (in 4).
  3. OK, using thrust::reduce_by_key() on GPU...
antonmks commented 11 years ago
  1. Yes, all the columns are sorted on the GPU by the groupby keys.
AlexeyAB commented 11 years ago

But if all the columns are sorted on the GPU by the groupby keys (hashes), then there is no need to use an unordered_map to merge them - we can use thrust::merge and thrust::reduce_by_key on the CPU with thrust::host_vector<>.

  1. the merge algorithm is much faster than using an unordered_map
  2. thrust::merge and thrust::reduce_by_key can use a multicore CPU via the OpenMP "backend": https://code.google.com/p/thrust/wiki/HostBackends

Or you can look at my little example of using the multicore host backend for Thrust: https://github.com/AlexeyAB/ELICuda
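A minimal sketch (untested) of the idea, assuming a Thrust-style merge_by_key() and a SUM aggregate; with thrust::host_vector<> the OpenMP backend can run it on all cores:

```cpp
// build with the OpenMP host backend, e.g.:
// nvcc -Xcompiler -fopenmp -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_OMP ...
#include <thrust/host_vector.h>
#include <thrust/merge.h>
#include <thrust/reduce.h>

// merge two sorted (hash, value) runs and reduce equal adjacent hashes
void merge_reduce_segments(const thrust::host_vector<unsigned long long>& keys_a,
                           const thrust::host_vector<double>& vals_a,
                           const thrust::host_vector<unsigned long long>& keys_b,
                           const thrust::host_vector<double>& vals_b,
                           thrust::host_vector<unsigned long long>& keys_out,
                           thrust::host_vector<double>& vals_out)
{
    thrust::host_vector<unsigned long long> merged_keys(keys_a.size() + keys_b.size());
    thrust::host_vector<double> merged_vals(vals_a.size() + vals_b.size());

    // single-pass merge of two sorted runs, carrying the values along
    thrust::merge_by_key(keys_a.begin(), keys_a.end(),
                         keys_b.begin(), keys_b.end(),
                         vals_a.begin(), vals_b.begin(),
                         merged_keys.begin(), merged_vals.begin());

    keys_out.resize(merged_keys.size());
    vals_out.resize(merged_vals.size());

    // collapse runs of equal keys; the default reduction op is plus
    size_t n = thrust::reduce_by_key(merged_keys.begin(), merged_keys.end(),
                                     merged_vals.begin(),
                                     keys_out.begin(), vals_out.begin())
                   .first - keys_out.begin();
    keys_out.resize(n);
    vals_out.resize(n);
}
```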

antonmks commented 11 years ago

You might be right, but I'm not sure about the speed of thrust::reduce_by_key on the host. I'll do some benchmarking to see if this makes sense.

Thanks!

Anton


AlexeyAB commented 11 years ago

reduce_by_key() should not be slower than doing the same thing manually, and the "ordered match" in reduce_by_key() cannot be slower than the "hash match" in an unordered_map<>.

If we have a sorted array, then there is nothing faster than using merge + reduce_by_key. I think the speed-up will be 3-5 times: memory allocation, its fragmentation problems, and resolving hash collisions in an unordered_map are very expensive operations.

Additionally, in my example I use a single unified template class for all instantiations, with a T_vec parameter, which can be std::vector<>, thrust::host_vector<> or thrust::device_vector<>. Depending on the vector type, it will use the conventional single-threaded CPU std::sort, the multithreaded OpenMP CPU thrust::sort, or the massively parallel GPU CUDA thrust::sort. https://github.com/AlexeyAB/ELICuda/blob/master/ELIdivision.h
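The idea in miniature (an illustrative sketch, not the real ELICuda code):

```cpp
#include <vector>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// One template for all backends: Thrust dispatches on the iterator's system,
// so std::vector<> and thrust::host_vector<> run on the host backend
// (single-thread or OpenMP, chosen at build time) and
// thrust::device_vector<> runs on the GPU via CUDA.
template <typename T_vec>
void sort_any(T_vec& v)
{
    thrust::sort(v.begin(), v.end());
}
```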

antonmks commented 11 years ago

Well, I'm not sure that it is possible to do a merge based on multiple keys of multiple types. Let me know if you know how to do it. Otherwise it would still need a hash of the groupby keys.

AlexeyAB commented 11 years ago

We already have:

  1. All key fields are hashed into a single array of hashes (from 1) using MurmurHash
  2. This single array of key-field hashes is sorted (from 2) before using reduce_by_key()
  3. Otherwise we could not use reduce_by_key() (in 3) - i.e., the keys are always a single sorted array of uint64_t.

And now only the fields being aggregated may have different types. And you need to get the grouped hash values from the GPU at least once (use a device_vector<> instead of thrust::make_discard_iterator()), as in my function merge_by_key<>() (in the Pull Request). Or use my function reduce_merged_by_key<, , false, >() and, once, reduce_merged_by_key<, , true, >() to get an array of grouped hashes on the GPU.

Do you mean that there is no merge_by_key()? I implemented it in the Pull Request. I didn't test it, so it may contain minor errors.

After reduce_by_key on the GPU we have on the GPU: a single array of key hashes (sorted and reduced) and all fields (reduced and grouped) by those keys. And to do merge_by_key() we need to do the following (a sketch of the loop follows the list):

  1. We copy from the GPU to the CPU host: the single array of key hashes (already sorted and reduced) and all arrays of fields (already reduced and grouped) by those key hashes.
  2. For the next segment we do the same thing and copy it to the CPU host.
  3. To merge + reduce_by_key each field, we use merge_by_key<>() and reduce_merged_by_key<>() from my Pull Request.
  4. We use this result of merging and reducing like the result from the GPU for segment 1, and go to (2).
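Putting (1)-(4) together, the segment loop would look roughly like this (untested; the two helpers are hypothetical stand-ins for the Pull Request functions):

```cpp
#include <thrust/host_vector.h>

// hypothetical helpers, declarations only:
// run the GPU sort + reduce for one segment and copy the results to the host
void groupby_segment_to_host(int seg,
                             thrust::host_vector<unsigned long long>& keys,
                             thrust::host_vector<double>& vals);
// merge two sorted (key, value) runs and reduce equal keys
void merge_reduce_segments(const thrust::host_vector<unsigned long long>& ka,
                           const thrust::host_vector<double>& va,
                           const thrust::host_vector<unsigned long long>& kb,
                           const thrust::host_vector<double>& vb,
                           thrust::host_vector<unsigned long long>& kout,
                           thrust::host_vector<double>& vout);

void groupby_all_segments(int num_segments)
{
    thrust::host_vector<unsigned long long> keys;  // running reduced key hashes
    thrust::host_vector<double> vals;              // running reduced aggregates

    for (int s = 0; s < num_segments; ++s) {
        thrust::host_vector<unsigned long long> seg_keys;
        thrust::host_vector<double> seg_vals;
        groupby_segment_to_host(s, seg_keys, seg_vals);    // steps (1)-(2)

        thrust::host_vector<unsigned long long> merged_keys;
        thrust::host_vector<double> merged_vals;
        merge_reduce_segments(keys, vals, seg_keys, seg_vals,
                              merged_keys, merged_vals);   // step (3)

        keys.swap(merged_keys);                            // step (4)
        vals.swap(merged_vals);
    }
}
```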
antonmks commented 11 years ago

Yes, it definitely can be done using hashes. It's probably worth it just to get rid of the Boost unordered_map. I'm not sure the hash_keys type needs to be a template parameter - it is always of the same type. Also, reduce_by_key needs to be done differently depending on the aggregation type. Anyway, I'll try to benchmark it and see if it makes sense to switch to the merge/reduce version of groupby in the next few days.

Thanks! Anton

AlexeyAB commented 11 years ago

Hi Anton! Yes, you can make the hash_keys type a non-template parameter. The different thrust::reduce_by_key variants on the GPU are selected through the last parameter of that function. For COUNT() on the GPU you would need to use something like sum(decode(field, NULL, 0, 1)), see https://github.com/antonmks/Alenka/issues/21#issuecomment-18101865

But in my reduce_merged_by_key() on the CPU:

where SQUARE(x) = x * x;

Regards, Alexey

antonmks commented 11 years ago

Probably the whole GroupBy op can be made even simpler. The results of the current segment's groupby could be copied to the host and simply appended to the results of the previous segments. After all segments are done, the result is sorted (on the host) by hash key and reduced by key. This way one can avoid merging every segment, giving better performance. It may require more host memory, but I think this would be a good trade-off.
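Roughly like this (an untested sketch, assuming a SUM aggregate):

```cpp
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// all_hashes / all_vals hold the per-segment results appended one after another
void finalize_groupby(thrust::host_vector<unsigned long long>& all_hashes,
                      thrust::host_vector<double>& all_vals,
                      thrust::host_vector<unsigned long long>& keys_out,
                      thrust::host_vector<double>& vals_out)
{
    // one sort over the concatenated results...
    thrust::sort_by_key(all_hashes.begin(), all_hashes.end(), all_vals.begin());

    keys_out.resize(all_hashes.size());
    vals_out.resize(all_vals.size());

    // ...followed by a single reduce_by_key
    size_t n = thrust::reduce_by_key(all_hashes.begin(), all_hashes.end(),
                                     all_vals.begin(),
                                     keys_out.begin(), vals_out.begin())
                   .first - keys_out.begin();
    keys_out.resize(n);
    vals_out.resize(n);
}
```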

Anton

AlexeyAB commented 11 years ago

Do you mean that we must merge each field, but do sort_by_key only once, where the keys are hashes by which sequential indices are sorted, and then all fields use thrust::gather instead of a sort?

Because, basically, a merge is usually 5-10 times faster than a sort: the merge is a fast single-pass algorithm.

Alexey

AlexeyAB commented 11 years ago

You can simply append to the results of the previous segments and then do merge_by_key<>(), using a sequential thrust::counting_iterator<> in place of the fields. As a result you get the same indices for thrust::gather as you would after sort_by_key, but about 5 times faster.

After this, you can append each field to the results of the previous segments, then do thrust::gather with the received indices; this gives you the fields sorted by hash.
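Roughly (an untested sketch, assuming a Thrust-style merge_by_key()):

```cpp
#include <thrust/host_vector.h>
#include <thrust/merge.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>

// merge the hashes once, with counting iterators standing in for the fields;
// the merged "values" are then gather indices valid for every field column
void merge_indices(const thrust::host_vector<unsigned long long>& hashes_a,
                   const thrust::host_vector<unsigned long long>& hashes_b,
                   thrust::host_vector<unsigned long long>& merged_hashes,
                   thrust::host_vector<int>& gather_idx)
{
    int na = (int)hashes_a.size(), nb = (int)hashes_b.size();
    merged_hashes.resize(na + nb);
    gather_idx.resize(na + nb);

    // indices 0..na-1 point into run A, na..na+nb-1 into run B appended after A
    thrust::merge_by_key(hashes_a.begin(), hashes_a.end(),
                         hashes_b.begin(), hashes_b.end(),
                         thrust::make_counting_iterator(0),
                         thrust::make_counting_iterator(na),
                         merged_hashes.begin(), gather_idx.begin());
}

// afterwards, for each appended field column:
// thrust::gather(gather_idx.begin(), gather_idx.end(), field.begin(), sorted_field.begin());
```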

If you agree with that, I can correct merge_by_key() to this variant.

Alexey

antonmks commented 11 years ago

Well, a merge works only with sorted data, so you need to sort anyway. What I suggest is that the results of each segment's groupby are appended to the host results (no sorts, merges or reduce_by_keys). After all segments are processed, the results on the host are sorted by hash key and reduced by hash key. This way we avoid having to call a merge function for every segment.


AlexeyAB commented 11 years ago

Yes. The benefits of your way (sorting once after all segments):

The benefits of the way with a merge for each segment:

The benefits of the way with a merge (after all segments) over each pair of segments / merge results, using log(count_of_segments) iterations:

Alexey

antonmks commented 11 years ago

Just a short update: I'm working on the merge/reduce version of groupby.

Anton

AlexeyAB commented 11 years ago

Ok. Thanks.

Alexey

antonmks commented 11 years ago

Merge/Reduce_by_key groupby is done.

Anton

P.S. Nice habrahabr C++ article, congratulations !

AlexeyAB commented 11 years ago

Thanks! :)

AlexeyAB commented 11 years ago

Hi Anton! Did you measure the performance after implementing Merge/Reduce_by_key and using pinned_allocator (https://github.com/antonmks/Alenka/issues/33#issuecomment-18836879) - how much does it differ from the previous version? Have you tested it on the GeForce Titan?

Regards, Alexey

antonmks commented 11 years ago

Yes, I have measured the performance with the pinned allocator. On a 300 GB dataset all queries run exactly 1 second faster. I will install the Titan this week and post the results. There probably won't be many changes to the code over the next few months; I will be figuring out the best way to implement Insert/Update/Delete operations - quite a few people have asked for them.

Regards, Anton

antonmks commented 11 years ago

Results of the 300X TPC-H queries on a GTX Titan 6 GB :+1: Q1: 29, Q2: 3, Q3: 15, Q5: 68, Q6: 2, Q7: 29

AlexeyAB commented 11 years ago

Excellent!

Generally the times are not much different from the GTX 580 - the Titan's potential has not been fully realized, so there is much more that can be optimized. If the time does not decrease after several runs with a full cache, then the bottleneck is the CPU host.

Which operations are currently executed only on the CPU (for example, because of a lack of GPU RAM), apart from the sort of static strings? Maybe I can speed something up.

antonmks commented 11 years ago

Well, if you look at the queries you will see that the queries with just filter and groupby operations run quite fast. (I usually extrapolate the results to 1000 GB and compare them with the top official TPC-H results like this: http://c970058.r58.cf2.rackcdn.com/individual_results/HP/HP-1TB-Superdome-64c-ES.pdf)

Best regards, Anton

AlexeyAB commented 11 years ago

What do you mean by correlation columns - for which types of join does it work, and which algorithms are used for it?

antonmks commented 11 years ago

It works like this: http://www.qdpma.com/tpch/TPCH100_Query_plans.html


AlexeyAB commented 11 years ago

This requires an optimizer, which does not yet exist in AlenkaDB, and it is unknown how well it will perform on real-life projects with a Storage Index but without a B-tree index. In comparisons with other DBMSs and servers, TPC-H should not matter much. But the fact that AlenkaDB gets the same results on the GTX 580 and the Titan tells us something about the possibilities for optimization :)

Best regards, Alexey

antonmks commented 11 years ago

Ok, here is a (very) short description of the architecture:

Columns are stored in separate files as binary arrays. Arrays are further divided and stored in segments of a few million values, and all processing is done on segments. Algorithms used:

  • SORT: thrust::stable_sort_by_key()
  • JOIN: mgpu::RelationalJoin() or a CUDPP hash join
  • JOIN for strings: strings are hashed to 64-bit unsigned integers and then joined using one of the above methods
  • FILTER: thrust::copy_if()
  • GROUP BY: segments are reduced on the GPU and then merged/reduced on the host with the previous results

The difference from classic implementations of PFOR and DELTA-PFOR is that Alenka does not implement outlier values, although that would be easy to add if needed. Also, all PFOR compression/decompression runs on the GPU. The same applies to Dictionary decompression.
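Without outliers, PFOR reduces to frame-of-reference plus bit packing, roughly like this (a CPU-side illustrative sketch; the real compression runs on the GPU):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// subtract the segment minimum and store each value in the minimum bit width;
// assumes a non-empty segment
std::vector<uint64_t> pfor_compress(const std::vector<int64_t>& seg,
                                    int64_t& base, int& bits)
{
    base = *std::min_element(seg.begin(), seg.end());
    int64_t range = *std::max_element(seg.begin(), seg.end()) - base;
    bits = 1;
    while ((range >> bits) != 0) ++bits;        // bits needed per value

    std::vector<uint64_t> packed((seg.size() * bits + 63) / 64, 0);
    for (size_t i = 0; i < seg.size(); ++i) {
        uint64_t v = (uint64_t)(seg[i] - base);
        size_t bit = i * (size_t)bits;
        packed[bit / 64] |= v << (bit % 64);
        if (bit % 64 + bits > 64)               // value straddles a 64-bit word
            packed[bit / 64 + 1] |= v >> (64 - bit % 64);
    }
    return packed;
}
```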

Not sure about MurmurHash conflict resolution - we cannot really check that all the column values match; that would be rather slow, and then what would be the point of using hashing at all?

Well, I hope this helps.