JigaoLuo / dogqc

A query compiler for GPUs that translates relational algebra to Cuda.
MIT License
0 stars 0 forks source link

shared memory ht + global memory ht #47

Open JigaoLuo opened 2 years ago

JigaoLuo commented 2 years ago

Pseudocode

kernel1(... global_memory_hash_table gm_ht) {
  __shared__ shared_memory_hash_table sm_ht;
  __shared__ int sm_ht_full; sm_ht_full = 0;
  while (...) {
      get element for this iteration;
      if (sm_ht_full == 0) {
        insert element into sm_ht;
        if (sm_ht.full()) sm_ht_full = 1;
      } else {
        insert element into gm_ht;
      }
  }

  get a sh_ht entry;
  if (entry != EMPTY)
    insert this entry into gm_ht;
}
kernel2(... global_memory_hash_table gm_ht) {
  get a sh_ht entry;
  if (entry != EMPTY)
    write entry to output
}
JigaoLuo commented 2 years ago

Query28

SQL

select l_suppkey, count(*)
from lineitem
group by l_suppkey
order by l_suppkey

Performance

<<<22, 128>>>>

Kernel1 Kernel2 Total Runtime
DogQC 13.3 ms 1.7 ms 15.0 ms
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> 13.7 ms 1.7 ms 15.4 ms
Shared&Global Memory 13.6 ms 1.7 ms 15.3 ms
Shared&Global Memory + Perfect Global Hash Table Size 6.9 ms 0 6.9 ms

In <<<22, 128>>>> all hash tables full.

<<<440, 128>>>>

Kernel1 Kernel2 Total Runtime
DogQC 15.1 ms 1.7 ms 16.8 ms
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> 13.7 ms 1.7 ms 15.4 ms
Shared&Global Memory 15.0 ms 1.7 ms 16.7 ms
Shared&Global Memory + Perfect Global Hash Table Size 2.5 ms 0 2.5 ms

In <<<440, 128>>>> all hash tables full.

More on: https://github.com/cakebytheoceanLuo/dogqc/issues/35

JigaoLuo commented 2 years ago

Query27

SQL

select l_orderkey, count(*) 
from lineitem 
group by l_orderkey
order by l_orderkey

Performance

<<<22, 128>>>>

Kernel1 Kernel2 Total Runtime
DogQC 8.7 ms 2.0 ms 10.7 ms
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> 6.5 ms 2.0 ms 8.5 ms
Shared&Global Memory 8.9 ms 2.0 ms 10.9 ms

In <<<22, 128>>>> all hash tables full.

<<<440, 128>>>>

Kernel1 Kernel2 Total Runtime
DogQC 7.0 ms 2.0 ms 9.0 ms
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> 6.5 ms 2.0 ms 8.5 ms
Shared&Global Memory 7.1 ms 2.0 ms 9.1 ms

In <<<440, 128>>>> all hash tables full.

More on: https://github.com/cakebytheoceanLuo/dogqc/issues/32

JigaoLuo commented 2 years ago

hash table size

JigaoLuo commented 2 years ago

Pseudocode

HT_FULL_FLAG is set when hash conflict reaches 50 times aka the N. Not really a full hash table

krnl_lineitem1(... global_memory_hash_table gm_ht) {
  __shared__ shared_memory_hash_table sm_ht;
  volatile __shared__ int HT_FULL_FLAG; HT_FULL_FLAG = 0;

  while (...) {
      /// Insertion into shared memory hash table
      get element for this iteration;
      int inserted_index = insert_into_sm_ht(element);
      if (inserted_index == -1) ++HT_FULL_FLAG;

      /// Sync & Extra work
      __syncthreads();
      if (HT_FULL_FLAG != 0) {
          copy_shared_memory_ht_to_global_memory();
          __threadfence_block();
          init_shared_memory_ht();
          HT_FULL_FLAG = 0;
          __syncthreads();
      }
  }

  /// Last time, if shared memory hash table not full.
  __syncthreads();
  copy_shared_memory_ht_to_global_memory();
}

Performance Example SQL 28

35

SF1: 15.4 ms V.S. 13.4 ms

DogQC

Memory 0.2 / 5.8 GB
Memory 0.5 / 5.8 GB

Result: 10000 tuples
l_suppkey:     7793  count_l_suppkey:      604  
l_suppkey:     7829  count_l_suppkey:      594  
l_suppkey:     6071  count_l_suppkey:      579  
l_suppkey:     5449  count_l_suppkey:      616  
l_suppkey:     3980  count_l_suppkey:      584  
l_suppkey:     3505  count_l_suppkey:      600  
l_suppkey:     5203  count_l_suppkey:      598  
l_suppkey:     5329  count_l_suppkey:      574  
l_suppkey:     5420  count_l_suppkey:      600  
l_suppkey:     7774  count_l_suppkey:      590  
[...]

<timing>
                  krnl_lineitem1:   **15.4 ms**
               krnl_aggregation2:    1.7 ms
                          finish:    0.0 ms
                 totalKernelTime:   17.1 ms
</timing>

Shared Memory Version

(base) jigao@jigao-System-Product-Name:~/Desktop/dogqc/bin$ ./28
Memory 0.2 / 5.8 GB
Memory 0.5 / 5.8 GB
Shared memory usage: 28672 / 49152 bytes.

Result: 10000 tuples
l_suppkey:     5449  count_l_suppkey:      616  
l_suppkey:     6071  count_l_suppkey:      579  
l_suppkey:     7829  count_l_suppkey:      594  
l_suppkey:     3980  count_l_suppkey:      584  
l_suppkey:     7793  count_l_suppkey:      604  
l_suppkey:     5203  count_l_suppkey:      598  
l_suppkey:     5329  count_l_suppkey:      574  
l_suppkey:     5420  count_l_suppkey:      600  
l_suppkey:     3505  count_l_suppkey:      600  
l_suppkey:     7774  count_l_suppkey:      590  
[...]

<timing>
                          finish:    4.5 ms
                  krnl_lineitem1:   **13.4 ms**
               krnl_aggregation2:    1.7 ms
                 totalKernelTime:   15.0 ms
</timing>

SF10: 323.6 ms V.S. 223.2 ms

DogQC

Result: 100000 tuples
l_suppkey:    25875  count_l_suppkey:      620  
l_suppkey:    84736  count_l_suppkey:      592  
l_suppkey:    15544  count_l_suppkey:      598  
l_suppkey:    19929  count_l_suppkey:      633  
l_suppkey:    38985  count_l_suppkey:      606  
l_suppkey:    93389  count_l_suppkey:      558  
l_suppkey:    51879  count_l_suppkey:      560  
l_suppkey:    99493  count_l_suppkey:      538  
l_suppkey:    99776  count_l_suppkey:      591  
l_suppkey:    79519  count_l_suppkey:      601  
[...]

<timing>
                  krnl_lineitem1:  **323.6 ms**
               krnl_aggregation2:   16.5 ms
                          finish:    0.0 ms
                 totalKernelTime:  340.2 ms
</timing>

Shared Memory Version

Memory 0.8 / 5.8 GB
Memory 3.9 / 5.8 GB
Shared memory usage: 28672 / 49152 bytes.

Result: 100000 tuples
l_suppkey:    84736  count_l_suppkey:      592  
l_suppkey:    15544  count_l_suppkey:      598  
l_suppkey:    25875  count_l_suppkey:      620  
l_suppkey:    19929  count_l_suppkey:      633  
l_suppkey:    38985  count_l_suppkey:      606  
l_suppkey:    93389  count_l_suppkey:      558  
l_suppkey:    99776  count_l_suppkey:      591  
l_suppkey:    99493  count_l_suppkey:      538  
l_suppkey:    94540  count_l_suppkey:      618  
l_suppkey:    79519  count_l_suppkey:      601  
[...]

<timing>
                          finish:   53.9 ms
                  krnl_lineitem1:  **223.2 ms**
               krnl_aggregation2:   16.5 ms
                 totalKernelTime:  239.7 ms
</timing>

Question: why sublinear from SF1 to SF10? Answer: the copy_shared_memory_ht_to_global_memory(); is the non-linear part.


Runtime Breakdown - Query 28

SF1

krnl_lineitem1: 18.5 ms

SF10

krnl_lineitem1: 247.8 ms


Thinking for Profiling

Well, it is natural to think, if 2 times __syncthreads() are too much. In my imagination, 2 times __syncthreads() in a while-loop's iteration force all threads to march at the same speed to finish the iterations. Hmm, if I just CPU threads to do this job, I suppose I can save the first and the second sync. (?) But CUDA is nothing similar to CPU. CUDA hates sync inside of condition&branch: https://stackoverflow.com/questions/12519573/cuda-syncthreads-inside-if-statements

Idea 1: let __syncthreads(); in the if

      if (HT_FULL_FLAG != 0) {
       __syncthreads();
          copy_shared_memory_ht_to_global_memory();
          __threadfence_block();
          init_shared_memory_ht();
          HT_FULL_FLAG = 0;
      __syncthreads();
      }

Not working. A case: Thread0 in warp0 is blocked by __syncthreads(). Thread1 in warp0 didn't step into the if and not blocked. But the whole warp is blocked. => Deadlock

Idea 2: use __syncwarp(); outside the loop

I can just sync HT_FULL_FLAG in a warp.

      __syncwarp();
      if (HT_FULL_FLAG != 0) {
          __syncthreads();
          copy_shared_memory_ht_to_global_memory();
          __threadfence_block();
          init_shared_memory_ht();
          HT_FULL_FLAG = 0;
          __syncthreads();
      }
      __syncwarp();

Not working.

More Thinking for Profiling

Well, one global hash table seems like a hotspot to do parallel insertions. As discussed, maybe we need more than 1 global hash table.