JigaoLuo commented 2 years ago

Pseudocode

kernel1(... global_memory_hash_table gm_ht) {
  __shared__ shared_memory_hash_table sm_ht;
  __shared__ int sm_ht_full; sm_ht_full = 0;
  while (...) {
      get element for this iteration;
      if (sm_ht_full == 0) {
        insert element into sm_ht;
        if (sm_ht.full()) sm_ht_full = 1;
      } else {
        insert element into gm_ht;
      }
  }

  get a sh_ht entry;
  if (entry != EMPTY)
    insert this entry into gm_ht;
}

kernel2(... global_memory_hash_table gm_ht) {
  get a sh_ht entry;
  if (entry != EMPTY)
    write entry to output
}

JigaoLuo commented 2 years ago

Query28

SQL

select l_suppkey, count(*)
from lineitem
group by l_suppkey
order by l_suppkey

Input Cardinality: 6001215 as the size of relation lineitem
Output Cardinality: 10000

Performance

`<<<22, 128>>>>`

	Kernel1	Kernel2	Total Runtime
DogQC	13.3 ms	1.7 ms	15.0 ms
Shared Memory with `<<<(6001215 + 128 - 1) / 128, 128>>>`	13.7 ms	1.7 ms	15.4 ms
Shared&Global Memory	13.6 ms	1.7 ms	15.3 ms
Shared&Global Memory + Perfect Global Hash Table Size	6.9 ms	0	6.9 ms

In <<<22, 128>>>> all hash tables full.

`<<<440, 128>>>>`

	Kernel1	Kernel2	Total Runtime
DogQC	15.1 ms	1.7 ms	16.8 ms
Shared Memory with `<<<(6001215 + 128 - 1) / 128, 128>>>`	13.7 ms	1.7 ms	15.4 ms
Shared&Global Memory	15.0 ms	1.7 ms	16.7 ms
Shared&Global Memory + Perfect Global Hash Table Size	2.5 ms	0	2.5 ms

In <<<440, 128>>>> all hash tables full.

More on: https://github.com/cakebytheoceanLuo/dogqc/issues/35

JigaoLuo commented 2 years ago

Query27

SQL

select l_orderkey, count(*) 
from lineitem 
group by l_orderkey
order by l_orderkey

Input Cardinality: 6001215 as the size of relation lineitem
Output Cardinality: 1500000

Performance

`<<<22, 128>>>>`

	Kernel1	Kernel2	Total Runtime
DogQC	8.7 ms	2.0 ms	10.7 ms
Shared Memory with `<<<(6001215 + 128 - 1) / 128, 128>>>`	6.5 ms	2.0 ms	8.5 ms
Shared&Global Memory	8.9 ms	2.0 ms	10.9 ms

In <<<22, 128>>>> all hash tables full.

`<<<440, 128>>>>`

	Kernel1	Kernel2	Total Runtime
DogQC	7.0 ms	2.0 ms	9.0 ms
Shared Memory with `<<<(6001215 + 128 - 1) / 128, 128>>>`	6.5 ms	2.0 ms	8.5 ms
Shared&Global Memory	7.1 ms	2.0 ms	9.1 ms

In <<<440, 128>>>> all hash tables full.

More on: https://github.com/cakebytheoceanLuo/dogqc/issues/32

JigaoLuo commented 2 years ago

hash table size

cache effect?

JigaoLuo commented 2 years ago

Pseudocode

HT_FULL_FLAG is set when hash conflict reaches 50 times aka the N. Not really a full hash table

krnl_lineitem1(... global_memory_hash_table gm_ht) {
  __shared__ shared_memory_hash_table sm_ht;
  volatile __shared__ int HT_FULL_FLAG; HT_FULL_FLAG = 0;

  while (...) {
      /// Insertion into shared memory hash table
      get element for this iteration;
      int inserted_index = insert_into_sm_ht(element);
      if (inserted_index == -1) ++HT_FULL_FLAG;

      /// Sync & Extra work
      __syncthreads();
      if (HT_FULL_FLAG != 0) {
          copy_shared_memory_ht_to_global_memory();
          __threadfence_block();
          init_shared_memory_ht();
          HT_FULL_FLAG = 0;
          __syncthreads();
      }
  }

  /// Last time, if shared memory hash table not full.
  __syncthreads();
  copy_shared_memory_ht_to_global_memory();
}

Performance Example SQL 28

35

SF1: 15.4 ms V.S. 13.4 ms

DogQC

Memory 0.2 / 5.8 GB
Memory 0.5 / 5.8 GB

Result: 10000 tuples
l_suppkey:     7793  count_l_suppkey:      604  
l_suppkey:     7829  count_l_suppkey:      594  
l_suppkey:     6071  count_l_suppkey:      579  
l_suppkey:     5449  count_l_suppkey:      616  
l_suppkey:     3980  count_l_suppkey:      584  
l_suppkey:     3505  count_l_suppkey:      600  
l_suppkey:     5203  count_l_suppkey:      598  
l_suppkey:     5329  count_l_suppkey:      574  
l_suppkey:     5420  count_l_suppkey:      600  
l_suppkey:     7774  count_l_suppkey:      590  
[...]

<timing>
                  krnl_lineitem1:   **15.4 ms**
               krnl_aggregation2:    1.7 ms
                          finish:    0.0 ms
                 totalKernelTime:   17.1 ms
</timing>

Shared Memory Version

(base) jigao@jigao-System-Product-Name:~/Desktop/dogqc/bin$ ./28
Memory 0.2 / 5.8 GB
Memory 0.5 / 5.8 GB
Shared memory usage: 28672 / 49152 bytes.

Result: 10000 tuples
l_suppkey:     5449  count_l_suppkey:      616  
l_suppkey:     6071  count_l_suppkey:      579  
l_suppkey:     7829  count_l_suppkey:      594  
l_suppkey:     3980  count_l_suppkey:      584  
l_suppkey:     7793  count_l_suppkey:      604  
l_suppkey:     5203  count_l_suppkey:      598  
l_suppkey:     5329  count_l_suppkey:      574  
l_suppkey:     5420  count_l_suppkey:      600  
l_suppkey:     3505  count_l_suppkey:      600  
l_suppkey:     7774  count_l_suppkey:      590  
[...]

<timing>
                          finish:    4.5 ms
                  krnl_lineitem1:   **13.4 ms**
               krnl_aggregation2:    1.7 ms
                 totalKernelTime:   15.0 ms
</timing>

SF10: 323.6 ms V.S. 223.2 ms

DogQC

Result: 100000 tuples
l_suppkey:    25875  count_l_suppkey:      620  
l_suppkey:    84736  count_l_suppkey:      592  
l_suppkey:    15544  count_l_suppkey:      598  
l_suppkey:    19929  count_l_suppkey:      633  
l_suppkey:    38985  count_l_suppkey:      606  
l_suppkey:    93389  count_l_suppkey:      558  
l_suppkey:    51879  count_l_suppkey:      560  
l_suppkey:    99493  count_l_suppkey:      538  
l_suppkey:    99776  count_l_suppkey:      591  
l_suppkey:    79519  count_l_suppkey:      601  
[...]

<timing>
                  krnl_lineitem1:  **323.6 ms**
               krnl_aggregation2:   16.5 ms
                          finish:    0.0 ms
                 totalKernelTime:  340.2 ms
</timing>

Shared Memory Version

Memory 0.8 / 5.8 GB
Memory 3.9 / 5.8 GB
Shared memory usage: 28672 / 49152 bytes.

Result: 100000 tuples
l_suppkey:    84736  count_l_suppkey:      592  
l_suppkey:    15544  count_l_suppkey:      598  
l_suppkey:    25875  count_l_suppkey:      620  
l_suppkey:    19929  count_l_suppkey:      633  
l_suppkey:    38985  count_l_suppkey:      606  
l_suppkey:    93389  count_l_suppkey:      558  
l_suppkey:    99776  count_l_suppkey:      591  
l_suppkey:    99493  count_l_suppkey:      538  
l_suppkey:    94540  count_l_suppkey:      618  
l_suppkey:    79519  count_l_suppkey:      601  
[...]

<timing>
                          finish:   53.9 ms
                  krnl_lineitem1:  **223.2 ms**
               krnl_aggregation2:   16.5 ms
                 totalKernelTime:  239.7 ms
</timing>

Question: why sublinear from SF1 to SF10? Answer: the copy_shared_memory_ht_to_global_memory(); is the non-linear part.

The global hash table becomes dense and dense with an increasing number of hash collisions and probing.
The huge number of parallel insertions make the global memory hash table a hotspot. Check the breakdown in the following part. [Detail TODO]

Runtime Breakdown - Query 28

SF1

krnl_lineitem1: 18.5 ms

/// Insertion into shared memory hash table: ~3.0 ms
/// Sync & Extra work: ~12.0 ms
- first __syncthreads();: ~6.8 ms
- second __syncthreads();: ~0.2 ms
- copy_shared_memory_ht_to_global_memory();: ~5.1 ms
- init_shared_memory_ht();: 0.93 ms

SF10

krnl_lineitem1: 247.8 ms

/// Insertion into shared memory hash table: ~44.0 ms
/// Sync & Extra work: ~197.0 ms
- first __syncthreads();: ~66.2 ms
- second __syncthreads();: ~2.4 ms
- copy_shared_memory_ht_to_global_memory();: ~105.1 ms
- init_shared_memory_ht();: 0.08 ms

Thinking for Profiling

Well, it is natural to think, if 2 times __syncthreads() are too much. In my imagination, 2 times __syncthreads() in a while-loop's iteration force all threads to march at the same speed to finish the iterations. Hmm, if I just CPU threads to do this job, I suppose I can save the first and the second sync. (?) But CUDA is nothing similar to CPU. CUDA hates sync inside of condition&branch: https://stackoverflow.com/questions/12519573/cuda-syncthreads-inside-if-statements

Idea 1: let `__syncthreads();` in the `if`

      if (HT_FULL_FLAG != 0) {
       __syncthreads();
          copy_shared_memory_ht_to_global_memory();
          __threadfence_block();
          init_shared_memory_ht();
          HT_FULL_FLAG = 0;
      __syncthreads();
      }

Not working. A case: Thread0 in warp0 is blocked by __syncthreads(). Thread1 in warp0 didn't step into the if and not blocked. But the whole warp is blocked. => Deadlock

Idea 2: use `__syncwarp();` outside the loop

I can just sync HT_FULL_FLAG in a warp.

      __syncwarp();
      if (HT_FULL_FLAG != 0) {
          __syncthreads();
          copy_shared_memory_ht_to_global_memory();
          __threadfence_block();
          init_shared_memory_ht();
          HT_FULL_FLAG = 0;
          __syncthreads();
      }
      __syncwarp();

Not working.

More Thinking for Profiling

Well, one global hash table seems like a hotspot to do parallel insertions. As discussed, maybe we need more than 1 global hash table.

JigaoLuo / dogqc

shared memory ht + global memory ht #47

Pseudocode

Query28

SQL

Performance

`<<<22, 128>>>>`

`<<<440, 128>>>>`

Query27

SQL

Performance

`<<<22, 128>>>>`

`<<<440, 128>>>>`

Pseudocode

Performance Example SQL 28

35

SF1: 15.4 ms V.S. 13.4 ms

DogQC

Shared Memory Version

SF10: 323.6 ms V.S. 223.2 ms

DogQC

Shared Memory Version

Runtime Breakdown - Query 28

SF1

SF10

Thinking for Profiling

Idea 1: let `__syncthreads();` in the `if`

Idea 2: use `__syncwarp();` outside the loop

More Thinking for Profiling

JigaoLuo / dogqc

shared memory ht + global memory ht #47

Pseudocode

Query28

SQL

Performance

<<<22, 128>>>>

<<<440, 128>>>>

Query27

SQL

Performance

<<<22, 128>>>>

<<<440, 128>>>>

Pseudocode

Performance Example SQL 28

35

SF1: 15.4 ms V.S. 13.4 ms

DogQC

Shared Memory Version

SF10: 323.6 ms V.S. 223.2 ms

DogQC

Shared Memory Version

Runtime Breakdown - Query 28

SF1

SF10

Thinking for Profiling

Idea 1: let __syncthreads(); in the if

Idea 2: use __syncwarp(); outside the loop

More Thinking for Profiling

`<<<22, 128>>>>`

`<<<440, 128>>>>`

`<<<22, 128>>>>`

`<<<440, 128>>>>`

Idea 1: let `__syncthreads();` in the `if`

Idea 2: use `__syncwarp();` outside the loop