Open JigaoLuo opened 2 years ago
select l_suppkey, count(*)
from lineitem
group by l_suppkey
order by l_suppkey
lineitem
<<<22, 128>>>>
Kernel1 | Kernel2 | Total Runtime | |
---|---|---|---|
DogQC | 13.3 ms | 1.7 ms | 15.0 ms |
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> |
13.7 ms | 1.7 ms | 15.4 ms |
Shared&Global Memory | 13.6 ms | 1.7 ms | 15.3 ms |
Shared&Global Memory + Perfect Global Hash Table Size | 6.9 ms | 0 | 6.9 ms |
In <<<22, 128>>>>
all hash tables full.
<<<440, 128>>>>
Kernel1 | Kernel2 | Total Runtime | |
---|---|---|---|
DogQC | 15.1 ms | 1.7 ms | 16.8 ms |
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> |
13.7 ms | 1.7 ms | 15.4 ms |
Shared&Global Memory | 15.0 ms | 1.7 ms | 16.7 ms |
Shared&Global Memory + Perfect Global Hash Table Size | 2.5 ms | 0 | 2.5 ms |
In <<<440, 128>>>>
all hash tables full.
More on: https://github.com/cakebytheoceanLuo/dogqc/issues/35
select l_orderkey, count(*)
from lineitem
group by l_orderkey
order by l_orderkey
lineitem
<<<22, 128>>>>
Kernel1 | Kernel2 | Total Runtime | |
---|---|---|---|
DogQC | 8.7 ms | 2.0 ms | 10.7 ms |
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> |
6.5 ms | 2.0 ms | 8.5 ms |
Shared&Global Memory | 8.9 ms | 2.0 ms | 10.9 ms |
In <<<22, 128>>>>
all hash tables full.
<<<440, 128>>>>
Kernel1 | Kernel2 | Total Runtime | |
---|---|---|---|
DogQC | 7.0 ms | 2.0 ms | 9.0 ms |
Shared Memory with <<<(6001215 + 128 - 1) / 128, 128>>> |
6.5 ms | 2.0 ms | 8.5 ms |
Shared&Global Memory | 7.1 ms | 2.0 ms | 9.1 ms |
In <<<440, 128>>>>
all hash tables full.
More on: https://github.com/cakebytheoceanLuo/dogqc/issues/32
hash table size
HT_FULL_FLAG
is set when hash conflict reaches 50 times aka the N
.
Not really a full hash table
krnl_lineitem1(... global_memory_hash_table gm_ht) {
__shared__ shared_memory_hash_table sm_ht;
volatile __shared__ int HT_FULL_FLAG; HT_FULL_FLAG = 0;
while (...) {
/// Insertion into shared memory hash table
get element for this iteration;
int inserted_index = insert_into_sm_ht(element);
if (inserted_index == -1) ++HT_FULL_FLAG;
/// Sync & Extra work
__syncthreads();
if (HT_FULL_FLAG != 0) {
copy_shared_memory_ht_to_global_memory();
__threadfence_block();
init_shared_memory_ht();
HT_FULL_FLAG = 0;
__syncthreads();
}
}
/// Last time, if shared memory hash table not full.
__syncthreads();
copy_shared_memory_ht_to_global_memory();
}
Memory 0.2 / 5.8 GB
Memory 0.5 / 5.8 GB
Result: 10000 tuples
l_suppkey: 7793 count_l_suppkey: 604
l_suppkey: 7829 count_l_suppkey: 594
l_suppkey: 6071 count_l_suppkey: 579
l_suppkey: 5449 count_l_suppkey: 616
l_suppkey: 3980 count_l_suppkey: 584
l_suppkey: 3505 count_l_suppkey: 600
l_suppkey: 5203 count_l_suppkey: 598
l_suppkey: 5329 count_l_suppkey: 574
l_suppkey: 5420 count_l_suppkey: 600
l_suppkey: 7774 count_l_suppkey: 590
[...]
<timing>
krnl_lineitem1: **15.4 ms**
krnl_aggregation2: 1.7 ms
finish: 0.0 ms
totalKernelTime: 17.1 ms
</timing>
(base) jigao@jigao-System-Product-Name:~/Desktop/dogqc/bin$ ./28
Memory 0.2 / 5.8 GB
Memory 0.5 / 5.8 GB
Shared memory usage: 28672 / 49152 bytes.
Result: 10000 tuples
l_suppkey: 5449 count_l_suppkey: 616
l_suppkey: 6071 count_l_suppkey: 579
l_suppkey: 7829 count_l_suppkey: 594
l_suppkey: 3980 count_l_suppkey: 584
l_suppkey: 7793 count_l_suppkey: 604
l_suppkey: 5203 count_l_suppkey: 598
l_suppkey: 5329 count_l_suppkey: 574
l_suppkey: 5420 count_l_suppkey: 600
l_suppkey: 3505 count_l_suppkey: 600
l_suppkey: 7774 count_l_suppkey: 590
[...]
<timing>
finish: 4.5 ms
krnl_lineitem1: **13.4 ms**
krnl_aggregation2: 1.7 ms
totalKernelTime: 15.0 ms
</timing>
Result: 100000 tuples
l_suppkey: 25875 count_l_suppkey: 620
l_suppkey: 84736 count_l_suppkey: 592
l_suppkey: 15544 count_l_suppkey: 598
l_suppkey: 19929 count_l_suppkey: 633
l_suppkey: 38985 count_l_suppkey: 606
l_suppkey: 93389 count_l_suppkey: 558
l_suppkey: 51879 count_l_suppkey: 560
l_suppkey: 99493 count_l_suppkey: 538
l_suppkey: 99776 count_l_suppkey: 591
l_suppkey: 79519 count_l_suppkey: 601
[...]
<timing>
krnl_lineitem1: **323.6 ms**
krnl_aggregation2: 16.5 ms
finish: 0.0 ms
totalKernelTime: 340.2 ms
</timing>
Memory 0.8 / 5.8 GB
Memory 3.9 / 5.8 GB
Shared memory usage: 28672 / 49152 bytes.
Result: 100000 tuples
l_suppkey: 84736 count_l_suppkey: 592
l_suppkey: 15544 count_l_suppkey: 598
l_suppkey: 25875 count_l_suppkey: 620
l_suppkey: 19929 count_l_suppkey: 633
l_suppkey: 38985 count_l_suppkey: 606
l_suppkey: 93389 count_l_suppkey: 558
l_suppkey: 99776 count_l_suppkey: 591
l_suppkey: 99493 count_l_suppkey: 538
l_suppkey: 94540 count_l_suppkey: 618
l_suppkey: 79519 count_l_suppkey: 601
[...]
<timing>
finish: 53.9 ms
krnl_lineitem1: **223.2 ms**
krnl_aggregation2: 16.5 ms
totalKernelTime: 239.7 ms
</timing>
Question: why sublinear from SF1 to SF10?
Answer: the copy_shared_memory_ht_to_global_memory();
is the non-linear part.
krnl_lineitem1: 18.5 ms
/// Insertion into shared memory hash table
: ~3.0 ms/// Sync & Extra work
: ~12.0 ms
__syncthreads();
: ~6.8 ms__syncthreads();
: ~0.2 mscopy_shared_memory_ht_to_global_memory();
: ~5.1 msinit_shared_memory_ht();
: 0.93 mskrnl_lineitem1: 247.8 ms
/// Insertion into shared memory hash table
: ~44.0 ms/// Sync & Extra work
: ~197.0 ms
__syncthreads();
: ~66.2 ms__syncthreads();
: ~2.4 mscopy_shared_memory_ht_to_global_memory();
: ~105.1 msinit_shared_memory_ht();
: 0.08 msWell, it is natural to think, if 2 times __syncthreads()
are too much. In my imagination, 2 times __syncthreads()
in a while-loop's iteration force all threads to march at the same speed to finish the iterations.
Hmm, if I just CPU threads to do this job, I suppose I can save the first and the second sync. (?)
But CUDA is nothing similar to CPU. CUDA hates sync inside of condition&branch: https://stackoverflow.com/questions/12519573/cuda-syncthreads-inside-if-statements
__syncthreads();
in the if
if (HT_FULL_FLAG != 0) {
__syncthreads();
copy_shared_memory_ht_to_global_memory();
__threadfence_block();
init_shared_memory_ht();
HT_FULL_FLAG = 0;
__syncthreads();
}
Not working.
A case: Thread0 in warp0 is blocked by __syncthreads()
. Thread1 in warp0 didn't step into the if
and not blocked. But the whole warp is blocked.
=> Deadlock
__syncwarp();
outside the loopI can just sync HT_FULL_FLAG
in a warp.
__syncwarp();
if (HT_FULL_FLAG != 0) {
__syncthreads();
copy_shared_memory_ht_to_global_memory();
__threadfence_block();
init_shared_memory_ht();
HT_FULL_FLAG = 0;
__syncthreads();
}
__syncwarp();
Not working.
Well, one global hash table seems like a hotspot to do parallel insertions. As discussed, maybe we need more than 1 global hash table.
Pseudocode