NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Question about stream time metric #5720

Closed cfangplus closed 2 years ago

cfangplus commented 2 years ago

hi,

I ran a SQL query that contains four stages. The first stage scans the parquet files and prepares the shuffle write data for the next stage; its mean task time is about 4s. To reduce the time spent moving data from disk to the GPU, I ran CACHE TABLE before the query, so the first stage should now transfer data from DRAM to the GPU, which I expected to be faster. However, it is not. When I compared the run details in the SQL tab, I found a new "stream time" metric of about 3.6s, which pushes the mean task time up to 6s. Why? That seems hard to believe. Does the earlier (uncached) run simply not include stream time?
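For reference, a minimal sketch of the scenario described above (the parquet path, table name, and query are hypothetical placeholders, not from the original report), run with the RAPIDS plugin enabled:

```scala
// Sketch only: hypothetical paths, table names, and query.
// Assumes the RAPIDS plugin is enabled, e.g. --conf spark.plugins=com.nvidia.spark.SQLPlugin
val df = spark.read.parquet("/data/events")     // hypothetical parquet source
df.createOrReplaceTempView("events")

// Baseline: the query reads parquet directly, which is already columnar,
// so no row-to-column conversion is needed on the GPU side.
spark.sql("SELECT key, COUNT(*) FROM events GROUP BY key").collect()

// Cache the table, then rerun. With the default cache serializer the cached
// data is row-oriented, so the GPU plan now has to convert rows back to columns.
spark.sql("CACHE TABLE events")
spark.sql("SELECT key, COUNT(*) FROM events GROUP BY key").collect()
```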

cfangplus commented 2 years ago

Providing more details and my guess below; this still needs to be verified. The additional stream time metric mentioned above comes from the GpuRowToColumnar node. Although CACHE TABLE persists the table data in DRAM, the cached data is still row-oriented, and since a columnar format is better suited to GPU processing, the plugin has to convert rows to columns, which is what GpuRowToColumnar does. Parquet, on the other hand, is already a columnar storage format, so the earlier (uncached) case needs no row-to-column conversion, i.e. no GpuRowToColumnar at all. That is why the earlier case is faster and performs better. Right?
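One way to check this guess is to look at the physical plan of the cached query; a sketch, reusing the hypothetical `events` table from above (exact node names may vary by plugin version):

```scala
// Inspect the physical plan for the cached query. With the default cache
// serializer the plan is expected to show a row-to-column conversion sitting
// on top of the in-memory table scan.
spark.sql("SELECT key, COUNT(*) FROM events GROUP BY key").explain()
// Expected shape (roughly):
//   GpuRowToColumnar
//   +- InMemoryTableScan ...
```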

viadea commented 2 years ago

Hi @cfangplus Not sure if you have tried PCBS (the ParquetCachedBatchSerializer)?

Could you share the Physical plan for the GPU runs before and after the "cache"?
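For context, PCBS is enabled through the cache serializer config; a minimal sketch, assuming the standard spark-rapids setting (it is a static conf, so it has to be set when the session/application starts):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the ParquetCachedBatchSerializer (PCBS) so cached data is
// stored in a columnar, parquet-like layout instead of the default row-based
// CachedBatch, avoiding the GpuRowToColumnar step when scanning the cache.
val spark = SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.sql.cache.serializer",
          "com.nvidia.spark.ParquetCachedBatchSerializer")
  .getOrCreate()
```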

cfangplus commented 2 years ago

Yeah, after I enabled PCBS the physical plan is indeed different from the one without PCBS: the GpuRowToColumnar node after the in-memory table scan disappeared. Thanks @viadea