NVIDIA / nvcomp

Repository for nvCOMP docs and examples. nvCOMP is a library for fast lossless compression/decompression on the GPU that can be downloaded from https://developer.nvidia.com/nvcomp.

[QST] Could the cudaStreamSynchronize method be a bottleneck? #105

Open andreamartini opened 2 months ago

andreamartini commented 2 months ago

My question is: is it possible to reduce the CPU time required by cudaStreamSynchronize?

Context: my goal is to use the nvcomp library while occupying as little CPU time as possible for compression operations. For that reason, I measure the CPU time with std::chrono immediately before the data transfer from host to device, and immediately after retrieving the compressed data back on the host. To retrieve the compressed data on the host, in order to perform CPU-side operations (save compressed data to disk, send it over the network, ...), I have to call cudaStreamSynchronize(_cudaStream), where _cudaStream is the stream used by all cudaMemcpyAsync operations and by the nvcompBatchedZstdCompressAsync (or LZ4, Deflate, and so on) compression method. All operations, i.e. the host-to-device transfer, the compression, and the device-to-host transfer, are performed with asynchronous methods on _cudaStream.

The problem is that the CPU time I measure after invoking cudaStreamSynchronize seems high to me, and for my purposes it appears to be the bottleneck.

Some details:
OS: Windows 11, 32 GB RAM
CPU: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
GPU: NVIDIA GeForce GTX 1650

[image: table of compression results]

The table above contains the results of compressing data blocks made of a gradually increasing number of images (1-frame, 10-frame, 20-frame and 90-frame blocks) and the respective CPU times. Each image is 3.9 MB (all images have the same dimensions).

Looking at the "CPU Comp ElapsedTime (ms)" column, you can see that for a 3.9 MB image size, compression requires 20 to 30 ms (apart from the first case, which has a very high time that I could not explain). This time grows linearly as the data size increases, but the average CPU time per image (column "AvgCPU Compression Time per Image (ms)", obtained as CPU time / number of frames) varies between 12 and 17 ms.

Is it possible to reduce the CPU time required by cudaStreamSynchronize? That is, is it possible to get the average CPU time per image under 5 ms?

I tried creating k std::thread instances to compress k images asynchronously, using a different stream for each thread. Most likely I'm doing something wrong, as the total time for compressing the k images turns out to be high (500 ms). Could a multithreaded approach reduce the CPU time caused by cudaStreamSynchronize?
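For reference, the following is a minimal sketch of one way to overlap k images on k CUDA streams from a single CPU thread, with a single synchronization point at the end. The buffer names (h_input, d_input, imageBytes, and so on) and the elided nvcomp calls are hypothetical placeholders, not my actual code:

// Hypothetical sketch: one CUDA stream per image, all work queued asynchronously.
// Host buffers are assumed to be pinned (cudaMallocHost) so the copies really are asynchronous.
std::vector<cudaStream_t> streams(k);
for (int i = 0; i < k; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < k; ++i)
{
    // Queue the H2D copy, the compression and the D2H copy for image i on its own stream.
    cudaMemcpyAsync(d_input[i], h_input[i], imageBytes, cudaMemcpyHostToDevice, streams[i]);
    // ... nvcompBatchedZstdCompressAsync(..., streams[i]);
    // ... cudaMemcpyAsync(h_output[i], d_output[i], compressedBytes[i], cudaMemcpyDeviceToHost, streams[i]);
}

// Single synchronization point: block only when all results are needed.
for (int i = 0; i < k; ++i)
    cudaStreamSynchronize(streams[i]);

for (int i = 0; i < k; ++i)
    cudaStreamDestroy(streams[i]);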

Thank you in advance, Andrea

ndickson-nvidia commented 2 months ago

Hi! What you're measuring isn't the time that the CPU is active. The nvcomp compression functions just queue up the compression kernels to occur on the GPU asynchronously. cudaStreamSynchronize simply waits for the GPU stream to finish all of those queued GPU computations. If you profile your program using Nsight Systems on the command line and then open the report file using the Nsight Systems user interface, you should be able to get a better view of what's going on between the GPU and CPU code.
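To illustrate the distinction, here is a minimal sketch (with hypothetical buffer and stream names) that times the enqueue phase separately from the synchronization:

auto t0 = std::chrono::steady_clock::now();

// These calls only enqueue work on the stream and return almost immediately.
cudaMemcpyAsync(d_in, h_in, inBytes, cudaMemcpyHostToDevice, stream);
// nvcompBatchedZstdCompressAsync(..., stream);
cudaMemcpyAsync(h_out, d_out, outBytes, cudaMemcpyDeviceToHost, stream);

auto t1 = std::chrono::steady_clock::now();   // t1 - t0: CPU time actually spent enqueuing

cudaStreamSynchronize(stream);                // the CPU just waits here for the GPU to finish

auto t2 = std::chrono::steady_clock::now();   // t2 - t1: mostly GPU execution time, not CPU work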

andreamartini commented 2 months ago

Hi, I actually used a misleading name for the time column. What I am measuring is the system time that elapses between loading the data to be compressed from host to device and retrieving the compressed data from the device. Here is the updated table:

[image: updated table of compression results]

Here is a short pseudo-code snippet that shows where I read the system time:


// Start timer
auto tStartCompression = std::chrono::steady_clock::now();

// Copy data from host to device using async copies
// ...

// Call the compression algorithm, e.g.:
nvcompStatus_t BatchData::_runZStandardCompression()
{
    return nvcompBatchedZstdCompressAsync(
        _pDevice_inputData_as_Chunks_ptrs,
        _pDevice_inputData_chunks_size,
        _stChunkSize, // The maximum chunk size
        _stNumberOfChunks,
        _pDevice_temp_space_ptr,
        _stTemp_bytes,
        _pDevice_outputCompressed_Chunks_ptrs,
        _pDevice_compressed_chunks_size,
        _nvcompBatchedZStdOpts,
        _cudaStream);
}

// Copy the compressed data back from device to host, using async copies
// ...

// MUST: synchronize GPU and CPU code before reading the compressed data on the host
cudaStreamSynchronize(_cudaStream);   // This seems to be the bottleneck

// Stop timer
auto tEndCompression = std::chrono::steady_clock::now();

I would add that I perform the compression using the approach shown in the low-level example, since I need to decompress on another PC without using nvcomp, and therefore I need to recover the size of each compressed chunk (I have not found anything in the high-level example that allows me to do this).

However, to understand how to reduce the system time (and thus the bottleneck caused by cudaStreamSynchronize), I used Nsight to see what is happening between the GPU and CPU (as suggested).

[image: Nsight profiling screenshot]

The first suggestion says: "Threads are executed in groups of 32 threads called warps. This kernel launch is configured to execute 1 threads per block. Consequently, some threads in a warp are masked off and those hardware resources are unused. Try changing the number of threads per block to be a multiple of 32 threads. Between 128 and 256 threads per block is a good initial range for experimentation. Use smaller thread blocks rather than one large thread block per multiprocessor if latency affects performance. This is particularly beneficial to kernels that frequently call __syncthreads(). See the Hardware Model description for more details on launch configurations."

It seems that one of the suggestions is to size the number of threads per block appropriately. Is it possible to configure nvcomp to use a specific number of threads per block? Is there any other approach to reduce the waiting time caused by cudaStreamSynchronize?

Thank you

ndickson-nvidia commented 2 months ago

That looks like Nsight Compute, instead of Nsight Systems. Sorry, it's a bit confusing that the two different programs are named so similarly. Also, if you look at the Duration column, you'll see that the selected function it's referring to takes a negligible amount of time, so speeding it up wouldn't gain much overall.

is there any other approach to reduce the waiting time caused by the cudaStreamSynchronize?

The purpose of calling cudaStreamSynchronize is to wait for the GPU stream to finish the kernels queued up in it, so if you don't need the CUDA kernels launched by nvcomp to finish at that point, you don't need to call cudaStreamSynchronize there. Your program could continue running other computations, and call cudaStreamSynchronize when the results of the kernels are needed.
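As a minimal sketch of what that could look like (the helper names here are hypothetical, just to show where the synchronization moves):

// None of these calls block the CPU; they only queue work on the stream.
enqueueHostToDeviceCopies(stream);        // hypothetical helper wrapping cudaMemcpyAsync
enqueueNvcompCompression(stream);         // hypothetical helper wrapping nvcompBatchedZstdCompressAsync
enqueueDeviceToHostCopies(stream);        // hypothetical helper wrapping cudaMemcpyAsync

// Do unrelated CPU work here (prepare the next batch, write the previous batch
// to disk, network I/O, ...) while the GPU executes the queued work.
doOtherCpuWork();                         // hypothetical

// Block only at the point where the compressed bytes are actually consumed.
cudaStreamSynchronize(stream);
consumeCompressedData();                  // hypothetical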

As for things you can do to make the compression kernels run faster, it will depend on the kernel and the data. Trying a different number of chunks to split the input data into may or may not help. LZ4 has a data_type option in its nvcompBatchedLZ4Opts_t structure that can be NVCOMP_TYPE_CHAR (the default), NVCOMP_TYPE_SHORT, or NVCOMP_TYPE_INT, for treating the input data as 1-byte, 2-byte, or 4-byte integers. The equivalent options structure for Deflate has an option that selects among algorithms offering different trade-offs between compression quality and throughput.
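For example, a minimal sketch of setting the LZ4 data_type option (assuming the same parameter order as the batched Zstd call quoted above; the variable names are placeholders):

// Start from the default batched LZ4 options and override the data type.
nvcompBatchedLZ4Opts_t lz4Opts = nvcompBatchedLZ4DefaultOpts;  // data_type defaults to NVCOMP_TYPE_CHAR
lz4Opts.data_type = NVCOMP_TYPE_SHORT;                         // treat the input as 2-byte integers

nvcompStatus_t status = nvcompBatchedLZ4CompressAsync(
    device_input_chunk_ptrs,      // device array of pointers to uncompressed chunks
    device_input_chunk_sizes,     // device array of uncompressed chunk sizes
    max_chunk_size,               // maximum uncompressed chunk size
    num_chunks,
    device_temp_ptr,
    temp_bytes,
    device_output_chunk_ptrs,     // device array of pointers to compressed output chunks
    device_output_chunk_sizes,    // device array receiving the compressed chunk sizes
    lz4Opts,
    stream);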

andreamartini commented 2 months ago

The purpose of calling cudaStreamSynchronize is to wait for the GPU stream to finish the kernels queued up in it, so if you don't need the CUDA kernels launched by nvcomp to finish at that point, you don't need to call cudaStreamSynchronize there. Your program could continue running other computations, and call cudaStreamSynchronize when the results of the kernels are needed.

I call cudaStreamSynchronize on the specific stream exactly when I need the compressed data. I mean: first I copy data from host to device using cudaMemcpyAsync on _cudaStream, then I compress the data using the nvcomp API, i.e.:

 nvcompBatchedZstdCompressAsync(
            _pDevice_inputData_as_Chunks_ptrs,
            _pDevice_inputData_chunks_size,
            _stChunkSize, // The maximum chunk size
            _stNumberOfChunks,
            _pDevice_temp_space_ptr,
            _stTemp_bytes,
            _pDevice_outputCompressed_Chunks_ptrs,
            _pDevice_compressed_chunks_size,
            _nvcompBatchedZStdOpts,
            _cudaStream);

Finally, I copy the compressed data from device to host.

To do that, I use the following code snippet:

    // Compression in action ... 
    // Invoke the compression function pointed to by the function pointer _fpSpecializedCompression
        nvcompStatus_t compressionResult = (*this.*_fpSpecializedCompression)();

        if (compressionResult != nvcompSuccess)
        {
            bRet = false;
        }
        else
        {
            _stTotalCompressedSize = 0;

            /* Get compressed data ... */
            /* outputCompressedData is std::vector<std::vector<uint8_t>> */
            outputCompressedData.resize(_stNumberOfChunks);
            /* compressedChunksSize is std::vector<size_t> */
            compressedChunksSize.resize(_stNumberOfChunks);
            /* Get the compressed chunks' sizes */
            gpuErrchk(cudaMallocHost(&_pHost_compressed_chunks_size, sizeof(size_t) * _stNumberOfChunks));
            gpuErrchk(cudaMemcpyAsync(_pHost_compressed_chunks_size, _pDevice_compressed_chunks_size, sizeof(size_t) * _stNumberOfChunks, cudaMemcpyDeviceToHost, _cudaStream));

            //  uint8_t** _pHost_readback_outputCompressed_Chunks_ptrs;
            gpuErrchk(cudaMallocHost(&_pHost_readback_outputCompressed_Chunks_ptrs, sizeof(uint8_t*) * _stNumberOfChunks));

            // Copy each compressed chunk from device to host and then from host to
            // the std::vector used for saving compressed data to file
            for (size_t i = 0; i < _stNumberOfChunks; i++)
            {
                size_t chunkCompressedSize = _pHost_compressed_chunks_size[i];           
                gpuErrchk(cudaMallocHost(&_pHost_readback_outputCompressed_Chunks_ptrs[i], sizeof(uint8_t) * chunkCompressedSize));
                uint8_t* pHost_CompressedData_dst = _pHost_readback_outputCompressed_Chunks_ptrs[i];
                const uint8_t* pDevice_Compressedsrc = static_cast<uint8_t*>(_pHost_outputCompressed_Chunks_ptrs[i]);        
                gpuErrchk(cudaMemcpyAsync(pHost_CompressedData_dst, pDevice_Compressedsrc, sizeof(uint8_t) * chunkCompressedSize, cudaMemcpyDeviceToHost, _cudaStream));

                // POI 'A'
                /* I need to synchronize here, after the copy of each compressed chunk data*/
                gpuErrchk(cudaStreamSynchronize(_cudaStream));

                /*Copy the compressed chunk data to a vector which i use externally for saving data on file*/
                compressedChunksSize[i] = chunkCompressedSize;
                outputCompressedData[i].resize(chunkCompressedSize);
                memcpy(outputCompressedData[i].data(), pHost_CompressedData_dst, chunkCompressedSize * sizeof(uint8_t));
                _stTotalCompressedSize += chunkCompressedSize;
                // END POI 'A'
            }

         /* 
          // POI 'B'
          gpuErrchk(cudaStreamSynchronize(_cudaStream));
          for (size_t i = 0; i < _stNumberOfChunks; i++)
            {
                size_t chunkCompressedSize = _pHost_compressed_chunks_size[i];            
                uint8_t* pHost_CompressedData = _pHost_readback_outputCompressed_Chunks_ptrs[i];
                compressedChunksSize[i] = chunkCompressedSize;
                outputCompressedData[i].resize(chunkCompressedSize);
                memcpy(outputCompressedData[i].data(), pHost_CompressedData, chunkCompressedSize * sizeof(uint8_t));
                _stTotalCompressedSize += chunkCompressedSize;
            }
          // END POI 'B'
          */
        }

        cudaEventRecord(cudaEndEvent, _cudaStream);  // I record the cudaStartEvent on _cudaStream just before the copy of data from Host to device

        auto            tEndCompression = std::chrono::steady_clock::now();
        // tStartCompression = std::chrono::steady_clock::now() called before copy data from host to device
        _fmillisec_     elapsedTime = tEndCompression - tStartCompression;
        _milliseconds_  msElapsed = std::chrono::duration_cast<_milliseconds_>(elapsedTime);

        compressedReport.stCpuCompressionElapsedTime_ms = msElapsed.count();

        float fGpuCompressedElapsedTime_ms = 0.0f;
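        // Note: cudaEventElapsedTime requires cudaEndEvent to have completed
        // (e.g. via cudaEventSynchronize(cudaEndEvent)); otherwise it returns cudaErrorNotReady.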
        cudaEventElapsedTime(&fGpuCompressedElapsedTime_ms, cudaStartEvent, cudaEndEvent);

In the reported code, I inserted two POIs (points of interest), A and B. Currently I'm using POI A. With POI A, I call cudaStreamSynchronize _stNumberOfChunks times and I get the correct compressed data, but the elapsed time seems too high (as shown in the images in the previous posts, compressing 4 MB takes 12 to 17 ms on average). If I comment out the POI A block and uncomment the POI B block, in order to call cudaStreamSynchronize only once, pHost_CompressedData is null: the data does not seem to be ready. So I have to use the POI A block.
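For reference, here is a minimal sketch of one possible restructuring, reusing the member names from the snippet above (so the details are only illustrative): synchronize once after the sizes copy, so the host buffers can be sized, then queue all per-chunk copies and synchronize a second time before reading the copied bytes.

// 1) Copy the array of compressed sizes and wait for it, so the host can size its buffers.
gpuErrchk(cudaMemcpyAsync(_pHost_compressed_chunks_size, _pDevice_compressed_chunks_size,
                          sizeof(size_t) * _stNumberOfChunks, cudaMemcpyDeviceToHost, _cudaStream));
gpuErrchk(cudaStreamSynchronize(_cudaStream));   // sizes are now valid on the host

// 2) Queue all per-chunk device-to-host copies without synchronizing in between.
for (size_t i = 0; i < _stNumberOfChunks; i++)
{
    const size_t chunkCompressedSize = _pHost_compressed_chunks_size[i];
    gpuErrchk(cudaMallocHost(&_pHost_readback_outputCompressed_Chunks_ptrs[i], chunkCompressedSize));
    gpuErrchk(cudaMemcpyAsync(_pHost_readback_outputCompressed_Chunks_ptrs[i],
                              _pHost_outputCompressed_Chunks_ptrs[i],
                              chunkCompressedSize, cudaMemcpyDeviceToHost, _cudaStream));
}

// 3) One final synchronization before the host reads the chunk contents (as in the POI B block).
gpuErrchk(cudaStreamSynchronize(_cudaStream));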

If there are conceptual errors, could you please suggest the right way?

Moreover, is 12 ms a reasonable time for compressing a 4 MB file? With a GeForce GTX 1650 card, what would be a reference time? Finally, is the time measured between the host-to-device transfer, the compression, and the device-to-host transfer a valid indicator of compression performance?

Thank you.

github-actions[bot] commented 1 month ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

andreamartini commented 1 month ago

Dear all, are there any suggestions?

github-actions[bot] commented 2 days ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.