NVIDIA / CUDALibrarySamples

CUDA Library Samples

Related issues about using NvJPEG to implement image encoding and decoding on RTX3060 #194

Open OroChippw opened 2 months ago

OroChippw commented 2 months ago

Thanks for the contributions in this repository. I am a beginner with nvJPEG and am trying to use an RTX 3060 to compress PNG or BMP images. I have a few questions, as follows. Resolution of the input image: 8432 * 40000. CUDA version: 11.6.

  1. It takes about 334 ms to encode and compress an image of this size on the RTX 3060, and about 12xx ms to decode. Is this time consumption normal?
  2. After I finish decoding, I try to convert the image from nvjpeg_image to cv::Mat format. In getCVImage below, I use a per-pixel loop for this conversion. Is there a faster way?

    cv::Mat NvjpegCompressRunnerImpl::getCVImage(const unsigned char *d_chanB, int pitchB,
                                                 const unsigned char *d_chanG, int pitchG,
                                                 const unsigned char *d_chanR, int pitchR,
                                                 int width, int height)
    {
        cudaEvent_t start, end;
        float milliseconds = 0.0;
        CHECK_CUDA(cudaEventCreate(&start));
        CHECK_CUDA(cudaEventCreate(&end));

        CHECK_CUDA(cudaEventRecord(start));

        cv::Mat cvImage(height, width, CV_8UC3); // BGR
        std::vector<unsigned char> vchanR(height * width);
        std::vector<unsigned char> vchanG(height * width);
        std::vector<unsigned char> vchanB(height * width);
        unsigned char *chanR = vchanR.data();
        unsigned char *chanG = vchanG.data();
        unsigned char *chanB = vchanB.data();

        // Copy each pitched device plane into a tightly packed host buffer
        // (each plane uses its own pitch).
        CHECK_CUDA(cudaMemcpy2D(chanR, (size_t)width, d_chanR, (size_t)pitchR,
                        width, height, cudaMemcpyDeviceToHost));
        CHECK_CUDA(cudaMemcpy2D(chanG, (size_t)width, d_chanG, (size_t)pitchG,
                        width, height, cudaMemcpyDeviceToHost));
        CHECK_CUDA(cudaMemcpy2D(chanB, (size_t)width, d_chanB, (size_t)pitchB,
                        width, height, cudaMemcpyDeviceToHost));

        // Interleave the three host planes into the BGR cv::Mat, one pixel at a time.
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                cvImage.at<cv::Vec3b>(y, x) = cv::Vec3b(chanB[y * width + x], chanG[y * width + x], chanR[y * width + x]);
            }
        }

        CHECK_CUDA(cudaEventRecord(end));
        CHECK_CUDA(cudaEventSynchronize(end));

        CHECK_CUDA(cudaEventElapsedTime(&milliseconds, start, end));

        CHECK_CUDA(cudaEventDestroy(start));
        CHECK_CUDA(cudaEventDestroy(end));

        std::cout << "=> getCVImage execution time: " << milliseconds << " ms" << std::endl;

        return cvImage;
    }
  3. If I try to encode and decode the same picture on a GT 1030 GPU, the program crashes directly. Should the large picture be divided into smaller pictures? Can multiple small pictures be compressed asynchronously? Thank you again for your contribution; looking forward to your reply.
zohebk-nv commented 1 month ago

It takes about 334ms to encode and compress an image of this size on the 3060 GPU, and about 12xxms to decode. Is this time consumption normal?

The number is plausible given the size of your image. If possible, please use the nsys (Nsight Systems) tool to generate a profile; this can help confirm that there are no other bottlenecks.

After I finish decoding, I try to convert the image from nvjpeg_image to cv::Mat format. In getCVImage above, I use a per-pixel loop for this conversion. Is there a faster way?

I'm not too familiar with cv::Mat, so I won't be able to answer your question definitively. However, I did find this link (https://answers.opencv.org/question/134322/initialize-mat-from-pointer-help/) on opencv.org which seems similar to your question. Hope this helps.
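For what it's worth, here is a minimal sketch of the pointer-based approach from that link, assuming the three planes have already been copied to tightly packed host buffers as in getCVImage above (the function name planesToBGR is just illustrative):

    #include <opencv2/core.hpp>
    #include <vector>

    // Build a BGR cv::Mat from three packed single-channel host planes without a
    // per-pixel loop: wrap each plane in a Mat header (no copy) and let cv::merge
    // interleave them in one pass.
    cv::Mat planesToBGR(const unsigned char *chanB, const unsigned char *chanG,
                        const unsigned char *chanR, int width, int height)
    {
        cv::Mat b(height, width, CV_8UC1, const_cast<unsigned char *>(chanB));
        cv::Mat g(height, width, CV_8UC1, const_cast<unsigned char *>(chanG));
        cv::Mat r(height, width, CV_8UC1, const_cast<unsigned char *>(chanR));

        cv::Mat bgr;
        cv::merge(std::vector<cv::Mat>{b, g, r}, bgr); // allocates and fills the interleaved image
        return bgr; // owns its own data; safe to use after the plane buffers are freed
    }

cv::merge does the interleaving in a single vectorized pass, so the per-pixel at<cv::Vec3b> loop can be dropped entirely.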

If I try to encode and decode the same picture on a GT 1030 GPU, the program crashes directly.

Would it be possible for you to try with a recent CUDA toolkit (12.5) to see if the crash can be reproduced? We've made a lot of fixes since CUDA 11.6. If you still see the crash, it would be helpful if you could share self-contained reproducer code so that we can root-cause this at our end.

Should the large picture be divided into small pictures? Can multiple small pictures be compressed asynchronously?

If this is on a GT 1030, dividing the image into smaller pictures will help, since the GT 1030 only has 2 GB of memory. Small images can be compressed asynchronously to an extent; synchronization will be required when retrieving the compressed bitstream back to memory. You will have to use multiple instances of the nvJPEG encoder to achieve asynchronous compression.
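This is not an official sample, but roughly what that suggestion looks like in code: a sketch that encodes a batch of tiles with one nvjpegEncoderState_t/nvjpegEncoderParams_t pair per CUDA stream, assuming the tiles are already resident on the device as planar BGR nvjpegImage_t buffers (tile setup and error checking are omitted, and encodeTilesAsync is just an illustrative name):

    #include <nvjpeg.h>
    #include <cuda_runtime.h>
    #include <algorithm>
    #include <vector>

    // Encode a list of device-resident tiles, using one nvJPEG encoder state/params
    // pair per CUDA stream so that encodes in the same batch can overlap on the GPU.
    std::vector<std::vector<unsigned char>> encodeTilesAsync(
        nvjpegHandle_t handle,
        const std::vector<nvjpegImage_t> &tiles, // planar BGR device buffers, already filled
        int tileWidth, int tileHeight)
    {
        const int numStreams = 4; // illustrative
        std::vector<cudaStream_t> streams(numStreams);
        std::vector<nvjpegEncoderState_t> states(numStreams);
        std::vector<nvjpegEncoderParams_t> params(numStreams);

        for (int i = 0; i < numStreams; i++)
        {
            cudaStreamCreate(&streams[i]);
            nvjpegEncoderStateCreate(handle, &states[i], streams[i]);
            nvjpegEncoderParamsCreate(handle, &params[i], streams[i]);
            nvjpegEncoderParamsSetQuality(params[i], 90, streams[i]);
            nvjpegEncoderParamsSetSamplingFactors(params[i], NVJPEG_CSS_420, streams[i]);
        }

        std::vector<std::vector<unsigned char>> bitstreams(tiles.size());

        for (size_t base = 0; base < tiles.size(); base += numStreams)
        {
            size_t batch = std::min(static_cast<size_t>(numStreams), tiles.size() - base);

            // Launch the encodes for this batch; each stream works independently.
            for (size_t i = 0; i < batch; i++)
            {
                nvjpegEncodeImage(handle, states[i], params[i], &tiles[base + i],
                                  NVJPEG_INPUT_BGR, tileWidth, tileHeight, streams[i]);
            }

            // Retrieve the compressed bitstreams; this is the synchronization point.
            for (size_t i = 0; i < batch; i++)
            {
                size_t length = 0;
                nvjpegEncodeRetrieveBitstream(handle, states[i], nullptr, &length, streams[i]);
                bitstreams[base + i].resize(length);
                nvjpegEncodeRetrieveBitstream(handle, states[i], bitstreams[base + i].data(),
                                              &length, streams[i]);
                cudaStreamSynchronize(streams[i]);
            }
        }

        for (int i = 0; i < numStreams; i++)
        {
            nvjpegEncoderParamsDestroy(params[i]);
            nvjpegEncoderStateDestroy(states[i]);
            cudaStreamDestroy(streams[i]);
        }
        return bitstreams;
    }

The encode calls issued on different streams can overlap on the GPU; the sync happens around nvjpegEncodeRetrieveBitstream, which has to wait for its stream's encode to finish before the bitstream can be copied back to the host.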

OroChippw commented 1 month ago

Thank you very much for your reply. I used the OpenCV pointer constructor to build the cv::Mat, which improved the speed a lot. Is there any relevant sample for reference for nvJPEG asynchronous stream compression with CUDA? Thank you.