Would it be possible to convert the cudaMallocPitch calls to cudaMalloc? I understand why cudaMallocPitch was chosen, but the row-alignment penalties it avoids are less noticeable on today's GPUs with larger caches.
The main driver for this request is interoperability with DALI. DALI loads batches of images using cudaMalloc because it is not concerned with what is being loaded; the data could be something other than an image.
Currently, each image must be copied from its cudaMalloc location to a new cudaMallocPitch location. If CudaSift used cudaMalloc, the operation could be performed in place, saving both memory and time.
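To illustrate the idea, here is a minimal host-side sketch. It assumes the processing code addresses rows through a byte pitch, which I have not verified for every CudaSift kernel. Under that assumption, a tightly packed cudaMalloc buffer is just a pitched buffer whose pitch equals `width * sizeof(float)`, so it could be wrapped without a copy. The names `ImageView`, `row_ptr`, `wrap_packed`, and `sum_pixels` are hypothetical and are not part of CudaSift or DALI.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning view of a single-channel float image.
struct ImageView {
    float* data;         // pointer to the first pixel
    std::size_t width;   // pixels per row
    std::size_t height;  // number of rows
    std::size_t pitch;   // bytes between the starts of consecutive rows
};

// Locate row y using the byte pitch, the same way pitched allocations are indexed.
inline float* row_ptr(const ImageView& v, std::size_t y) {
    unsigned char* base = reinterpret_cast<unsigned char*>(v.data);
    return reinterpret_cast<float*>(base + y * v.pitch);
}

// Wrap a tightly packed buffer in place: the pitch is simply the row size in bytes.
inline ImageView wrap_packed(float* data, std::size_t width, std::size_t height) {
    return ImageView{data, width, height, width * sizeof(float)};
}

// Example consumer: reads only the first `width` floats of each row,
// so it works for both padded and packed layouts.
inline double sum_pixels(const ImageView& v) {
    double total = 0.0;
    for (std::size_t y = 0; y < v.height; ++y) {
        const float* row = row_ptr(v, y);
        for (std::size_t x = 0; x < v.width; ++x) {
            total += row[x];
        }
    }
    return total;
}
```

If the kernels already take the pitch as a parameter, accepting an externally allocated pointer together with its pitch might be enough, without changing how CudaSift allocates its own internal buffers.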