NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[BUG]: Suboptimal swap performance on universal vectors #2948

Open gevtushenko opened 22 hours ago

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

Swapping universal vectors is unnecessarily slow and uses more memory than it should. Device and host vectors have swap overloads in the thrust:: namespace. These overloads call the efficient .swap member function:

https://github.com/NVIDIA/cccl/blob/80031e29baa11e1674b7d30770badeca0fbdb5dc/thrust/thrust/device_vector.h#L539-L543

Universal and universal host pinned vectors are missing such an overload, and the vector_base they alias has its swap overload in thrust::detail, so a qualified thrust::swap call cannot reach it.
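The lookup problem described above can be reproduced without CUDA. The following is a minimal sketch in plain C++, not Thrust's actual code: the `lib` namespace, its `vector_base`, the copy counter, and both helper functions are hypothetical stand-ins, and the copy-based generic swap is an assumption about how a fallback swap typically behaves. It shows that a qualified call only searches the named namespace, while an unqualified call lets argument-dependent lookup find the overload next to the class.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in library; none of these definitions are Thrust's.
namespace lib {
namespace detail {

// Counts deep copies so the two swap paths can be told apart.
inline int& copy_count() {
    static int count = 0;
    return count;
}

template <typename T>
class vector_base {
public:
    vector_base(std::size_t n, T value) : data_(n, value) {}
    vector_base(const vector_base& other) : data_(other.data_) { ++copy_count(); }
    vector_base& operator=(const vector_base& other) {
        data_ = other.data_;
        ++copy_count();
        return *this;
    }
    // Cheap member swap: exchanges the underlying buffers, copies no elements.
    void swap(vector_base& other) { data_.swap(other.data_); }

private:
    std::vector<T> data_;
};

// Efficient overload, but it lives in lib::detail rather than lib.
template <typename T>
void swap(vector_base<T>& a, vector_base<T>& b) {
    a.swap(b);
}

}  // namespace detail

// Generic fallback swap (assumed copy-based): needs a full temporary copy.
template <typename T>
void swap(T& a, T& b) {
    T temp = a;
    a = b;
    b = temp;
}

// Alias into detail, mirroring how the issue describes universal vectors.
template <typename T>
using universal_vector = detail::vector_base<T>;

}  // namespace lib

// Qualified call: only namespace lib is searched, so the generic swap is chosen.
int copies_with_qualified_swap() {
    lib::universal_vector<char> a(1024, 'a');
    lib::universal_vector<char> b(1024, 'b');
    lib::detail::copy_count() = 0;
    lib::swap(a, b);
    return lib::detail::copy_count();
}

// Unqualified call: argument-dependent lookup searches lib::detail.
int copies_with_adl_swap() {
    lib::universal_vector<char> a(1024, 'a');
    lib::universal_vector<char> b(1024, 'b');
    lib::detail::copy_count() = 0;
    swap(a, b);
    return lib::detail::copy_count();
}
```

In this sketch the qualified call performs three deep copies (one construction and two assignments), while the unqualified call performs none. Adding an overload of `swap` directly in the outer namespace that forwards to the member function would make the qualified call take the cheap path as well, which is the shape of the existing device and host vector overloads referenced in the issue.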

How to Reproduce


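#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/swap.h>
#include <thrust/universal_vector.h>

#include <chrono>
#include <iostream>

// Times a swap of two 1 GiB vectors of the given container type.
// The variadic template template parameter also accepts containers
// that take a defaulted allocator parameter.
template <template <typename...> class Vector>
void swap_time(const char* str) {
    Vector<char> a(1 << 30, 'a');
    Vector<char> b(1 << 30, 'b');

    auto begin = std::chrono::high_resolution_clock::now();
    thrust::swap(a, b);
    // a.swap(b);  // member swap alternative
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << str << " swap time: " << std::chrono::duration<double>(end - begin).count() << " s" << std::endl;
    // Print the first elements to confirm the contents were exchanged.
    std::cout << "a: " << static_cast<char>(a[0]) << "; "
              << "b: " << static_cast<char>(b[0]) << std::endl;
}

int main() {
    swap_time<thrust::device_vector>("device");
    swap_time<thrust::host_vector>("host");
    swap_time<thrust::universal_vector>("universal");
    swap_time<thrust::universal_host_pinned_vector>("universal_host_pinned");
}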

Expected behavior

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response