NVIDIA-Genomics-Research / GenomeWorks

SDK for GPU accelerated genome assembly and analysis
https://clara-parabricks.github.io/GenomeWorks/
Apache License 2.0
281 stars 76 forks source link

[cudamapper] Small improvements to IndexGPU #543

Closed ahehn-nv closed 4 years ago

ahehn-nv commented 4 years ago

While reviewing a PR I noticed a) cudaStreamSynchronize() is missing a GW_CU_CHECK_ERR in https://github.com/clara-parabricks/GenomeWorks/blob/d715ab18b9a704726350613b6bb248a741b0d9f3/cudamapper/src/index_gpu.cuh#L781

b) I think the block around the mentioned cudaStreamSynchronize():

    cudautils::device_copy_n(merged_basepairs_h.data(), ...,  cuda_stream_); // H2D

    cudaStreamSynchronize(cuda_stream_);
    merged_basepairs_h.clear();
    merged_basepairs_h.shrink_to_fit();

    // sketch elements get generated here
    auto sketch_elements = SketchElementImpl::generate_sketch_elements(..., cuda_stream_);

could be changed to

    cudautils::device_copy_n(merged_basepairs_h.data(), ...,  cuda_stream_); // H2D

    // sketch elements get generated here
    auto sketch_elements = SketchElementImpl::generate_sketch_elements(..., cuda_stream_);

    cudaStreamSynchronize(cuda_stream_);
    merged_basepairs_h.clear();
    merged_basepairs_h.shrink_to_fit();

which could potentially allow for a bit more overlapping. @mimaric ?

mimaric commented 4 years ago

I agree about a)

You're also right about b), as both H2D copy and SketchElementImpl::generate_sketch_elements happen on the same stream there is no need to sync in between.

In fact I think I was too aggressive with deallocating host memory in indexer, no host memory is allocated after this point, so that memory can be deallocated at the end of the function, but a sync would be needed at the end of the function (i.e. before deallocating merged_basepairs_d) in that case. Anyway, there are a lot inefficiencies in IndexGPU and SketchElementImpl. For now Ill only address your commentsa)andb)` and leave further optimizations for a later point.