[EPIC] RAPIDS Should not need to patch CCCL

jrhemstad commented 4 months ago

RAPIDS currently has to apply a variety of patches to CCCL source code to work around various issues.

This issue is to track eliminating the need for RAPIDS to apply any patches to CCCL source code.

To start, for each patch we need a separate issue created with the following information:

Link to patch file
Summary of what the patch is for
Link to PR that added patch and any other relevant context
High level of proposed solution, or TBD

All issues should be added to this task list:

### Tasks
- [ ] test

trxcllnt commented 4 months ago

The patches we apply to all RAPIDS libraries:

revert_pr_211.diff (PR)

The patches we apply exclusively to cuDF:

bdice commented 4 months ago

I made some progress here. The patch reverting PR 211 is actually masking a bug in cuSpatial, which I fixed. This patch can be removed from rapids-cmake and cuDF. See these PRs I just opened. They should merge in this order:

Next steps would be to investigate the patches for only cuDF, specifically macros to reduce compile time (64 bit dispatch and fewer CUB arch policies).

jrhemstad commented 4 months ago

Next steps would be to investigate the patches for only cuDF, specifically macros to reduce compile time (64 bit dispatch and fewer CUB arch policies).

I just filed: https://github.com/NVIDIA/cccl/issues/1958

For fewer CUB arch policies, I'm still a bit confused tbh. My understanding of ChainedPolicy in CUB is that this patch should have no impact on compile time because CUB only instantiates kernels for the architectures you build for. I need @gevtushenko to look at that patch because he knows ChainedPolicy way better than I do.

bdice commented 4 months ago

All three PRs above have merged, which should fix up the RAPIDS CCCL devcontainer builds. The remaining cuDF patches should apply cleanly over the current CCCL. To continue removing those cuDF patches, I think #1958 is the next step.

gevtushenko commented 4 months ago

@jrhemstad thrust_faster_scan_compile_times.diff should not affect build times and can be deleted. cudf is compiled for Volata+. Mentioned patch essentially shifts the head of chained policy to cut older architectures. This saves a few instantiations of host functions but doesn't affect device code in any way. Since device code generation dominates the compilation time, this patch doesn't lead to compile time improvements. Here's one data point to prove this point:

With patch (chained policy starts at 600):

cat _deps/cccl-src/cub/cub/device/dispatch/tuning/tuning_scan.cuh | grep "ChainedPolicy<600"
      , ChainedPolicy<600, Policy600, Policy600>
for i in {1..10}; do echo "// comment " >> ../src/reductions/scan/scan_exclusive.cu; ninja > /dev/null; python ../scripts/sort_ninja_log.py .ninja_log | grep exclusive | awk -F',' '{print $1}' >> patched.csv; done; awk '{sum += $1} END {print sum/NR}' patched.csv
3407.3

cuobjdump --dump-sass CMakeFiles/cudf.dir/src/reductions/scan/scan_exclusive.cu.o | grep DeviceScanKernel | wc -l
480

Without patch (chained policy starts at 520):

cat _deps/cccl-src/cub/cub/device/dispatch/tuning/tuning_scan.cuh | grep "ChainedPolicy<600"
      , ChainedPolicy<600, Policy600, Policy520>
for i in {1..10}; do echo "// comment " >> ../src/reductions/scan/scan_exclusive.cu; ninja > /dev/null; python ../scripts/sort_ninja_log.py .ninja_log | grep exclusive | awk -F',' '{print $1}' >> unpatched.csv; done; awk '{sum += $1} END {print sum/NR}' unpatched.csv
3403.9

cuobjdump --dump-sass CMakeFiles/cudf.dir/src/reductions/scan/scan_exclusive.cu.o | grep DeviceScanKernel | wc -l
480

As you can see, the build time difference is within noise level. The number of kernels doesn't change as well. I believe the situation shouldn't be any different for reduction. But to be on the safe side, I'd recommend repeating the experiment for radix sort.

jrhemstad commented 4 months ago

@robertmaynard or @davidwendt would one of you mind checking to see if removing this patch from cudf does not impact compile time?

davidwendt commented 4 months ago

Reference this patch file: https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/cmake/thirdparty/patches/thrust_faster_scan_compile_times.diff I ran several build tests on libcudf and the radix-sort and reduce patches are definitely still needed. Removing the tuning_scan.cuh patch showed less impact in time but enough to be flagged by the Build Metrics Report -- it highlights any time changes greater than 20%. Also the size of the libcudf.so increased by about 2% without the tuning_scan patch. I could probably be convinced to remove this one patch but since we really want the other 2 I'm not sure why I would at this time.

jrhemstad commented 3 months ago

I'm not sure why I would at this time.

Because CCCL is currently building RAPIDS from source as part of CCCL's CI. That CI job is currently broken because internal changes within Thrust are now breaking RAPIDS' patches. We want to reduce the surface area of RAPIDS' patches as quickly as possible.

robertmaynard commented 3 months ago

I'm not sure why I would at this time.

Because CCCL is currently building RAPIDS from source as part of CCCL's CI. That CI job is currently broken because internal changes within Thrust are now breaking RAPIDS' patches. We want to reduce the surface area of RAPIDS' patches as quickly as possible.

rapids-cmake patches failing to apply aren't a CMake error. This is infact currently leveraged in libcudf 24.06 where we apply two sets of patches to support different CCCL versions.

If I look at a recent CCCL PR ( https://github.com/NVIDIA/cccl/pull/2006 ) I see the RAPIDS cudf builds occurring without issue. If we look at the logs we see:

2024-07-18T21:41:11.4476761Z -- Found libcudacxx: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/libcudacxx/lib/cmake/libcudacxx/libcudacxx-config.cmake (found suitable version "2.6.0.0", minimum required is "2.6.0.0")
2024-07-18T21:41:11.4490667Z -- Found Thrust: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/thrust/thrust/cmake/thrust-config.cmake (found suitable exact version "2.6.0.0")
2024-07-18T21:41:11.4510794Z -- Found CUB: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/cub/cub/cmake/cub-config.cmake (found suitable version "2.6.0.0", minimum required is "2.6.0.0")
2024-07-18T21:41:11.4532772Z -- Found CCCL: /home/coder/cudf/cpp/build/conda/cuda-12.2/release/_deps/cccl-src/lib/cmake/cccl/cccl-config.cmake (found version "2.6.0.0")
2024-07-18T21:41:11.4581525Z -- rapids-cmake [CCCL]: failed to apply diff thrust_disable_64bit_dispatching.diffrapids-cmake [CCCL]: git diff output: error: patch failed: thrust/thrust/system/cuda/detail/dispatch.h:44
2024-07-18T21:41:11.4582554Z -- error: thrust/thrust/system/cuda/detail/dispatch.h: patch does not apply
2024-07-18T21:41:11.4583282Z -- 
2024-07-18T21:41:11.4584169Z -- rapids-cmake [CCCL]: applied diff thrust_faster_sort_compile_times.diff to fix issue: 'Improve Thrust sort compile times by not unrolling loops for inlined comparators [https://github.com/rapidsai/cudf/pull/10577]'
2024-07-18T21:41:11.4585668Z -- rapids-cmake [CCCL]: applied diff thrust_faster_scan_compile_times.diff to fix issue: 'Improve Thrust scan compile times by reducing the number of kernels generated [https://github.com/rapidsai/cudf/pull/8183]'
2024-07-18T21:41:11.5233800Z -- CPM: Using local package fmt@10.1.1
2024-07-18T21:41:11.5260202Z -- CPM: Using local package CCCL@2.5.0

While we failed to apply the thrust_disable_64bit_dispatching patch the CMake process continued without issue.

NVIDIA / cccl

[EPIC] RAPIDS Should not need to patch CCCL #1939