ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.
MIT License
212 stars 145 forks source link

CMake cleanup to prevent redundant work in client builds #1917

Closed ellosel closed 3 months ago

ellosel commented 4 months ago

The CMake support in Tensile (namely TensileConfig.cmake) is doing unnecessary work. In turn, the configure/build times in downstream projects can take up to 20% longer than is necessary. The unnecessary work is due to:

  1. Use of execute_process to invoke TensileCreateLibrary --generate-manifest-and-exit which is eagerly run at configuration time regardless if manifest generation is required.
  2. Using the contents of TensileManifest.txt as output for a subsequent add_custom_command call that is used to generate Tensile build artifacts.

In theory, it is reasonable to use the contents of the manifest as output for a subsequent add_custom_command call because this will ensure that the command is not considered satisfied unless all of the files in the manifest are generated. However, CMake needs to know what those files are at configuration time when it is configuring the custom command. Therefore, one must use execute_process to generate the manifest which has the consequence of redundantly calling to TensileCreateLibrary --generate-manifest-and-exit any time a client reconfigures whether or not it is necessary (and TensileCreateLibrary --generate-manifest-and-exit is expensive).

In practice, we would like to invoke TensileCreateLibrary once at build time prior to building dependent targets. By relaxing the output requirements of the custom command to just the manifest file. We can eliminate the execute_process call altogether. Further, we can provide an analog to checking that the custom command is satisfied (i.e. that it generated all of the expected files) by extending TensileCreateLibrary to enforce said check.

This PR includes the necessary changes to avoid the redundant work described above. Further, minor changes to verbosity related logic in TensileCreateLibrary were added to ensure users can turn off all output from Tensile.

The current default is set to turn off most of the Tensile output which gives the following output when configuring/building in the client (note the output at 0% which comes from the custom command in TensileConfig):

make -j32 install
[  0%] Generating libraries with TensileCreateLibrary
[  0%] Generating /mnt/host/projects/rocBLAS-internal/build/release/include/rocblas/internal/rocblas_device_malloc.hpp
[  1%] Generating prototypes from /mnt/host/projects/rocBLAS-internal/library/src.
[  1%] Built target rocblas_device_malloc
[  1%] Built target rocblas_proto_templates
Tensile::WARNING: Global parameter WriteMasterSolutionIndex = False unrecognized.
[  1%] Verifying files in /mnt/host/projects/rocBLAS-internal/build/release/Tensile/library/TensileManifest.txt were generated
Tensile::WARNING: Global parameter WriteMasterSolutionIndex = False unrecognized.
[  1%] Built target TENSILE_LIBRARY_TARGET
[  1%] Building CXX object Tensile/lib/CMakeFiles/TensileHost.dir/source/ArithmeticUnitTypes.cpp.o
[  1%] Building CXX object Tensile/lib/CMakeFiles/TensileHost.dir/source/AMDGPU.cpp.o

Users can reconfigure with -DTensile_VERBOSE=1 or -DTensile_VERBOSE=2 to get extra tensile output.

Initial testing includes:

ellosel commented 3 months ago

In general, I think the changes are good to help reduce the build time. Most of the code movement makes sense!

I would encourage checking a few more things while we are at it:

  1. Are the rocBLAS builds compatible with the changes? I think the original work was to prevent rocBLAS from building the entire tensile library multiple times (during cmake stage then during the actual build).
  2. If we were to change the contents of a YAML config file (say working on a library contents), would a tensile dirty build detect the changes and build accordingly with new manifest and libs copied to the correct place after? Or do you have to build the entire library from scratch again?
  3. Some tests appear to have been removed. Good idea to verify they are not needed anymore.
  1. Yes I tested these changes against the current version of develop to confirm we get the behavior we are after. rocBLAS only invokes TensileCreateLibrary during the build stage. There is one issue that remains in rocBLAS that I was going to fix once this is merged (i.e. the virtual env cmake scripts eagerly blow away the tensile install which has the effect of adding some extra config time).
  2. Correct, in this implementation, the CMake support is setup to observe changes in the Logic dir and retrigger calls to TensileCreateLibrary as needed. If the logic files don't change then nothing happens. This is a nice feature when you want to do incremental development on rocBLAS without making changes that would impact Tensile artifacts. Currently, every time someone configures rocBLAS they have to eagerly run tensile.
  3. Please let me know what tests and I will take another look!
cgmillette commented 3 months ago

LGTM pending CI and newline EOF fix

ellosel commented 3 months ago

Is DEPENDS_EXPLICIT_ONLY applicable to the uses of add_custom_command? In theory, it would be possible to run the Tensile build in parallel with the build of the library when compiling with ninja. The linking is the only part that actually needs to be serialized, but add_custom_command will force the serialization of the build by default.

Whether any build system has enough memory to build Tensile in parallel with the library perhaps another question, but it may be a nice option to have available.

Good find and I agree that would be nice. It looks like DEPENDS_EXPLICIT_ONLY was made available in cmake 3.27 which isn't available on all of the supported rocm platforms e.g. ubuntu 20 is 3.16. Currently, we have cmake_minimum_required set to 3.13.

cgmb commented 3 months ago

It looks like DEPENDS_EXPLICIT_ONLY was made available in cmake 3.27 which isn't available on all of the supported rocm platforms e.g. ubuntu 20 is 3.16. Currently, we have cmake_minimum_required set to 3.13.

One approach for those sorts of enhancements is to conditionally enable them. e.g.

set(command_options "")
if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.27)
  set(command_options DEPENDS_EXPLICIT_ONLY)
endif()
add_custom_command(mycommand
  ${command_options}
  ...
)
TorreZuk commented 3 months ago

Not sure what you are trying to build in parallel, but if using all the cores for each stage simpler to debug if steps are done in order. I would do additions in a new PR if experimenting.