ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.
MIT License
218 stars 147 forks source link

Adding option of rotating buffers for timing with cache eviction #1848

Closed mahmoodw closed 9 months ago

mahmoodw commented 10 months ago

This PR adds the 2 arguments discussed between rocBLAS and Tensile group, which determine the number of buffers that will be cycled through during the hot timing loop. See argument description for usage.

The overall strategy is to create enough buffers and cycle through them on each iteration of the hot loop so that when the same buffer is repeated, the cache no longer contains those values.

sdquiring commented 10 months ago

This feature would be far cleaner to implement within the DataInitialization class, turning m_gpuInputs into a vector.

nakajee commented 10 months ago

Why many CI tests are failing?

By the way, can we have a test case in precheckin or extended to verify this feature?

mahmoodw commented 10 months ago

Why many CI tests are failing?

By the way, can we have a test case in precheckin or extended to verify this feature?

I'll explore the CI failures. I was hoping I could get some suggestions for a subset of tests to replicate for this feature.

mahmoodw commented 10 months ago

This feature would be far cleaner to implement within the DataInitialization class, turning m_gpuInputs into a vector.

I'll look into encapsulating these changes in the class.

nakajee commented 10 months ago

Why many CI tests are failing? By the way, can we have a test case in precheckin or extended to verify this feature?

I'll explore the CI failures. I was hoping I could get some suggestions for a subset of tests to replicate for this feature.

This is some quick tips.

(1) static analysis just follow the suggestions in the log (2) precheckin it failed at the first yaml. The command line is normally in the log. python3 ./Tensile/bin/Tensile Tensile/Configs/build_client.yaml /tmp/.tensile-tox/py3/client pid=342

nakajee commented 9 months ago

Please fix static-analysis fail. Other fails seem to be known issues. You can disable them by merging the latest change in develop.

mahmoodw commented 9 months ago

This feature would be far cleaner to implement within the DataInitialization class, turning m_gpuInputs into a vector.

After exploring this more it seems simple enough to only add that to DataInitialization, but looks quite a lot more complicated to actually simplify the main loop code by modifying ContractionInputs and then KernelInvocation to hide the extra logic. Please share some insight if you see something that I am missing.

nakajee commented 9 months ago

CI test still fails. Not sure if it is related to your change. It might be better to re-run CI test.