Closed mahmoodw closed 9 months ago
This feature would be far cleaner to implement within the DataInitialization class, turning m_gpuInputs
into a vector.
Why many CI tests are failing?
By the way, can we have a test case in precheckin or extended to verify this feature?
Why many CI tests are failing?
By the way, can we have a test case in precheckin or extended to verify this feature?
I'll explore the CI failures. I was hoping I could get some suggestions for a subset of tests to replicate for this feature.
This feature would be far cleaner to implement within the DataInitialization class, turning
m_gpuInputs
into a vector.
I'll look into encapsulating these changes in the class.
Why many CI tests are failing? By the way, can we have a test case in precheckin or extended to verify this feature?
I'll explore the CI failures. I was hoping I could get some suggestions for a subset of tests to replicate for this feature.
This is some quick tips.
(1) static analysis just follow the suggestions in the log (2) precheckin it failed at the first yaml. The command line is normally in the log. python3 ./Tensile/bin/Tensile Tensile/Configs/build_client.yaml /tmp/.tensile-tox/py3/client pid=342
Please fix static-analysis fail. Other fails seem to be known issues. You can disable them by merging the latest change in develop.
This feature would be far cleaner to implement within the DataInitialization class, turning
m_gpuInputs
into a vector.
After exploring this more it seems simple enough to only add that to DataInitialization, but looks quite a lot more complicated to actually simplify the main loop code by modifying ContractionInputs and then KernelInvocation to hide the extra logic. Please share some insight if you see something that I am missing.
CI test still fails. Not sure if it is related to your change. It might be better to re-run CI test.
This PR adds the 2 arguments discussed between rocBLAS and Tensile group, which determine the number of buffers that will be cycled through during the hot timing loop. See argument description for usage.
The overall strategy is to create enough buffers and cycle through them on each iteration of the hot loop so that when the same buffer is repeated, the cache no longer contains those values.