Closed: charleskawczynski closed this 8 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (f2735f2) 92.91% compared to head (ca56ec8) 92.91%.
I think the original kernel benchmark I added was heavily memory bound because of the output array. As a result, performance changes in the computation would be hidden by the bandwidth cost of loading that array.
What I think we're really interested in here is (1) flops inside a realistic kernel and (2) the memory reads necessary for a single thermo state.
I've updated the kernel benchmark to address the first part (flops). We should also write a test for the memory reads necessary for a single thermo state, but it will require a bit more work, and comparing a broadcast call against a custom kernel will be somewhat brittle. Interestingly, the kernel in this PR seems to be about 3x faster than the broadcast function call; I'll need to dig into that later.
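
To make the memory-bound argument concrete, here is a minimal roofline-style sketch in Python. It is not the benchmark from this PR: the function name `kernel_time_s` and the hardware numbers (`PEAK_FLOPS`, `BANDWIDTH`) and per-element costs are hypothetical values chosen only for illustration. The model assumes a kernel's runtime is set by whichever is slower, the arithmetic or the memory traffic.

```python
def kernel_time_s(n_elems, flops_per_elem, bytes_per_elem, peak_flops, bandwidth):
    """Estimate kernel runtime with a simple roofline model.

    Runtime is taken as the larger of the compute time and the
    memory-traffic time (assumes the two overlap perfectly).
    """
    compute_time = n_elems * flops_per_elem / peak_flops
    memory_time = n_elems * bytes_per_elem / bandwidth
    return max(compute_time, memory_time)


# Hypothetical hardware figures, not measurements from this PR.
PEAK_FLOPS = 1e13   # 10 TFLOP/s
BANDWIDTH = 1e12    # 1 TB/s
N = 1_000_000       # number of elements processed
BYTES = 16          # assumed bytes moved per element (e.g. one read, one write of a Float64)

# Low arithmetic intensity: doubling the flops does not change the runtime,
# because memory traffic dominates.
light_before = kernel_time_s(N, 10, BYTES, PEAK_FLOPS, BANDWIDTH)
light_after = kernel_time_s(N, 20, BYTES, PEAK_FLOPS, BANDWIDTH)

# High arithmetic intensity: the same relative change is now visible.
heavy_before = kernel_time_s(N, 1000, BYTES, PEAK_FLOPS, BANDWIDTH)
heavy_after = kernel_time_s(N, 2000, BYTES, PEAK_FLOPS, BANDWIDTH)

print(light_before == light_after)   # memory bound: change is hidden
print(heavy_after / heavy_before)    # compute bound: change shows up
```

Under these assumed numbers, a benchmark dominated by output-array traffic would report the same time before and after a change in flops, which is why a kernel with more arithmetic per element is a better probe of compute cost.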