**sethrj** opened this issue 6 months ago
I did some CPU profiling using callgrind/cachegrind with the following setup: a `RelWithDebInfo` build with `CELERITAS_DEBUG=OFF`.
The graph below shows the estimated cycles spent in each function, weighting instruction fetches, L1 misses, and LL cache misses.
I noticed that `axpy` leads to many instruction cache misses, but that could be because I didn't pass `-march`/`-mtune` compiler options.
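For context, the commands below sketch one way to collect these numbers with valgrind; the binary name and input file are placeholders, not necessarily what was actually profiled:

```shell
# Hypothetical invocation (binary and input names are placeholders).
# cachegrind simulates the I1/D1/LL caches; callgrind with --cache-sim=yes
# additionally records the call graph used to produce the plot above.
valgrind --tool=cachegrind ./celer-sim regression.json
valgrind --tool=callgrind --cache-sim=yes ./celer-sim regression.json

# Annotate / visualize the results (output file names include the PID)
cg_annotate cachegrind.out.<pid>
kcachegrind callgrind.out.<pid>
```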
Looking at the L1 read misses, most of them come from `XsCalculator::get` calls within `XsCalculator::operator()`.
It'd be interesting to see the cache misses in a multithreaded scenario.
@esseivaju Is this with one track slot or the usual number (65K)? I guess the reason I wondered about single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot, and since the many-slot case is not really optimal (in terms of state cache locality and skipped loops due to masking) I wonder whether the call graph would look any different...
This is with 4k track slots.
> single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot,
Do you mean that in the single thread case, you saw better performance with one track slot?
OK 4k track slots, different than our usual regression CPU setting. What does the performance graph look like if you have a single track slot? (Make sure openmp is disabled! 😅) Because I would imagine that with a single track slot you'd get better cache performance for the particle state, even though the cache performance for the "params" data might go down.
Ok, I have some data with a single track slot. I had to set `max_steps=-1`, and OpenMP is disabled at build time. Without profiling, just running the regression problem takes ~3x longer with one track slot.
Repeatedly calling `ActionSequence::execute` has a large overhead because of `dynamic_cast` and freeing memory. I haven't located what is being freed, but it happens exactly 20x per `ActionSequence::execute`, so each action is doing it at some point.
Regarding cache efficiency, it isn't helping that much. Below, I'm showing the L1 cache misses per call to `AlongStepUniformMscAction::Execute` (aggregate of instruction misses plus read/write misses), where most cache misses happen.
The first picture is the single-track-slot scenario; the second is 65k track slots. As expected, there are far fewer misses per call since you process one track at a time; however, multiplied by how many times you have to call the function, the total becomes much worse.
In both cases, ~80% of the L1 misses are instruction fetches.
@esseivaju Looks like the allocation is coming from `actions()->label` being passed into `ScopedProfiling`. I'm opening a PR to use `string_view` for the action labels/descriptions and to delay string allocation in the scoped profiling implementation.
One of the open questions for our CMS integration is how well the detectors will work if we invert the `[track, step]` loop to `[step, track]`, as is necessary for GPU. I believe we can, without too much effort, add enhanced support for a single-track-slot mode that would give us better CPU performance and better integration characteristics: each state item stores `T` instead of `vector<T>`, and `CoreStatePtr` becomes `CoreState<value>*` instead of `CoreState<reference>*`, so that we don't have to do additional indirection on each state item.