celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html

Implement single-track CPU for performance and improve integration #1172

Open sethrj opened 6 months ago

sethrj commented 6 months ago

One of the open questions for our CMS integration is how well the detectors will work if we invert the [track, step] loop to [step, track], as is necessary for GPU. I believe that without too much effort we can add support for a single-track-slot mode that would give us better CPU performance and better integration characteristics.
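
To make the inversion concrete, here is a toy sketch (the `step`/`any_alive` helpers are hypothetical stand-ins, not actual Celeritas code):

```cpp
#include <algorithm>
#include <vector>

struct Track
{
    bool alive = true;
    int steps_left = 3;
};

// Stand-in for one physics step applied to one track
void step(Track& t)
{
    if (--t.steps_left <= 0)
        t.alive = false;
}

bool any_alive(std::vector<Track> const& tracks)
{
    return std::any_of(tracks.begin(), tracks.end(),
                       [](Track const& t) { return t.alive; });
}

// [track, step]: natural on CPU; one track's state stays hot in cache
void track_major(std::vector<Track>& tracks)
{
    for (auto& track : tracks)
        while (track.alive)
            step(track);
}

// [step, track]: required to batch work for GPU; every step sweeps the
// state of every track slot, alive or not
void step_major(std::vector<Track>& tracks)
{
    while (any_alive(tracks))
        for (auto& track : tracks)
            if (track.alive)
                step(track);
}
```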

  1. Add a new memspace for "compact host"
  2. Value type for compact host stores T instead of vector<T>
  3. CoreStatePtr becomes CoreState<value>* instead of CoreState<reference>* so that we don't have to do additional indirection on each state item (see the sketch below).
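
A minimal sketch of what items 1 and 2 could look like (hypothetical names; the real collection/memspace machinery is more involved):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical memspace tag extending the usual host/device pair
enum class MemSpace
{
    host,
    device,
    compact_host
};

template<class T, MemSpace M>
struct StateItem;

// Usual host storage: one entry per track slot
template<class T>
struct StateItem<T, MemSpace::host>
{
    std::vector<T> data;
    T& operator[](std::size_t i) { return data[i]; }
};

// Compact host storage: exactly one track slot, so the value is stored
// directly and access is indirection-free
template<class T>
struct StateItem<T, MemSpace::compact_host>
{
    T data;
    T& operator[](std::size_t) { return data; }
};
```

With a value-owning state like that, item 3 follows: a CoreState<value>* drops the extra pointer chase on every state access.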
esseivaju commented 5 months ago

I did some CPU profiling using callgrind/cachegrind with the following setup:

The graph below shows the estimated cycles spent in each function, weighting instruction fetches and L1 and LL cache misses.
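
(If the weighting is KCachegrind's usual cycle-estimation formula, which I'm assuming here, it is roughly `CEst = Ir + 10*L1m + 100*LLm`: an L1 miss is charged like ten instruction fetches and a last-level miss like a hundred.)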

(figure: `testem3_fm_64p` call-graph screenshot)

I noticed that axpy leads to many instruction cache misses, but that could be because I didn't pass `-march`/`-mtune` compiler options.

Looking at the L1 read misses, most of them come from XsCalculator::get calls within XsCalculator::operator().
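
Roughly, the access pattern looks like this simplified stand-in (not the actual XsCalculator implementation; bounds clamping omitted):

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Log-spaced cross-section grid with linear interpolation
class XsCalculatorSketch
{
  public:
    XsCalculatorSketch(double log_e_min, double log_e_step,
                       std::vector<double> xs)
        : log_e_min_(log_e_min), log_e_step_(log_e_step), xs_(std::move(xs))
    {
    }

    double operator()(double energy) const
    {
        double f = (std::log(energy) - log_e_min_) / log_e_step_;
        auto i = static_cast<std::size_t>(f);
        double frac = f - static_cast<double>(i);
        // Two data-dependent loads into a large shared table: for
        // uncorrelated track energies these reads jump around the table
        // and show up as L1 read misses
        return (1 - frac) * this->get(i) + frac * this->get(i + 1);
    }

  private:
    double get(std::size_t i) const { return xs_[i]; }

    double log_e_min_;
    double log_e_step_;
    std::vector<double> xs_;
};
```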

It'd be interesting to see the cache misses in a multithreaded scenario.

sethrj commented 5 months ago

@esseivaju Is this with one track slot or the usual number (65k)? The reason I wondered about single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot. Since the many-slot case is not really optimal either (in terms of state cache locality and loop iterations skipped due to masking), I wonder whether the call graph would look any different...

esseivaju commented 5 months ago

This is with 4k track slots.

> single-thread performance not being optimal is that we saw a substantial performance gap between single-slot and many-slot,

Do you mean that in the single-thread case, you saw better performance with one track slot?

sethrj commented 5 months ago

OK, 4k track slots, which is different from our usual regression CPU setting. What does the performance graph look like if you have a single track slot? (Make sure OpenMP is disabled! 😅) I would imagine that with a single track slot you'd get better cache performance for the particle state, even though the cache performance for the "params" data might go down.
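
To spell out the tradeoff (a toy illustration with a made-up state layout, not the actual Celeritas state):

```cpp
#include <vector>

// Toy structure-of-arrays particle state, one entry per track slot
struct ParticleState
{
    std::vector<double> energy;
};

double sweep_energy(ParticleState const& state)
{
    // With one track slot this touches a single cache line, so the
    // particle state stays hot in L1 from step to step. With 65k slots
    // each call streams through the whole array, which keeps per-element
    // locality but evicts shared "params" tables between steps.
    double total = 0;
    for (double e : state.energy)
        total += e;
    return total;
}
```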

esseivaju commented 5 months ago

OK, I have some data with a single track slot. I had to set max_steps=-1, and OpenMP is disabled at build time. Without profiling, just running the regression problem takes ~3x longer with one track slot.

(figure: `callgrind_estimate_singletrack` call-graph screenshot)

Repeatedly calling ActionSequence::execute has a large overhead because of dynamic_cast and freeing memory. I haven't located what is being freed, but it happens exactly 20x per ActionSequence::execute call, so each action is doing it at some point.
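
A pattern that would produce exactly that signature, one dynamic_cast and one free per action per call, is something like this (a guess with hypothetical types, not the actual code):

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real action interfaces
struct ActionInterface
{
    virtual ~ActionInterface() = default;
    virtual std::string label() const = 0;  // allocates a fresh string
};
struct ExplicitActionInterface : ActionInterface
{
    virtual void execute() const = 0;
};

void execute_sequence(
    std::vector<std::shared_ptr<ActionInterface>> const& actions)
{
    for (auto const& action : actions)
    {
        // One heap allocation per action per call...
        std::string label = action->label();
        // ...and one dynamic_cast per action per call
        if (auto const* expl
            = dynamic_cast<ExplicitActionInterface const*>(action.get()))
        {
            expl->execute();
        }
        // `label` is freed here, matching the per-call free in the profile
    }
}
```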

Regarding cache efficiency, it isn't helping that much. Below, I'm showing the L1 cache misses per call to AlongStepUniformMscAction::execute (aggregate of instruction misses plus read/write misses), which is where most cache misses happen.

The first picture is the single-track-slot scenario; the second is 65k track slots. As expected, there are far fewer misses per call, since you process one track at a time; however, multiplied by how many more times the function has to be called, the total becomes far worse.

(screenshots: L1 misses per call, single track slot vs. 65k track slots)
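
(To put made-up numbers on it: even if the single-slot version missed 100x less per call, calling the action once per track instead of once per 65k-slot batch multiplies the call count by ~65,000, so total misses would still grow by a factor of ~650.)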

In both cases, ~80% of the L1 misses are instruction fetches.

sethrj commented 5 months ago

@esseivaju Looks like the allocation is coming from actions()->label being passed into ScopedProfiling. I'm opening a PR to use string_view for the action labels/descriptions and to delay string allocation in the scoped profiling implementation.
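
A minimal sketch of that kind of fix (hypothetical interface; the actual PR may differ): labels are viewed rather than copied, and the profiling string is only built if profiling is actually enabled.

```cpp
#include <string>
#include <string_view>
#include <utility>

class ScopedProfiling
{
  public:
    // Accept a callable so the label is materialized only when needed
    template<class F>
    explicit ScopedProfiling(F&& make_label)
    {
        if (ScopedProfiling::enabled())
        {
            this->begin(std::forward<F>(make_label)());
        }
    }
    ~ScopedProfiling()
    {
        if (active_)
        {
            // end the profiling range here
        }
    }

  private:
    static bool enabled() { return false; }  // e.g. check an env flag once
    void begin(std::string const& label) { active_ = true; /* start range */ }

    bool active_ = false;
};

// Usage: no std::string is allocated unless profiling is on
void run_action(std::string_view action_label)
{
    ScopedProfiling profile{[&] { return std::string{action_label}; }};
    // ... do the action's work ...
}
```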