The ECS system does a lot of swapping

graphitemaster commented 3 years ago

Been doing some performance profiling to see areas I could improve performance.

I've found that quite a bit of the time in simulation is spent in swapping components in the ECS framework. Is this strictly necessary? Note that swapping also requires looking up two hash tables to swap indices for source and destination too for each and every component. These lookups have an appreciable cost as well.

If there was no generational value on indices, hash tables wouldn't actually be needed since indices would be clean from 0. Though, even with generational values, masking it off and using the index would also work, meaning plain arrays would work.

Similarly, swapping of components in each system go through the whole process of allocating and constructing objects when in reality a std::swap() of src/dst is all that is necessary.

DanielChappuis commented 3 years ago

The idea is to store the components that are enabled (not sleeping) at the beginning of the array of components close to each other. Swapping of components only occurs when bodies move from sleeping to awake or the opposite in order to keep the awake bodies at the beginning of the array. Therefore, during the simulation, we can only loop over the components that are not sleeping at the beginning of the array. We do not need to loop over all the components of the world and test whether they are sleeping or not.

Here, we assume that each frame there are more bodies that are sleeping or are awake than bodies that are switching from awake to sleeping or the opposite because swapping should only occur when a bodies switches state.

Do you have any ideas how to improve this and avoid swapping ? If we store all the components continuously in the array without swapping, we need to loop over all the components each frame and check whether they are awake or asleep. This does not sounds good in a scene where a lot of bodies are sleeping.

graphitemaster commented 3 years ago

That is a good question. I don't have a good answer. Iterating just all the awake bodies's components does appear to be a the most cache efficient method, but the movement of all that data to produce that is inherently cold and cache thrashing.

What about something like this: Have a bitset for each component type where bit n (when set) indicates that component n has gone from either a sleeping state to a non-sleeping state, or vise-versa (thus it's been "affected"). When bodies transition sleep state we just mark this bitset instead of performing the component swaps.

At the beginning of the frame (before enumerating the components), if we can find a set bit in any of these bitsets, then we should consider swapping them, except using a better method here. We count the number of set bits proceeding (and including) it, and then do a contiguous swap on the whole span with a memmove instead. Here is a perfect use for popcount instruction as well.

I'd drop the hash tables here as well, it only needs to be an array that maps one index to another and since indices will start from 0, the hash is not necessary, that saves some time and it also saves the hash lookup which may involve traversing some linked lists. This also means the adjustment of the indices can be handled with the same memmove as well, with the only additional thing being an update of the src index.

At the end we just clear this bitset.

graphitemaster commented 3 years ago

To provide an example of what I mean,

T array[N]; // some array for components of type T
BitSet sleep; // bitset of N bits indicating when component has gone to sleep.
BitSet awake; // bitset of N bits indicating when component has been made awake.

// all this has to do is set and clear bits in each set, respectively, nothing actually moves around in memory
// when it happens within a step, so a body going to sleep and awake then back to sleep is still only one
// move instead of three.
void do_awake(...) {
  sleep.clear(component.index);
  awake.set(component.index);
}
void do_sleep(...) {
  sleep.set(component.index);
  awake.clear(component.index);
}

// every step
int index = sleep.find_first_set_bit(); // index of first put to sleep component
if (index == 0) return; // nothing to do, no components need to move

int offset = 0;
while (index < N) {
  // count number of consecutive set bits from index.
  int count = sleep.count_set_bits_from(index);
  // move the aggregate of count elements from awake to sleep.
  memmove(array + offset, array + index, count * sizeof(T));
  offset += count;
  // find the next one, should return N (i.e end) when exhausted.
  index = sleep.find_next_set_bit_from(index);
}

// now [offset, N) contains possibly-awake components, while [0, offset) are asleep.
sleep.clear(); // make all the bits 0

// now we need to do the opposite  but only from [0, offset) as these are "asleep"
index = awake.find_first_set_bit(); // index of first to awaken component
start = 0;
while (index < offset) {
  int count = awake.count_set_bits_from(index);
  // move the aggregate of count elements from sleep to awake.
  memmove(array + offset - start, array + index, count * sizeof(T));
  start += count;
  index = awake.find_next_set_bit_from(index); // find the next one
}

// now we have
// [0, start) contains sleeping components
// [start, N) contains awake components

awake.clear(); // make all the bits 0

The main idea behind this is that bodies going to sleep and awake within a step are only ever moved once, rather than multiple times (as it appears is the case, correct me if I'm wrong.) Similarly, this avoids the need to move components one at a time and takes advantage of the fact that there is likely multiple consecutive components that can be moved together in unison using memmove here which is often instruction-level parallelized. Another benefit of this is that this maintains an invariant the current method does not and that is groups of transitioned components maintain order. That's extremely useful as it means an ordered array mapping indices to components can just be updated by moving the elements of the array values like we do the components here, removing the need for a hash table.

I also want to point out that this is somewhat backwards from the current setup as we prefer to put the awake components in the top halve of the array, as opposed to the bottom halve of the array. This is actually a good thing because the simulation asymptotically approaches a resting state, meaning you'll have more sleeping components than awake components, so the distance that one needs to move the data is reduced overall, which should improve cache efficiency.

graphitemaster commented 3 years ago

I also want to point out this BitSet type can be implemented very efficiently because operations like find first set have native instructions (check out popcount, _BitScanForward, _BitScanReverse, __builtin_popcount, etc). Those can only operate within a word, and a bitset will need to be composed of multiple words. So you can find the indices, count the spans, etc, all very fast, faster than enumerating an array and checking them.

DanielChappuis / reactphysics3d

The ECS system does a lot of swapping #199