Optimize Mass Processor Performance

Leroy231 commented 1 year ago

General

[ ] Audit every place LineTraceSingleByChannel and related methods are used to see whether we need the hit result or just a bool; where a bool is enough, switch to LineTraceTestByChannel and related methods
[ ] Reduce number of actors in DA_MassSoldier_Team1 and DA_MassSoldier_Team2?
[ ] Consider optimizing Mass processors in multiplayer by taking shortcuts when no human players are nearby, i.e. simulation LOD would be calculated based on multiple observers instead of just one
[ ] Use the VTune cookbook recipe at https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/tuning-recipes/frequent-dram-accesses.html to find code that frequently hits RAM instead of using the CPU cache
[ ] If memory usage becomes a concern, audit fragments and change them so that we add and remove fragments that won’t be on entities for a long time, e.g. FTargetEntityFragment could be added when we detect a target and removed afterwards; then do we even need a tag to track having a target, or can we just rely on the presence of the fragment?

Entity Spawning

[ ] Spawning lots of projectiles each tick is slow. It might be faster to batch all projectile spawning within a tick, e.g. in FireProjectileTask add to a queue or array and then have some code that runs at the end of the tick and spawns them in a single call to SpawnEntity for each EntityTemplateID. Maybe a subsystem. Make it generic so it can be reused for effect entities too.
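A minimal standalone sketch of the batching idea above, not tied to the Mass API. `TemplateId`, `SpawnParams` and `BatchedSpawnQueue` are hypothetical names, and the `SpawnBatch` callback stands in for whatever single spawn call per template the real subsystem would make.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Mass types.
using TemplateId = int;
struct SpawnParams { float X; float Y; float Z; };

class BatchedSpawnQueue {
public:
    using BatchFn = std::function<void(TemplateId, const std::vector<SpawnParams>&)>;

    // Called from tasks during the tick; only records the request.
    // Not thread safe as written: parallel callers would need a lock
    // or per-thread queues.
    void Enqueue(TemplateId Template, const SpawnParams& Params) {
        Pending[Template].push_back(Params);
    }

    // Called once at end of tick: one SpawnBatch call per template.
    // Returns the number of batch calls made.
    std::size_t Flush(const BatchFn& SpawnBatch) {
        std::size_t Calls = 0;
        for (const auto& Entry : Pending) {
            if (!Entry.second.empty()) {
                SpawnBatch(Entry.first, Entry.second);
                ++Calls;
            }
        }
        Pending.clear();
        return Calls;
    }

private:
    std::unordered_map<TemplateId, std::vector<SpawnParams>> Pending;
};
```

Because the queue is keyed only by template ID and opaque parameters, the same type could serve projectiles and effect entities. Whether this beats per-task spawning would still need to be measured.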

MassStateTreeProcessor

See https://github.com/HaywireInteractive/OnAllFronts-Public/issues/481

UMassCoverQueryProcessor

[ ] Always fail cover for off LOD?
[ ] Switch to ParallelForEachEntityChunk. UEnvQueryManager is not thread safe, but perhaps can create an instance for each parallel call. Note that variable tick with ParallelForEachEntityChunk is broken in 5.0.3 and 5.1. Perhaps we should switch to ParallelFor.

UMassScreensizeLODCollectorProcessor

[ ] Once we're on UE 5.2 and ParallelForEachEntityChunk is available and doesn't crash for queries with chunk filters, we should switch this processor to use ParallelForEachEntityChunk. If that's not feasible, we can switch to ParallelFor.

UMassNavMeshMoveProcessor

[ ] Reduce number of RAM reads required during processor execution possibly by caching data in fragments
[ ] What if a squad were a single Mass entity with an array of all the data needed, e.g. an object per soldier with transform, move target, etc.? Then all processors would need to be updated to handle squads correctly. Alternatively, can we pack all squad entities in order within a single chunk so that we don't have to hit RAM?
[ ] Instead of checking if entity is Squad leader or squad member via EntityView RAM access, set Mass tags

UMassSoundPerceptionSubsystem

[ ] UMassSoundPerceptionSubsystem: grid cell size may not be ideal
[ ] FSoundPerceptionHashGrid2D: 2 levels of hierarchy, 4 ratio between levels may not be ideal

UMassAudioPerceptionProcessor

[ ] Make variable tick?
- [ ] We'd just need to also adjust this value here so that sounds stay alive more than just 2 frames: https://github.com/LeroyTechnologies-Org/ProjectM-Private-Staging/blob/1b181e4b864e00b869de58c83f5bc3d382189c96/Source/ProjectM/Public/MassSoundPerceptionSubsystem.h#L43
- [ ] Add logic so that a soldier never "processes" the same sound more than once, maybe by keeping the LastProcessedSoundID in a fragment
[ ] Is there a low level line trace function that doesn’t require locking which may be faster when doing parallel line traces?
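One way the LastProcessedSoundID idea above could look, as a standalone sketch. It assumes sound IDs start at 1 and increase monotonically as sounds are created, which the issue does not state; `Sound`, `SoundMemoryFragment` and `ProcessNewSounds` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only.
struct Sound { std::uint64_t Id; };
struct SoundMemoryFragment { std::uint64_t LastProcessedSoundId = 0; };

// Processes only sounds newer than the last one this soldier handled and
// returns how many were newly processed.
// Assumption: IDs start at 1 and grow monotonically with creation order.
inline int ProcessNewSounds(SoundMemoryFragment& Memory, const std::vector<Sound>& Audible) {
    int Processed = 0;
    std::uint64_t Newest = Memory.LastProcessedSoundId;
    for (const Sound& S : Audible) {
        if (S.Id <= Memory.LastProcessedSoundId) {
            continue; // already handled on an earlier tick
        }
        ++Processed; // real perception logic would go here
        if (S.Id > Newest) {
            Newest = S.Id;
        }
    }
    Memory.LastProcessedSoundId = Newest;
    return Processed;
}
```

A known limitation of storing a single ID: an older sound that only becomes audible after a newer one was processed (for example because the soldier moved into range) would be skipped.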

UMassNavigationSubsystem

[ ] UMassNavigationSubsystem: has a grid cell size of 2.5m, which might be too small to fit the tank agent radius, so all tanks go into the spill list

UMassEnemyTargetFinderProcessor

[x] Instead of using the Masked Occlusion Culling algorithm, consider splitting the 90 degree FOV into buckets of e.g. 10 degrees: bucket every friendly into an array where the index is the degree bucket and the value is the closest friendly distance. If a friendly is very close, put it in multiple buckets. Then when we consider each enemy, we check whether there is a closer friendly in that enemy's bucket and if so, skip it.
[ ] Can we filter PostSphereTraceEntityQuery somehow to only include entities that had sphere trace, e.g. via a Mass tag?
[ ] UMassEnemyTargetFinderProcessor: Would it be faster to measure distance to each unhittable entity and ignore those that are farther than found enemy target? Add a flag and measure
[ ] UMassEnemyTargetFinderProcessor: Would it be faster to have two separate THierarchicalHashGrid2D for each team, and only scan those for enemies? That way teammates are already ignored. Would still have to scan same team's grid for obstacles in the way though.
[ ] Instead of FCollisionQueryParams::DefaultQueryParam, use FCollisionQueryParams with bTraceComplex=false. See how we do it in DoLineTraces() in MassAudioPerceptionProcessor.cpp.
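The FOV bucketing item in this list can be sketched in 2D as follows. This is an illustration under assumptions, not the project's implementation: `Vec2`, `FriendlyBuckets`, `BucketFor` and the `NearDistance` threshold are hypothetical, and a "very close" friendly is simply spread into the two neighboring buckets.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <limits>

struct Vec2 { float X; float Y; };

constexpr float Pi = 3.14159265f;
constexpr float FovDegrees = 90.f;
constexpr float BucketDegrees = 10.f;
constexpr int NumBuckets = 9; // 90 / 10

inline float Distance(Vec2 A, Vec2 B) { return std::hypot(B.X - A.X, B.Y - A.Y); }

// Bucket index of Target as seen from Origin facing ForwardDegrees,
// or -1 if Target is outside the FOV.
inline int BucketFor(Vec2 Origin, float ForwardDegrees, Vec2 Target) {
    const float Angle = std::atan2(Target.Y - Origin.Y, Target.X - Origin.X) * 180.f / Pi;
    float Rel = std::fmod(Angle - ForwardDegrees, 360.f);
    if (Rel >= 180.f) Rel -= 360.f;
    if (Rel < -180.f) Rel += 360.f;
    const float FromEdge = Rel + FovDegrees * 0.5f; // [0, 90) inside the FOV
    if (FromEdge < 0.f || FromEdge >= FovDegrees) return -1;
    return std::min(NumBuckets - 1, static_cast<int>(FromEdge / BucketDegrees));
}

class FriendlyBuckets {
public:
    FriendlyBuckets(Vec2 InOrigin, float InForwardDegrees)
        : Origin(InOrigin), ForwardDegrees(InForwardDegrees) {
        Closest.fill(std::numeric_limits<float>::max());
    }

    // Records the friendly's distance in its bucket; friendlies closer than
    // NearDistance are also written to the neighboring buckets.
    void AddFriendly(Vec2 Position, float NearDistance) {
        const int Bucket = BucketFor(Origin, ForwardDegrees, Position);
        if (Bucket < 0) return;
        const float Dist = Distance(Origin, Position);
        const int Spread = Dist < NearDistance ? 1 : 0;
        const int First = std::max(0, Bucket - Spread);
        const int Last = std::min(NumBuckets - 1, Bucket + Spread);
        for (int I = First; I <= Last; ++I) {
            Closest[I] = std::min(Closest[I], Dist);
        }
    }

    // True if a friendly in the enemy's bucket is closer than the enemy.
    bool IsBlocked(Vec2 Enemy) const {
        const int Bucket = BucketFor(Origin, ForwardDegrees, Enemy);
        return Bucket >= 0 && Closest[Bucket] < Distance(Origin, Enemy);
    }

private:
    Vec2 Origin;
    float ForwardDegrees;
    std::array<float, NumBuckets> Closest;
};
```

The check is conservative by design: any closer friendly in the same 10 degree bucket rejects the enemy, even if it does not actually block the line of fire.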

UInvalidTargetFinderProcessor

[ ] UInvalidTargetFinderProcessor: Currently when entity and target entity are offset a lot by x and y coordinates, we search a large box unnecessarily. Instead we could break down search into lots of small boxes covering the line between entity and target entity. Would need to measure if this is actually faster though, as running many small queries might not be a win.
[ ] UInvalidTargetFinderProcessor: Use ParallelForEach. Need a queue for stuff done in ProcessEntity().
[ ] UInvalidTargetFinderProcessor: Break search into phases like UMassEnemyTargetFinderProcessor if needed.
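To make the "many small boxes along the line" idea from the first item concrete, here is a standalone 2D sketch that splits the segment between entity and target into equal pieces and bounds each piece with a padded box. `Vec2`, `Box2`, `PaddedBounds` and `BoxesAlongSegment` are hypothetical helpers, and the padding stands in for whatever agent radius the real query needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 { float X; float Y; };
struct Box2 { float MinX; float MinY; float MaxX; float MaxY; };

// Axis-aligned box around the segment A-B, grown by Pad on every side.
inline Box2 PaddedBounds(Vec2 A, Vec2 B, float Pad) {
    assert(Pad >= 0.f);
    return Box2{ std::min(A.X, B.X) - Pad, std::min(A.Y, B.Y) - Pad,
                 std::max(A.X, B.X) + Pad, std::max(A.Y, B.Y) + Pad };
}

inline float Area(const Box2& B) { return (B.MaxX - B.MinX) * (B.MaxY - B.MinY); }

// Splits A-B into NumBoxes equal pieces and bounds each piece separately.
inline std::vector<Box2> BoxesAlongSegment(Vec2 A, Vec2 B, int NumBoxes, float Pad) {
    NumBoxes = std::max(1, NumBoxes);
    std::vector<Box2> Boxes;
    Boxes.reserve(static_cast<std::size_t>(NumBoxes));
    for (int I = 0; I < NumBoxes; ++I) {
        const float T0 = static_cast<float>(I) / static_cast<float>(NumBoxes);
        const float T1 = static_cast<float>(I + 1) / static_cast<float>(NumBoxes);
        const Vec2 P0{ A.X + (B.X - A.X) * T0, A.Y + (B.Y - A.Y) * T0 };
        const Vec2 P1{ A.X + (B.X - A.X) * T1, A.Y + (B.Y - A.Y) * T1 };
        Boxes.push_back(PaddedBounds(P0, P1, Pad));
    }
    return Boxes;
}
```

For a diagonal from (0, 0) to (100, 100) with a padding of 1, the single box covers 102 x 102 units, while ten boxes of roughly 12 x 12 cover about a seventh of that area. Adjacent boxes overlap by the padding, so the same entity can be returned by two queries and results would need deduplicating.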

UMassProjectileDamageProcessor

[ ] Could we avoid having to use FMassEntityView which is slow since it requires RAM access? We could store more data in the 2D hash grid. Data needed: Capsule (collision), FTeamMemberFragment (for adding sound perception), FTransformFragment (for splash damage), FMassHealthFragment (for dealing damage), FProjectileDamagableFragment (for deciding if can deal damage), FMassPlayerControllableCharacterTag (for handling player death).

UMassTargetGridProcessor

[ ] Parallelize some of the work, e.g. finding which entities need to be updated
[ ] Variable tick?

Optimizations attempted that regressed performance

MassEnemyTargetFinderProcessor: Use ParallelFor instead of ParallelForEachEntityChunk. This makes things worse because there are a lot of chunks which means this creates tons of small ParallelFor calls with extra overhead for each.
MassEnemyTargetFinderProcessor: Use ParallelFor within ParallelForEachEntityChunk. See perf benefit of reverting this here: https://github.com/LeroyTechnologies/ProjectM/commit/d7ec42d6cfc3b6a60a152b24782dacc453fb964c
MassEnemyTargetFinderProcessor: In AreEntitiesBlockingTarget use ParallelFor. Slower with ParallelFor because in regular for we could short circuit the loop.
- ParallelFor: I.Max 29.77, I.Avg 3.97, I.Med 2.78, I.Min 0.41
- regular for: I.Max 8.53, I.Avg 1.32, I.Med 1.18, I.Min 0.07

aelmod commented 1 year ago

Hey! Any chance to access the updated repo? I like the idea of the project and I deal with MassEntity and would be interested in contributing as well