Search for optimizations in the AI processing speed

abmpicoli2 commented 6 years ago

@Nightinggale comment: I figure the time spent by the AI is the time we should consider when optimizing. In most cases the game is fast enough during your turn and the slow code like getting yield production from a plot with modifiers is also used by the AI

@ShadesOT comment: That is something I noticed while studying the code there are functions of magnitude n^2 or worse by n^2 I mean o (n^2) well, maybe it is too early to say something like o(n^2), i did not study the code there in depth.

abmpicoli2 commented 6 years ago

Maybe we could narrow the scope and make different tickets for different AI codes?

ShadesOT commented 6 years ago

;-D

devolution79 commented 6 years ago

This is a big topic and I have a fairly comprehensive list of things to improve, what we could borrow / steal from other mods (K-Mod, C2C) and algorithms that could be entirely rewritten. I also have a private build that used Intel TBB (A task parallelization library) which is similar to Java's fork-join and I parallelized 4 or 5 subsystems and achieved pretty decent performance gains.

Since the scope is so big, I think we need to find a way to discuss them in a more structured matter so that we don't get overwhelmed. I know that @Nightinggale is a "speed freak" and loves optimizing things as well, so I'm sure he's brimming with ideas too :D

Nightinggale commented 6 years ago

I have implemented the most important idea regarding optimization, which is to allow the Makefile to make a profiling DLL. This allows profiling each line to identify which parts of the code slows down the game the most. It might seem obvious if you know about profiling, but when I started modding, nobody had managed to make a DLL with proper profiling support.

The first speed improvement I made was caching yields needed for professions. The cache is calculated when starting up and only needs to be recalculated when a Founding Father joins a team (triggers recalculation for each player on the team). My test savegame went from 40 to 33 seconds just from adding that cache. It was actually a performance issue in vanilla and no modder is to blame.

The problem with ideas with such a massive improvement is that I already worked on most of them. One thing I haven't done much about (if anything) is inline get functions. Right now there are a lot of get functions in header files and then the cpp files contains nothing but return a certain variable. I tried adding such functions as inline in the header file for GC and the resulting DLL was smaller by a significant amount (I think it was 8 kB or something. Surprisingly massive). I can only conclude that the reduced size is removing the function calls and change it to read a variable at address (pointer address + compiler set constant), though I haven't examined the compiled DLL to verify this in assembler.

This means the compiler is able to inline from the headers, but the claim that it's able to detect the simple functions and inline from cpp at link time is simply incorrect (one of the many incorrect facts posted on the forum). We can inline the simple get functions and optimize out the function calls and we will get a smaller DLL file as a bonus (though I only really care about the performance aspect, but the size is a bonus).

If we decide to go with the approach to inline, we should agree on a coding style for doing so. We should not sacrifice header file readability for this.

devolution79 commented 5 years ago

I recently spent some investigating the Rise of Mankind: A New Dawn 2 DLL. It is very similar to C2C (one of them is technically a fork of the other...)

However, while profiling I noticed how efficient the AI pathfinding seemed to be. On huge maps, even in the late game it seemed to only account for about 20% or so of the CPU usage! This is quite impressive to say the least :)

After digging a bit deeper, I noticed that Koshling (the guru programmer of C2C) had implemented a kind of cache that the unit AI could query prior to making the actual pathfinding call.

The cache I am talking about is the class CvReachablePlotSet. The way it works is that rather than looping though all the plots in a given range, and then attempting to find a path there (filtered and scored by various criteria), a cache is built by visiting all the adjacent plots in a "ring-like" fashion. The first ring is 8 plots, the next is 16 and so on (it grows just like the city's outer radius / fat cross, albeit with corners included).

It turns out that by querying this cache, a large amount of pathfinding calls can simply be eliminated \ culled. I think is reasonable to assume that merging this into WTP would lessen the pressure on the pathfinder significantly. Of course it would be even better if we also merge the K-Mod pathfinder (I am not convinced that Koshling's pathfinder is better, besides it also seems to have some bugs since I am getting PF asserts when running AND2)

We can actually do even better than this!

The current implementation of CvReachablePlotSet is not even fully optimized:

It uses recursion, but there is no reason why it should not be straight iteration. Recursion just makes it harder for the optimizer to eliminate the (stack) frame creation and it prevent proper inlining of the function body.
It stores the plots it has visited in a hashmap and in the inner loop it checks if a new plot should be added by searching the hashmap (in the profiler I can see that a binary search is performed). This is redundant since we should be able to cull plots based on the ring's X and Y coordinate cutoff.
The plot search order does not seem to be particularly cache efficient since plotDirection is used and the plots returned are not ordered in an ascending and\or pre-fetcher friendly fashion.
If the ring size is below a certain margin, it may be better to just use a std::vector or even a fixed array. The search can be implemented with SIMD (SSE integer intrinsics) which will probably demolish the hashmap in terms of performance due to cache friendliness unless the set becomes too large (and unless we write a custom allocator for the hashmap)

raystuttgart commented 1 year ago

This is outdated as we already now have Multi-Threading. Performance improvements are on the todo list anyways. Thus closed.

We-the-People-civ4col-mod / Mod

Search for optimizations in the AI processing speed #5