FormularSumo / Star-Wars-Galaxy-Collection

A remake of the discontinued Star Wars Force Collection game https://formularsumo.github.io/Star-Wars-Galaxy-Collection-Web/ https://play.google.com/store/apps/details?id=com.formularsumo.starwarsforcecollectionremake.embed
GNU Affero General Public License v3.0

Load some images off the main thread to reduce frame drops #81

Closed FormularSumo closed 3 months ago

FormularSumo commented 3 months ago

Currently the biggest performance issue for the game is loading lots of images at once, which even on very fast devices can cause noticeable frame drops/stuttering. Love2d allows for creating additional threads which can be used to load image data and pass this back to the main thread to then draw.

The main place where these frame drops are noticeable is when changing pages quickly in the deck editor, and to a lesser extent when loading the deck editor and battles. My initial idea is to offload all character card loading to a separate thread and, while these images have not yet been loaded, show a placeholder image in their place. This image could be the existing blank card or a variation which indicates loading.
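A minimal sketch of the placeholder idea, written in Python purely to show the structure (the game itself is Love2D/Lua, where the two queues would presumably be thread channels). `decode`, `CardImages` and the `"blank_card"` value are hypothetical names for illustration, not code from the game.

```python
import queue
import threading

PLACEHOLDER = "blank_card"  # stand-in for the existing blank card image


def decode(path):
    # Hypothetical stand-in for the expensive image decode step.
    return "decoded:" + path


def worker(requests, results):
    # Runs off the main thread: take a path, decode it, hand the result back.
    while True:
        path = requests.get()
        if path is None:  # sentinel: no more work
            break
        results.put((path, decode(path)))


class CardImages:
    def __init__(self):
        self.requests = queue.Queue()
        self.results = queue.Queue()
        self.loaded = {}
        self.thread = threading.Thread(
            target=worker, args=(self.requests, self.results), daemon=True
        )
        self.thread.start()

    def request(self, path):
        # Queue an image for background decoding.
        self.requests.put(path)

    def poll(self):
        # Called once per frame on the main thread; never blocks.
        while True:
            try:
                path, image = self.results.get_nowait()
            except queue.Empty:
                return
            self.loaded[path] = image

    def get(self, path):
        # Return the placeholder until the real image has arrived.
        return self.loaded.get(path, PLACEHOLDER)

    def close(self):
        # Ask the worker to stop and wait for it to finish.
        self.requests.put(None)
        self.thread.join()
```

The main thread only ever calls `request`, `poll` and `get`, none of which wait on decoding, so a slow image delays when the card appears rather than stalling the frame.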

Once the above has been implemented it'd be worth measuring performance again and, if necessary/desired, creating multiple threads and sharing the work through a thread pool/queue. One idea is to create as many threads as the user's system has, minus 1 or 2, to take maximum advantage of their hardware.
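As a rough sketch of the thread pool/queue idea: several workers pull jobs from one shared queue, and the worker count is the number of logical threads minus a reserve. Python is used only to show the structure (its threads would not give the same CPU-bound speedup as native threads), and `worker_count` and `run_pool` are hypothetical names.

```python
import os
import queue
import threading


def worker_count(logical_threads=None, reserve=1):
    # "As many threads as the system has, minus 1 or 2", but never fewer than 1.
    if logical_threads is None:
        logical_threads = os.cpu_count() or 1
    return max(1, logical_threads - reserve)


def run_pool(jobs, decode, workers):
    # Share the jobs between `workers` threads through a single queue.
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    results = {}
    lock = threading.Lock()

    def work():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            value = decode(job)
            with lock:
                results[job] = value

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every worker takes the next job from the same queue, a slow image on one thread does not hold up the others.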

Finally, it'd be good to try extending this approach to projectiles and weapons. Placeholder images probably aren't needed for these as they should always be loaded in the second before a battle starts (or a few seconds before they're needed), but this can be further tested. It's also worth testing whether this actually makes much difference to the overall load time, but I expect a small one.

I think backgrounds and UI elements should remain loaded on the main thread. As seen when the game initially loads, loading these has barely any impact on load times. It's only when lots and lots of images are loaded at once that it becomes a problem. Additionally, not loading these before loading a new state would mean it would be blank for a time which is probably not a good look, and a very short delay gives some time for other images to be loaded on these new threads, which should also look better/more seamless, especially on slower hardware.

Overall I think by doing this the last fixable performance issue with the game can be resolved, and card loading in general might even get noticeably faster if work can be effectively split between threads.

Other than this, drawing images has by far the biggest performance impact (according to profiler.lua), at around 5x that of game logic iirc. And image drawing can't be moved off the main thread as it needs to happen together, and can't really be optimised further (potentially a bit using canvases but I think this would be a small difference given Love2d already batches draw calls) (also by creating a low resolution version of the game but this would require a lot of reworking and 1080p is already reasonably low). So there wouldn't really be any point at the moment at least in trying to parallelise game logic or otherwise optimise it (something I have tried quite hard to do in the past), as it does already run very very quickly.

FormularSumo commented 3 months ago

Performance testing

Now that I have a working multithreaded approach I thought it worth testing properly to see the performance impact.

FPS Drops

One thing that is immediately noticeable is that the slight FPS drops that used to happen when quickly flicking through pages in the deck editor no longer happen. On the other hand, there is a very short delay before the images appear, which causes a sort of flashing. This is why I'm considering loading all the images at once.

Load time/memory usage

Another thing I tested was creating many threads to see how this impacted loading performance. On my (fairly slow) laptop, spinning up 1600 threads caused a roughly 2-3x increase in initial load time and a 5.7 MB increase in memory usage (88.5 --> 94.2). Approximating this for 4 threads (/399) would produce a 0.627% increase in load time and a 0.014 MB increase in memory usage, which I think is perfectly acceptable. For 16 threads (the highest number that most consumer devices would have) (/99.75) this would be 2.5% and 0.057 MB. Slightly more, but these would also tend to be higher-end devices where it doesn't matter as much.
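The extrapolation can be reproduced as plain arithmetic. This sketch assumes the overhead scales linearly with thread count and takes 250% as the load time increase (the midpoint of the reported 2-3x, which is what the quoted percentages imply); the divisors are the ones given in the comment.

```python
# Measured with 1600 threads on the laptop (figures from the comment above).
LOAD_TIME_INCREASE_PCT = 250.0    # assumption: midpoint of the reported 2-3x
MEMORY_INCREASE_MB = 94.2 - 88.5  # 5.7 MB


def scale_down(divisor):
    """Divide both measured overheads by the same factor."""
    return LOAD_TIME_INCREASE_PCT / divisor, MEMORY_INCREASE_MB / divisor


for label, divisor in (("4 threads", 399), ("16 threads", 99.75)):
    load_pct, memory_mb = scale_down(divisor)
    print(f"{label}: +{load_pct:.3f}% load time, +{memory_mb:.3f} MB")
```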

Image decoding time

Finally I tested decoding 10000 images on different thread combinations to see whether adding more threads made this faster. 10000 is obviously overkill, as the game currently wouldn't ever load more than ~550 (the number of characters), but it was helpful to see the impact and make testing more reliable.

Key - Threads: Seconds

I tested first on my PC (1600x, 6c/12t) (Linux AppImage)
12: 1.05
11: 1.04
9: 1.17
6: 1.53
3: 2.70
2: 4.00
1: 8.00

And just for fun (as expected, more threads than physical threads is counter-productive)
24: 1.13
48: 1.17
96: 1.22

Running on the main Love2D thread (causes the whole program to freeze for the duration, which is what this is all for solving)
7.63

And on my laptop (8100y, 2c/4t) (Windows)
4: 9.79
3: 10.21
2: 11.08
1: 15.5

8: 11.23
16: 11.60
32: 12.38

Running on main thread: 12.11
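As a rough check on how well decoding scales, the timings above can be turned into speedups relative to a single worker thread, plus a parallel efficiency (speedup divided by thread count). This is only arithmetic on the reported numbers; `scaling` is a hypothetical helper name.

```python
# Seconds to decode 10000 images, keyed by worker thread count (from the lists above).
PC_SECONDS = {12: 1.05, 11: 1.04, 9: 1.17, 6: 1.53, 3: 2.70, 2: 4.00, 1: 8.00}
LAPTOP_SECONDS = {4: 9.79, 3: 10.21, 2: 11.08, 1: 15.5}


def scaling(timings):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in sorted(timings.items())}


for name, timings in (("PC", PC_SECONDS), ("Laptop", LAPTOP_SECONDS)):
    print(name)
    for n, (speedup, efficiency) in scaling(timings).items():
        print(f"  {n:>2} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

On these figures the PC gets close to proportional scaling up to its 6 physical cores, while the laptop gains far less from each extra thread.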

Conclusion

It seems that creating as many threads as the device has leads to the quickest decoding times. I think it's theoretically more efficient as well, as CPUs use proportionally less power to turn on/use more cores, especially when that causes maximum frequency to be lowered slightly. The only concern I have with using all the system's threads is that it could impact the performance of the main Love2D thread, as presumably it has to compete with one of these threads for CPU time. In my testing it doesn't seem to make a difference, so the main Love2D thread's work is presumably prioritised above the others. The game is just as responsive as normal regardless of how many threads (if any) are carrying out this large decoding task in the background.

Interestingly, on my laptop Windows task manager reports 100% CPU usage regardless of how many threads are running the task. And it doesn't seem to have the same performance scaling as my PC (which is fairly linear). Maybe the Windows version of Love2D does some kind of automatic parallelism for image decoding that the Linux version doesn't? On my PC, system resources reports CPU usage as being exactly what you'd expect: 1 thread = 1 virtual core (thread) running at maximum usage, 3 = 3, 6 = 6, 12 = 12 etc. So there's definitely something to be said for not using them all, in order to avoid slowing down anything else on the PC. Again, I think these extra threads created in Love2D are fairly low priority as, surprisingly, they don't seem to affect anything, but I think it might be worth sticking to the "1 less than system threads" rule for now just in case, especially as the difference between that and using all of them is quite small. I might test again later to see how it compares when it's also being used in different places and on other platforms (Android/Web).

FormularSumo commented 3 months ago

Following on from the above investigation, file writing is only done when sorting decks or when swapping cards, both of which have no frame drops now (at least on my hardware, wants testing on weaker devices/web). Loading the deckeditor still causes a few frames to drop so investigating that more:

(Sandbox on) loading deckeditor
Total time - 0.045
Loading background - 0.0252
Sorting - 0.00723 (+0.0003 for other loadCards code)
reloadDeckeditor - 0.0023
Loading deck file and blank card - 0.002
Evolution icons - 0.0018

Disabling background loading - 0.01-0.02 seconds, no/1 frame drop. So most of the performance impact now is loading the background, which probably wouldn't look very good running off-thread. But maybe could be loaded after initial game-load, off-thread, and then just stored in the meantime. Same could be done for file loading potentially.
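To put "most of the performance impact" into numbers, each measured step can be expressed as a share of the 0.045 s total. This is just arithmetic on the figures above; the dictionary keys mirror the labels in the breakdown.

```python
# Seconds per step when loading the deck editor (sandbox on), from the breakdown above.
TOTAL_SECONDS = 0.045
STEP_SECONDS = {
    "Loading background": 0.0252,
    "Sorting": 0.00723,
    "reloadDeckeditor": 0.0023,
    "Loading deck file and blank card": 0.002,
    "Evolution icons": 0.0018,
}

# Share of the total load time taken by each step, as a percentage.
shares = {name: secs / TOTAL_SECONDS * 100 for name, secs in STEP_SECONDS.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.0f}%")
```

The background alone accounts for a little over half of the total, which is consistent with the load dropping to 0.01-0.02 seconds once it is disabled.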

FormularSumo commented 3 months ago

All deck editor card images are now queued to be created straight away, although visible cards are decoded first, then the rest. This has got rid of the flashing that could be seen while changing pages in the deck editor, which is great. It has however increased memory usage by about 50 MB from before (~70 --> 120) (in sandbox). This amount of memory would be used anyway once all the pages had been viewed, but now it's used all the time. Overall an improvement. However I'm wondering if there's a middle ground of loading just a page either side of the current one (like how battles load one row offscreen in advance), which would reduce memory usage to near previous levels while still preventing flashing in most cases and avoiding slowing down the main thread. Potentially alongside this, images could be deleted once no longer needed (this could also be done in battles). With all that said, 120 MB isn't a lot for a game and there is an advantage to keeping images loaded at least for a bit: less CPU used loading them again and again. This is why I opted for my previous approach of loading as necessary and storing until the state changes. Maybe I'll experiment a bit more and see what works best.
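Both orderings discussed here can be sketched with page indices alone. This is a hypothetical illustration in Python, not the game's code: `load_order` queues the visible page first and then works outwards, and `pages_to_keep` gives the "one page either side" window, outside of which images could be freed.

```python
def load_order(page_count, current_page):
    """Page indices nearest-first: the visible page, then its neighbours outwards."""
    return sorted(range(page_count), key=lambda page: (abs(page - current_page), page))


def pages_to_keep(page_count, current_page, radius=1):
    """Pages within `radius` of the current one; anything else could be evicted."""
    first = max(0, current_page - radius)
    last = min(page_count - 1, current_page + radius)
    return set(range(first, last + 1))


# Example: 5 pages, currently viewing page index 2.
print(load_order(5, 2))
print(sorted(pages_to_keep(5, 2)))
```

A larger `radius` trades memory for fewer placeholder flashes when flicking through pages quickly.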

FormularSumo commented 3 months ago

Having done all this work, and with only lazy evolution images loading to be done, these are the current load time test results (average of 3 runs, on my 8100y laptop, from start of state init to end of state enter):

Auto deck vs Maxed, videos on 0.05257 --> 0.03201 = 39.1% reduction

Auto deck vs Geonosis, videos off (unreliable loading time) 0.05382 --> 0.03120 = 42.0% reduction

Deckeditor, videos on 0.0999 --> 0.0681 = 31.8% reduction

On my laptop there were very rarely frame drops loading battles (except videos, which as mentioned above, can be pretty unreliable), so this isn't really noticeable there, but it's still an improvement. And for people with slower hardware (and web where performance is a bit worse), this may well make a difference. Meanwhile, the deck editor did and still does cause some frame drops so this is a more noticeable improvement there which is nice.

FormularSumo commented 3 months ago

Performance moving large evolution images off-thread and no longer loading deck file

Laptop, loading deck editor, videos on, average of 3 runs

0.06450 --> 0.05752 = 10.8% reduction

Compared to before any multithreading-branch changes: 0.0999 --> 0.05752 = 42.4% reduction
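The percentages in these comments follow the usual relative-reduction formula, (before - after) / before. A small helper (hypothetical name `reduction_pct`) reproduces the two figures above:

```python
def reduction_pct(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Deck editor load times in seconds, from the comment above.
print(f"{reduction_pct(0.06450, 0.05752):.1f}% vs previous measurement")
print(f"{reduction_pct(0.0999, 0.05752):.1f}% vs before any multithreading changes")
```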

FormularSumo commented 3 months ago

Time spent loading deckeditor now

Still on laptop, videos on, average of 3 runs, sandbox

Creating background - 0.0412
Loading/creating cards - 0.00886 (loadCards - 0.00911, reloadDeck (includes updateCardsOnDisplay) - 0.001183)
Creating GUI - 0.00422
Creating small evolutions - 0.00147
Queueing image creation - 0.000902

FormularSumo commented 3 months ago

Time spent on loadCards

On PC now, videos on, average of 3 runs, sandbox

Sorting - 0.00780
Creating P1cards table - negligible (~0.00014)
Creating P1DeckList - negligible (~0.000005)

FormularSumo commented 3 months ago

Sorting having optimised characterStrength and using local variables

PC, videos on, average of 3, sandbox

0.00762 --> 0.00665 = 12.7% reduction

Extrapolating from before (as there's too much variance when measuring everything to be able to reliably measure this small a difference) 0.05752 --> 0.05652 = 2% reduction

Or overall reduction 0.0999 --> 0.05652 = 43.4% reduction

FormularSumo commented 3 months ago

Having done some more testing, some battles do still drop frames while loading on my laptop. I believe the differentiating factor is the background, so I'm going to investigate recompressing some (especially as Sith Triumvirate is currently bigger than all the other images combined).

Battles which drop frames loading (laptop, videos off):
Sith triumvirate - 57 FPS
Jedi Council Chamber (Order 66) - 59 FPS

FormularSumo commented 3 months ago

Those backgrounds have now been replaced. The new versions take up much less space and no longer cause frame drops in those battles on my laptop.

FormularSumo commented 3 months ago

Final load time results, post all off-thread rendering and other optimisations

Laptop, average of 3 runs

Auto deck vs Maxed, videos on

0.05257 --> 0.03201 --> 0.0307 = 4.3% reduction vs last testing, or 41.6% overall

Auto deck vs Geonosis, videos off (unreliable loading time)

0.05382 --> 0.03120 --> 0.0257 = 17.6% reduction vs last testing, or 52.3% overall

^This result is skewed slightly by the background image now being smaller, but given this is the case for 2 battles (very noticeably on sith triumvirate) and another 4 if videos are off - plus main game loading, I think this is good to include still.

Deckeditor, videos on

0.0999 --> 0.0681 --> 0.0570 = 16.3% reduction vs last testing, or 42.9% overall

With all that said, I believe I've done everything I want to do for this issue, so the only remaining step is to merge it all into master and then close this issue.

Update

I've had to completely disable this for the Web build, as threads don't seem to work there. On Android native, having more than 1 extra thread seems to increase load times, at least on slower devices, so I've set it to create just 1 there (could be due to efficiency cores I guess?)
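The resulting per-platform rule could be sketched roughly as below. The platform labels and the `loader_threads` name are hypothetical, and the desktop branch assumes the "1 less than system threads" rule from the earlier conclusion is what ships.

```python
def loader_threads(platform, logical_threads):
    """Number of extra image-loading threads to create for a platform."""
    if platform == "web":
        return 0  # threads disabled: everything loads on the main thread
    if platform == "android":
        return 1  # more than 1 extra thread increased load times on slower devices
    # Desktop (assumed): leave one logical thread free, but always create at least 1.
    return max(1, logical_threads - 1)


for platform, threads in (("web", 4), ("android", 8), ("linux", 12), ("windows", 4)):
    print(platform, loader_threads(platform, threads))
```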

FPS drop (time taken) loading Deck editor / Geonosis (maxed deck)

Main thread
Chromebook: 50/49
P4a: 54/55
P7: 79/80

Main thread + total number of threads minus 1
Chromebook: 43/50
P4a: 42/53
P7: 80/82

Main thread + 1 extra
Chromebook: 52/50
P4a: 56/55
P7: 83/78