NCSU-High-Powered-Rocketry-Club / AirbrakesV2

A rewrite of Airbrakes aimed at being clean and concise.
MIT License

Performance Review #85

Open harshil21 opened 1 day ago

harshil21 commented 1 day ago

Performance is a critical part of our system. The code needs to run fast enough to predict apogee, and thus deploy the airbrakes in time, all while running on a tiny computer (the Raspberry Pi). While Python isn't known for its speed, we use it because of its ease of use and massive community and library support.

One of the only ways we can speed up our program is by using multiprocessing - but this adds a lot of overhead, particularly with Queues. This is most noticeable when you run the mock sim with -f (and if you profile the code to check the bottlenecks). In the earlier days, running even a real-time simulation on the Pi would result in fetching about 15 packets at a time, which was indicative of performance issues. However, after many optimizations across the board (e.g. #40, #45, #47, #69, #81, #82), this is down to 1-3 packets, which means we have fully solved the performance problem! Yay!
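To make the overhead concrete: every get() on a multiprocessing.Queue pays for a lock acquisition plus a pickle round trip, so draining packets one at a time adds up fast. A standalone toy benchmark (the dict is just a stand-in for a real IMU data packet, not our actual class):

```python
import multiprocessing
import time


def producer(queue: multiprocessing.Queue, n_packets: int) -> None:
    """Put n_packets items on the queue one at a time."""
    for i in range(n_packets):
        queue.put({"timestamp": i, "accel": 0.0})  # stand-in for a real data packet
    queue.put(None)  # sentinel to tell the consumer to stop


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=producer, args=(queue, 100_000))
    proc.start()
    start = time.perf_counter()
    # Each get() acquires the queue's internal lock and unpickles one item,
    # so the per-packet overhead dominates at high packet rates.
    while queue.get() is not None:
        pass
    proc.join()
    print(f"drained one-by-one in {time.perf_counter() - start:.2f} s")
```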

Or did we?

Running the mock sim at full speed still takes about 10 seconds for every launch. An eternity. This is still unacceptable to me.

So after investigating, I found that the producer side (MockIMU) takes 6.5 seconds just to read the file and create the data packets (without even sending them!). That was brought down to 1.73 seconds by using pandas (#82). Simply using pandas and mimicking what was done before does not give you any gains; you have to use it in a thoughtful and intentional way, profiling every step of your process. So now, the theoretical best time our program can finish in should be 1.73 seconds (on my computer at least). But we don't get anywhere close to that, because of... Queues.
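As an aside, here is roughly what "thoughtful and intentional" pandas usage means in practice. A toy sketch, not our actual parsing code (column handling is simplified):

```python
import pandas as pd


def read_packets_naive(path: str) -> list[dict]:
    """Mimics a row-by-row loop; gains almost nothing over plain csv.reader."""
    df = pd.read_csv(path)
    return [row.to_dict() for _, row in df.iterrows()]  # per-row Series overhead


def read_packets_bulk(path: str) -> list[dict]:
    """Lets pandas parse and convert in bulk: one vectorized pass over all rows."""
    df = pd.read_csv(path)
    return df.to_dict("records")
```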

I would like to propose using the faster-fifo library for this purpose. Initial benchmarks show a massive improvement, finishing the purple launch sim in a staggering time of under 5 seconds.
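For concreteness, a minimal sketch of how faster-fifo could slot in (packet contents are placeholders). The big win is get_many(), which drains a whole batch of messages under a single lock acquisition:

```python
from queue import Empty  # faster-fifo raises the standard queue.Empty

from faster_fifo import Queue

# Shared-memory circular buffer; size it for the largest expected packet backlog.
packet_queue = Queue(max_size_bytes=5_000_000)


def drain(queue: Queue) -> None:
    """Consume packets in batches instead of one get() per packet."""
    while True:
        try:
            # One call (one lock acquisition) returns up to 100 packets.
            packets = queue.get_many(timeout=0.1, max_messages_to_get=100)
        except Empty:
            break  # nothing arrived within the timeout
        for packet in packets:
            ...  # process the packet
```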

The only disadvantage of using it is that it does not work on Windows. I'm not even sure if it works on WSL. That being said, this would be the fork in the road for Windows support, at least in the short term. Windows support is already not great, as evidenced by the tests failing due to multiprocessing semantics. Alternatively, we can keep backward compatibility, so Windows would continue working as before, while Unix/macOS can enjoy the benefits.

Long term, with Python 3.13's free-threaded build, we can drop multiprocessing altogether, simplifying a bunch of things as well, and that should theoretically give us the same performance. We are simply waiting for msgspec to add support for it: https://github.com/jcrist/msgspec/issues/744.

MrPezser commented 1 day ago

Is the faster-fifo library optimization just for mocking the IMU data? If so, then is losing Windows support worth it for something that may or may not be used for an actual launch?

Also, do we have an up-to-date profile of the code running on the Pi, mocks or not? It would be nice if we could find a way to algorithmically optimize the queue actions instead of (or in addition to) this.

harshil21 commented 13 hours ago

> Is the faster-fifo library optimization just for mocking the IMU data? If so, then is losing Windows support worth it for something that may or may not be used for an actual launch?

The most performance to be gained is in fetching IMU data packets from the (Mock)IMU, but we can also apply it to the other Queues being used. My first instinct is to keep Windows support as well. This should be easy enough with no additional overhead (i.e. we should not check every iteration whether we are on Windows) if designed properly; see the sketch below. The only thing is that it would make the code slightly less readable.
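A hedged outline of what "no per-iteration check" could look like: decide once at import time which queue implementation to use, and expose one batch-read helper that behaves the same on both platforms. This is not our actual module layout, and the Windows get_many() emulation is approximate:

```python
import sys
from queue import Empty

if sys.platform == "win32":
    # faster-fifo does not build on Windows, so fall back to the stdlib queue.
    # NB: the two constructors differ (maxsize vs. max_size_bytes); a thin
    # factory function could hide that difference too.
    from multiprocessing import Queue as PacketQueue

    def get_many(queue, max_items: int = 100, timeout: float = 0.1) -> list:
        """Approximate faster-fifo's get_many(): block for one item, then drain
        whatever else is immediately available. Raises queue.Empty on timeout,
        matching faster-fifo's behavior."""
        items = [queue.get(timeout=timeout)]
        try:
            while len(items) < max_items:
                items.append(queue.get_nowait())
        except Empty:
            pass
        return items
else:
    from faster_fifo import Queue as PacketQueue

    def get_many(queue, max_items: int = 100, timeout: float = 0.1) -> list:
        """faster-fifo drains a whole batch natively under one lock."""
        return queue.get_many(timeout=timeout, max_messages_to_get=max_items)
```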

> Also, do we have an up-to-date profile of the code running on the Pi, mocks or not? It would be nice if we could find a way to algorithmically optimize the queue actions instead of (or in addition to) this.

I have not specifically profiled the code on the Pi's hardware yet, no. But an easy optimization we can make to the queue regardless of faster-fifo is to send and receive a fixed number of packets at a time, instead of sending and retrieving them one by one. This is theoretically faster, since each process acquires the Lock() only once every x packets, instead of once per packet. A rough sketch of this is below.
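Sketching that batching idea with the stdlib queue (the packet dicts are placeholders, and BATCH_SIZE would need tuning against our latency requirements):

```python
import multiprocessing
import time

BATCH_SIZE = 10  # trade-off: fewer lock acquisitions vs. added latency per packet


def producer(queue: multiprocessing.Queue, n_packets: int) -> None:
    batch: list[dict] = []
    for i in range(n_packets):
        batch.append({"timestamp": i})  # stand-in for a real IMU packet
        if len(batch) >= BATCH_SIZE:
            queue.put(batch)  # one put() (one lock acquisition) per BATCH_SIZE packets
            batch = []
    if batch:
        queue.put(batch)  # flush the remainder
    queue.put(None)  # sentinel


if __name__ == "__main__":
    queue: multiprocessing.Queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=producer, args=(queue, 100_000))
    proc.start()
    start = time.perf_counter()
    while (batch := queue.get()) is not None:
        for packet in batch:
            pass  # process each packet
    proc.join()
    print(f"batched drain: {time.perf_counter() - start:.2f} s")
```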

Batching like this is something we already do for the apogee prediction process. We haven't updated the code for the actual IMU yet because it would need thorough testing, which we haven't had a chance to do.