Over the last few weeks, we have identified a recurring problem: workflows that place heavy CPU load on a few cores cause stuttering and frame drops across the whole system. So far this has been observed most consistently on the AEON3 machine, which has an AMD Ryzen Threadripper 3970X and runs Windows 10 with Bonsai 2.7.
What we know so far:
We have never observed this behavior on AEON1, which runs on Intel processors.
We might have observed it under specific conditions with unrelated workflows on AEON2, but as far as we know not during actual experiments or tests. AEON2 uses the same AMD Ryzen processor, but different cameras and drivers (FLIR Spinnaker vs Basler Pylon).
It does not depend on the number of cameras.
It does appear to depend on CPU load. Even a single camera will start stuttering and dropping frames as soon as it maxes out a CPU core. Conversely, as soon as we remove the workload on individual cores (e.g. by disabling all alerts or other live processing such as tracking), we can run as many cameras as we have without any drops.
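To compare conditions (e.g. alerts enabled vs disabled), it may help to put a number on "stuttering and drops". The following is a minimal sketch, not part of the existing workflows, that flags gaps in a list of frame timestamps. The function name `find_frame_drops`, the `tolerance` parameter, and the use of the median interval as the nominal frame period are all assumptions made for illustration; the timestamps could come from any per-frame log, in any consistent time unit.

```python
from statistics import median


def find_frame_drops(timestamps, nominal_period=None, tolerance=0.5):
    """Return a list of (frame_index, missed_frames) for each oversized gap.

    timestamps     -- frame times in acquisition order, any consistent unit
    nominal_period -- expected time between frames; if None, the median
                      inter-frame interval is used as an estimate (assumption)
    tolerance      -- a gap is flagged when it exceeds the nominal period
                      by more than this fraction
    """
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not intervals:
        return []
    period = nominal_period if nominal_period is not None else median(intervals)

    drops = []
    for i, dt in enumerate(intervals):
        if dt > period * (1 + tolerance):
            # Estimate how many frames are missing inside this gap.
            missed = max(1, round(dt / period) - 1)
            # i + 1 is the index of the first frame received after the gap.
            drops.append((i + 1, missed))
    return drops


if __name__ == "__main__":
    # Hypothetical 50 Hz camera, timestamps in milliseconds, two frames lost.
    print(find_frame_drops([0, 20, 40, 100, 120]))
```

Running the same check on recordings from each test condition would give a drop count per condition, which is easier to compare than a visual impression of stutter.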
What we should try next:
[x] Run the updated experiments branch on AEON2. This will tell us if the stuttering is in any way related to the Bonsai 2.7 update or the associated OpenCV update. [DONE, ALSO STUTTERING]
[x] There have been reports of stuttering issues with Ryzen processors in high-frequency, high-load applications such as video games. AMD has released a number of patches to address these issues on Windows, so we could try applying these to AEON3. [DONE, NOTHING CHANGED]
[x] Run AEON3 hardware and workflows on the AEON1 computer. The computer has now been moved into the room. This should give us more information to disentangle whether the issue is Intel vs AMD or FLIR vs Basler. [DONE, STUTTERING IS GONE AS LONG AS VIDEOS ARE NOT SAVED; NOTE THAT ON THE RYZEN MACHINE EVEN ACQUIRING WITHOUT VIDEO RECORDING GENERATES STUTTERING]
[x] Update firmware and change BIOS settings on the motherboard to remove the fTPM module, which has been reported to cause stuttering on some systems. [DONE, MIGHT HAVE IMPROVED SOMEWHAT BUT DID NOT ELIMINATE]
[x] Change PCIe USB compatibility profiles. [DONE, NO CHANGE]
[x] Run workflows with a different Ryzen Threadripper CPU model, e.g. the 3960X, to see if this is an issue specific to the 3970X.
[x] Run workflow test with just cameras and video controller on the Ryzen CPU to see if stuttering persists. [DONE, NO STUTTERING]
[x] Run workflow test with everything except alerts and visualizations on the Ryzen CPU to see if stuttering persists. [DONE, NO STUTTERING]
[x] Experiment with different approaches to distributing workload across processors in AEON3. This can go all the way to using dedicated schedulers and threads, or increasing the granularity of parallelization to make sure load is distributed as evenly as possible.
[ ] Experiment with moving alerts and visualizations out of AEON3 acquisition and into a separate workflow running on a monitoring machine (this was in the pipeline anyway).
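As a rough illustration of what distributing load "as evenly as possible" could mean in the item on workload distribution above, here is a minimal sketch of a greedy assignment (largest task first, always onto the least-loaded worker). It is not how Bonsai schedules work, and it does not set thread affinity; the task names and cost values are hypothetical placeholders that would have to be replaced with measured per-task CPU cost.

```python
import heapq


def assign_to_workers(task_costs, n_workers):
    """Greedy longest-processing-time-first assignment of tasks to workers.

    task_costs -- dict mapping task name to an estimated CPU cost
                  (hypothetical units, e.g. percent of one core)
    n_workers  -- number of worker threads or cores to spread tasks over
    Returns (assignment, loads), both dicts keyed by worker index.
    """
    # Min-heap of (current_load, worker_index): the least-loaded worker pops first.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}

    # Place the most expensive tasks first so cheap ones can fill the gaps.
    for name, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(name)
        heapq.heappush(heap, (load + cost, worker))

    loads = {w: sum(task_costs[t] for t in tasks) for w, tasks in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical costs, for illustration only.
    costs = {"tracking": 6, "alerts": 3, "visualizer": 3, "video": 4}
    assignment, loads = assign_to_workers(costs, n_workers=2)
    print(assignment)
    print(loads)
```

The greedy rule is a heuristic and does not guarantee the best possible balance, but it makes it easy to see whether any single task is large enough to saturate a core on its own, in which case that task would need to be split further or moved elsewhere.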