Phoenix pro java StatusSignalValue refresh method has random runtime spikes

mjansen4857 commented 1 year ago

We were recently trying to diagnose some major code runtime issues, our main robot loop would almost consistently run in the 30-50ms range, which is obviously pretty bad for a target runtime of <20ms. We were able to narrow the issue down to the refresh method for StatusSignalValue objects. According to the docs, these methods are supposed to be non blocking and we measured them to typically take <0.1ms, but we also measured random runtime spikes in these methods of up to 20ms. With dozens of status signals being refreshed every robot loop, this led to these spikes happening nearly every loop somewhere in the code.

We replaced every usage of a cached StatusSignalValue object to just using the get methods, i.e. driveMotor.getPosition().getValue() and our main loop runtime is now consistently between 10-15ms.

This is fairly easy to reproduce (at least for us) by just measuring the run time of a block of refresh methods. We have only tried this with multiple refreshes immediately after one another, so not sure if it would happen with just one. For example on a swerve module:

double startTime = Timer.getFPGATimestamp();

drivePositionSignal.refresh();
driveVelocitySignal.refresh();
rotationPositionSignal.refresh();
rotationVelocitySignal.refresh();
rotationEncoderAbsPositionSignal.refresh();
rotationEncoderVelocitySignal.refresh();

double runtimeMS = (Timer.getFPGATimestamp() - startTime) * 1000;
SmartDashboard.putNumber("RefreshRuntime", runtimeMS);

We have verified that these spikes are not false measurements from the fpga timestamp, timing the runtime of everything but the refresh methods produces consistent, accurate measurements.

CoryNessCTR commented 1 year ago

We replaced every usage of a cached StatusSignalValue object to just using the get methods, i.e. driveMotor.getPosition().getValue() and our main loop runtime is now consistently between 10-15ms.

This is really odd, as the get*() methods call refresh() under the hood. If we were to take your code snippet and switch from refresh() to get*() would you expect the timing to get better?

The refresh() routine is designed to be nonblocking, but we did have another customer run into an issue where occasionally calls were slow (though not as slow as 20ms).

We recommended switching to using waitForAll() with 0 timeout and their results were much better. If you have lots of refresh calls, switch to using waitForAll(), as it reduces the overall call time and is generally the best method for asynchronously polling in terms of performance. The more signals you can move to the single waitForAll(), the better it'll be.

Make sure you're on the latest version of API (23.0.10) when you switch over, as we improved waitForAll() to better handle 0 timeout. https://api.ctr-electronics.com/changelog#20230301

Let us know if you've got any other questions!

mjansen4857 commented 1 year ago

If we were to take your code snippet and switch from refresh() to get*() would you expect the timing to get better?

Yes, we actually have more calls to the get methods per loop than we do to refresh methods, so I would expect the opposite, but the get methods don't have these random spikes for some reason, but the refresh ones do. The timing probably takes more when there are no refresh time spikes, since if I understand it correctly the get methods are blocking correct? But, since we have enough refresh calls to get timing spikes pretty often, it ends up being quicker most of the time.

We will give the waitForAll method a try tonight and report back. Thanks for the suggestion.

mjansen4857 commented 1 year ago

Do you mean waitForUpdate instead of waitForAll? I can't seem to find a waitForAll method.

EDIT: NVM, I see its a static method now.

CoryNessCTR commented 1 year ago

since if I understand it correctly the get methods are blocking correct?

The get*() methods are meant to be nonblocking as well (through the refresh call under the hood). I'm making an internal tracker to look into this more, since it's the opposite of what I expect.

we actually have more calls to the get methods per loop than we do to refresh methods

I'd try to cache all the StatusSignalValues so there's not any get*() methods getting called periodically if it's possible when performing your test. As I said, the more objects you can pass into the single call the better the performance should be.

Let us know how your results turn out!

mjansen4857 commented 1 year ago

So the waitForAll method is definitely better, typically running <0.2 ms. But, it still spikes up to 3ms sometimes. So, definitely usable now, but the spikes are still less than optimal. Pretty rare issue now since it’s only a few waitForAll calls instead of dozens refresh calls.

CoryNessCTR commented 1 year ago

This should be resolved in the latest beta release of Phoenix 6. Please re-open if you experience this issue again.

CrossTheRoadElec / Phoenix-Releases

Phoenix pro java StatusSignalValue refresh method has random runtime spikes #44