Closed LeeHudsonDLS closed 4 years ago
Can you post the profile output? What are the callbacks doing?
I don’t understand what you mean by keeping track of arrays in between. What does it do with those intermediate arrays?
How large are the arrays?
Can you post the profile output? What are the callbacks doing?
@LeeHudsonDLS will post the output shortly. The callbacks are updating the uniqueID and timeStamp parameters on all the plugins in the chains at 10kHz, causing the db to deliver camonitor updates at 10kHz to CA clients.
I don’t understand what you mean by keeping track of arrays in between. What does it do with those intermediate arrays?
I think this can be better worded as: "if we are only calling callParamCallbacks every n milliseconds, and a final NDArray comes in that updates the uniqueID and timeStamp parameters without callParamCallbacks being called, how can we ensure the PVs will not show old information if nothing else calls callParamCallbacks?"
Here's an example: let's assume arrays come in at 5kHz but we only want to update every 1 ms:
UID  1 2 3 4 5 6 7 8 9 10
CB?  Y N N N N Y N Y Y N
With the current system, we would get a callParamCallbacks() at every UID, but that kills performance. If we throttle it so that callParamCallbacks() is only actually called when we haven't done so in the last millisecond, we get the updates marked Y above, but nothing for the final array that arrives a short time after the last callback. This is obviously not ideal as we would spend a good amount of time with incorrect information on the screen. A possible solution is pushing the callParamCallbacks to a background thread:
# Pseudo-code that looks an awful lot like Python...
import time

class asynNDArrayDriver:
    min_time_between_cpc = 0.001  # seconds
    need_cpc = False

    def callParamCallbacks(self):
        # Just flag that callbacks are needed; the background thread does the real work
        self.need_cpc = True

    def thread(self):
        # Background thread: flush parameter callbacks at most once per min_time_between_cpc
        while True:
            self.lock()
            if self.need_cpc:
                super().callParamCallbacks()  # call the real asyn implementation
                self.need_cpc = False
            self.unlock()
            time.sleep(self.min_time_between_cpc)
This would mean that any callParamCallbacks call would now just set a flag, to be pushed to the client by a background thread with a maximum delay of min_time_between_cpc.
How large are the arrays?
Tiny, maybe 20x1 Float64.
Here's the perf file too: perf26.tar.gz
I am attempting to reproduce the setup with the simDetector. I have configured the simDetector with 20x1 Float64 arrays. Exposure time is 0.00003 seconds, resulting in a frame rate of about 10,500 frames/s.
The only plugin I have enabled is HDF5.
"top" shows the the entire simDetector process is using 112% of a single core.
"top-H" shows this CPU usage is approximately evenly split between the SimDet task (which cannot be reduced), cbLow (processing the records), CAS-event (sending Channel Access monitors), and FileHDF51_Plugin processing callbacks.
The measurements above were with MinCallbackTime for the HDF5 plugin set to 0. If I increase this to 0.001 seconds then the simDetectorApp CPU usage falls to 68% of a core, or about 60% of the previous value. This is a reduction of about 0.5 cores.
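For reference, that throttle can be set from Python with something like the following pyepics sketch; the 13SIM1:HDF1: PV prefix assumes the standard simDetector example IOC and will differ on other systems:

from epics import caput  # pyepics

# MinCallbackTime is in seconds; 0 disables throttling of plugin callbacks
caput('13SIM1:HDF1:MinCallbackTime', 0.001)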
What kind of computer is running the Panda IOC? Is it a modern multi-core CPU? If so then 112% of a core does not seem like too much. Or is it running on an embedded low-power CPU in the Panda box?
Can you send the output of "top" and "top -H" on your system?
This would mean that any callParamCallbacks call would now just set a flag, to be pushed to the client by a background thread with a maximum delay of min_time_between_cpc.
That looks dangerous to me. One of the parameters that will be affected is the Acquire busy record. We must do callbacks on every transition of that parameter, which is not guaranteed with the scheme you proposed.
This scheme also assumes that it is the callParamCallbacks that is imposing the CPU overhead. It could also be the plugin code itself. It seems to me that a cleaner solution is to push this down to your driver. Rather than always calling doCallbacksGenericPointer you do something similar to the code in NDPluginScatter::doNDArrayCallbacks().
We modify NDPluginDriver::driverCallbacks so that if the NDArray is not queued for processing because the MinCallbackTime has not been reached, this status information is returned in pasynUser->auxStatus.
If pasynUser->auxStatus says the NDArray was not queued, your driver starts a watchdog timer for the MaximumTimeBetweenCallbacks, for example 0.1 second. It also stores a pointer to that NDArray so it can be processed later. This is probably already done in this->pArrays[0].
If the driver successfully queues an NDArray before the watchdog expires it cancels the timer.
The watchdog timer callback routine does doCallbacksGenericPointer on the pArrays[0] array, guaranteeing that the array does get processed so OPI screens are never more than 0.1 second stale.
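To make that concrete, here is a rough sketch of the watchdog pattern in the same Python-like pseudo-code used above. ThrottledDriver, queue_array and do_callbacks are illustrative stand-ins, not real ADCore API: in a real driver queue_array would correspond to handing the NDArray to the plugin (returning whether MinCallbackTime allowed it), do_callbacks to doing doCallbacksGenericPointer on the stored array, and last_array plays the role of this->pArrays[0].

import threading

MAX_TIME_BETWEEN_CALLBACKS = 0.1  # stand-in for MaximumTimeBetweenCallbacks, in seconds

class ThrottledDriver:
    def __init__(self, queue_array, do_callbacks):
        self.queue_array = queue_array    # tries to queue an array; returns False if throttled
        self.do_callbacks = do_callbacks  # delivers a stored array to subscribers
        self.last_array = None            # plays the role of this->pArrays[0]
        self.watchdog = None

    def new_array(self, array):
        self.last_array = array
        if self.queue_array(array):
            # The array was delivered, so no stale data is possible; cancel any pending watchdog
            if self.watchdog is not None:
                self.watchdog.cancel()
                self.watchdog = None
        elif self.watchdog is None:
            # The array was dropped by the throttle; make sure it is still shown within 0.1 s
            self.watchdog = threading.Timer(MAX_TIME_BETWEEN_CALLBACKS, self._expire)
            self.watchdog.start()

    def _expire(self):
        self.watchdog = None
        self.do_callbacks(self.last_array)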
What kind of computer is running the Panda IOC? Is it a modern multi-core CPU? If so then 112% of a core does not seem like too much. Or is it running on an embedded low-power CPU in the Panda box?
It is a modern multi-core CPU, and I believe we get a similar CPU usage to you. @LeeHudsonDLS could you list the top and top -H outputs for our cutdown example please?
The reason we are looking at this is that the PandA areaDetector driver was originally very inefficient. Lee has been improving it so now it will run at 10kHz, but we were surprised at how much CPU was being used keeping the UID and timeStamp PVs updated in plugins. We cut back the example to a single HDF writer that wasn't capturing, and that is what we profiled. We will now go back to our full plugin chain and check again how much CPU the whole application takes while writing all the frames to disk, and profile the whole thing as well.
That looks dangerous to me. One of the parameters that will be affected is the Acquire busy record. We must do callbacks on every transition of that parameter, which is not guaranteed with the scheme you proposed.
That's good to know, we'll keep that in mind when we continue investigating.
This scheme also assumes that it is the callParamCallbacks that is imposing the CPU overhead. It could also be the plugin code itself.
Also possible, we will test with a larger example
It seems to me that a cleaner solution is to push this down to your driver. Rather than always calling doCallbacksGenericPointer you do something similar to the code in NDPluginScatter::doNDArrayCallbacks().
I'm not sure I understand. When we press start on the HDFWriter, it needs to write all the frames it receives from that point to disk. If we set MinCallbackTime doesn't that mean it won't get all the frames? I was hoping we could find some way of letting processCallbacks continue at the maximum rate, but throttle the updates to the PVs that were feeding the GUI so they weren't transmitting unnecessarily fast updates...
I'm not sure I understand. When we press start on the HDFWriter, it needs to write all the frames it receives from that point to disk. If we set MinCallbackTime doesn't that mean it won't get all the frames?
OK, so you can't throttle in the driver.
Would another possibility be to bundle the data in the driver? Instead of doing callbacks with a 20x1 array, bundle N frames and do callbacks with a 20xN array. This reduces the callback rate for most PVs by a factor of N. You might need a timeout so at the end of a scan it does callbacks with padding if N arrays have not yet been received.
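As a rough illustration of that bundling (not ADCore code), here is a sketch using NumPy arrays as stand-ins for the 20x1 NDArrays; FrameBundler, publish and the NaN padding value are all illustrative choices:

import numpy as np

class FrameBundler:
    def __init__(self, n, publish):
        self.n = n              # frames per bundle (the "N" above)
        self.publish = publish  # stand-in for doing callbacks with the 20xN array
        self.frames = []

    def add_frame(self, frame):
        # frame is a 20x1 array; emit a 20xN bundle once N frames have arrived
        self.frames.append(frame)
        if len(self.frames) == self.n:
            self.flush()

    def flush(self):
        # Also called from a timeout at the end of a scan; pads a partial bundle out to 20xN
        if not self.frames:
            return
        bundle = np.hstack(self.frames)
        if bundle.shape[1] < self.n:
            pad = np.full((bundle.shape[0], self.n - bundle.shape[1]), np.nan)
            bundle = np.hstack([bundle, pad])
        self.publish(bundle)
        self.frames = []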
I was hoping we could find some way of letting processCallbacks continue at the maximum rate, but throttle the updates to the PVs that were feeding the GUI so they weren't transmitting unnecessarily fast updates...
The problem is that those callbacks are not just for GUIs, they are for clients like GDA, ImageJ, spec, sscan, etc. which may not perform correctly if they miss updates.
Another possibility is to change the SCAN field of the non-critical PVs that are updating really fast from I/O Intr to .1 second. That should reduce the load a lot.
Would another possibility be to bundle the data in the driver? Instead of doing callbacks with a 20x1 array, bundle N frames and do callbacks with a 20xN array.
I considered this, but communicating the padding of fields to whatever is consuming the data downstream is tricky. I suppose I could wrap it in a VDS to re-extract the data into the correct structure again. I'll think about this some more.
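For example, a minimal h5py sketch of that VDS idea, assuming the bundled (padded) data ends up in a 20 x n_total dataset called 'bundled' in raw.h5 and that only the first n_valid columns are real frames; the file name, dataset name and sizes are all made-up placeholders:

import h5py

n_total, n_valid = 1000, 987  # illustrative: columns written (including padding) vs. real frames

# Virtual dataset exposing only the un-padded columns of the bundled data
layout = h5py.VirtualLayout(shape=(20, n_valid), dtype='f8')
source = h5py.VirtualSource('raw.h5', 'bundled', shape=(20, n_total))
layout[:, :] = source[:, :n_valid]

with h5py.File('view.h5', 'w') as f:
    f.create_virtual_dataset('data', layout)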
Another possibility is to change the SCAN field of the non-critical PVs that are updating really fast from I/O Intr to .1 second. That should reduce the load a lot.
That's a nice small change that we can retrofit quite easily, we'll have a go with this.
Thanks for the idea!
@LeeHudsonDLS could you list the top and top -H outputs for our cutdown example please?
Probably a bit late now; "example" is the process of interest.
Closed via #426
@MarkRivers We have a detector (PandA) producing NDArrays at 10kHz which I am trying to optimise (CPU usage is through the roof). Profiling the IOC I can see the vast majority of the CPU usage is being taken up with callbacks. It would be good to have a method that calls callParamCallbacks every n milliseconds rather than for every array, but also keeps track of arrays in between so as not to lose them. I was thinking of doing this in ADCore even though callParamCallbacks is from asyn; can you think of a better way to get around this?