Closed ThomasLohnert closed 5 years ago
I think someone needs to inspect the code in SECI to see what it does and why it might be faster. Also, we need to compare the steps in SECI and IBEX, to see if any steps differ in time.
What if you don't do any data collection? (i.e. move to the next point in the scan, stop (just for a moment), and then move to the next point). How does that compare between SECI & IBEX?
The long waits could be due to the duration of moves. See whether https://github.com/ISISComputingGroup/IBEX/issues/4541 impacts the performance
Summary of my findings so far. The results presented here were collected using Python's cProfile module with the command:

```python
cProfile.run('scan("HEIGHT", -0.8, 0.8, count=3, frames=100)')
```

using the scan implementation from SURF. Where a component's cost was not obvious, I measured it by factoring the relevant bits of code out into their own methods and looking at the cumulative time (including sub-method calls) of those methods; the overhead this adds is negligible.
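As a rough illustration of the method, here is a minimal, self-contained sketch of profiling by cumulative time - the scan body is a stand-in using sleeps, not the SURF implementation:

```python
import cProfile
import io
import pstats
import time

def move_to_point():
    # Stand-in for the motion part of a scan step (waitfor_move etc.)
    time.sleep(0.01)

def collect_data():
    # Stand-in for the per-point detector measurement
    time.sleep(0.02)

def scan(points=3):
    for _ in range(points):
        move_to_point()
        collect_data()

profiler = cProfile.Profile()
profiler.enable()
scan()
profiler.disable()

# Sort by cumulative time (time including sub-method calls), which is the
# figure reported in the tables below
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
print(report)
```

Factoring a suspect bit of code into its own named method makes it show up as its own row in this cumulative-time listing.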
Component | Duration (s) | Duration (%) |
---|---|---|
Total | 64.6 | 100 |
detector | 44.7 | 69.3 |
iter | 7.6 | 11.8 |
enter | 9.9 | 15.3 |
misc | 2.3 | 3.6 |
- `detector()` is essentially the per-instrument implementation of the individual scan steps, which processes and returns the data to plot.
- `__iter__()` is the code from the scans library that coordinates motion (i.e. calculating and moving to the points at which to take data).
- `__enter__()` is the context manager bit, which is how the detector gets instantiated, i.e.:

```python
with open(self.defaults.log_file(), "w") as logfile, \
        detector(self, save=save, **kwargs) as detect:
    ...
```
I don't think there is much we can do about the context manager bit, but it is only called once per scan, i.e. the duration does not scale with more iterations, so it should not be a huge factor. In this example I used 3 iterations, but in practice the standard is more like 11-21.
Here is the breakdown for the other two major components, starting with `__iter__()`:
Component | Duration (s) | Duration (%) |
---|---|---|
Total | 7.8 | 100 |
waitfor_move | 3.3 | 42.4 |
action | 2.3 | 29.4 |
create_dictionary | 2.2 | 28.2 |

And for `detector()`:

Component | Duration (s) | Duration (%) |
---|---|---|
Total | 44.7 | 100 |
waitfor_frames | 5.7 | 12.8 |
get_spectrum | 16.4 | 36.6 |
waitfor_move | 6.5 | 14.6 |
waitfor_time | 6.0 | 13.4 |
resume | 2.9 | 6.6 |
pause | 2.6 | 5.9 |
misc | 3.2 | 7.3 |
(summed for 3 iterations)
Here is where I can see improvements being made:

1. `g.get_spectrum()` currently takes 500 ms per call as it contains 4 `get_pv` calls (~120 ms each). I'm not sure how, but it would be nice if we could reduce this. Note that the number of bins in the spectra makes no difference (I profiled it on different instruments with a range of 100-6000 bins, and it was consistently around 500 ms).
2. The `detector()` implementation contains a retry loop with 5 iterations for when zero-spectra are returned (i.e. one of the monitor and detector spectra does not contain any data). I believe this was implemented because `get_spectrum()` sometimes would not return the correct data on the first call. We should check whether this is still the case; if so, find out why and, if possible, fix it. This is where by far the largest gain can be made. I think we may need to test this on a real instrument with neutrons, as this may be mostly a problem with a simulated DAE, which would always return a zero-spectrum.
3. The `waitfor_move` in `detector()` seems to be completely redundant, as this is already done in the `__iter__` code. A call to `waitfor_move` with idle motors still takes 2 seconds, so removing it should be an easy gain.
4. There are `waitfor_time(1)` statements which I suspect were put in as safeguards. We should check whether these are actually necessary.

One bit I have not yet looked at in detail is the duration of motion in between scan points. If this turns out to be a major factor in practice, the only thing I can think of to improve it would be tweaking the backlash distances/speed. This would have to be handled at the motor settings level and is a decision the scientists would have to make per axis (i.e. it's a tradeoff between speed and precision/reliability).
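For illustration, here is a minimal sketch of what such a zero-spectrum retry loop typically looks like. The function shape and dictionary layout are hypothetical, with `get_spectrum` injected so the sketch is self-contained:

```python
MAX_RETRIES = 5

def read_spectra(get_spectrum, detector_spec=1, monitor_spec=2):
    """Re-read both spectra until neither comes back empty.

    get_spectrum is injected to keep the sketch self-contained; the real
    code calls g.get_spectrum(). Each retry re-reads both spectra
    (~0.5 s per read), which is why this loop is expensive whenever
    zero-spectra are common, e.g. against a simulated DAE.
    """
    det = mon = None
    for _ in range(MAX_RETRIES):
        det = get_spectrum(detector_spec)
        mon = get_spectrum(monitor_spec)
        if sum(det["signal"]) > 0 and sum(mon["signal"]) > 0:
            break
    return det, mon
```

Against a simulated DAE every attempt fails, so the loop always burns all 5 iterations, which matches the observation that this is where the largest gain can be made.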
I have implemented changes 2-4 from above, i.e. I have thrown out the redundant `waitfor_move`, the arbitrary `waitfor_time` and the zero-spectrum retry loop, as I could not recreate the bug where `get_spectrum` does not return data. I think we should optimistically run like this next time and just keep an eye on stability.
I have also looked at `g.get_spectrum`, which contains 4 `caget` instructions for a given spectrum/period: 2 for the x and y data, and 2 for the x and y lengths used for trimming them (since the data in the x and y data PVs is fixed length). The only way I see to make this faster is to get only the x or y length and use it to trim both the x and y data, which assumes they are always the same length. This is the case for the instruments I have checked, but I am not sure whether it is actually a general rule.
Here is a comparison of the performance for `scan("HEIGHT", -0.8, 0.8, count=11, frames=100)` (a typical scan command):

- `SECI` is the duration in SECI
- `IBEX curr` is IBEX as we used it for testing on SURF (before these changes)
- `IBEX trim` is with the changes above
- `IBEX trim no backlash` is as above but with the backlash distance of the scanned axis set to zero (instead of 0.05). MC said that longer scans due to slower backlash moves would be acceptable, but I wanted to check how much of a factor this is.

While these changes have greatly improved scan performance, it is still lagging far behind SECI, at almost twice the duration even without backlashing.
For the best IBEX case using the above command, the duration breaks down as follows:
Action | Duration total | Duration % | Comments |
---|---|---|---|
total | 90.7 | 100 | |
waitfor_frames | 18.6 | 20.5 | This cannot be reduced further |
pause | 15.5 | 17.0 | Time/call IBEX: ~1.4s / SECI: ~0.7s |
resume | 10.3 | 11.3 | Time/call IBEX: ~0.9s / SECI: ~0.15s |
get_spectrum | 10.0 | 11.1 | |
waitfor_move | 9.0 | 10.0 | This cannot be reduced further |
begin | 7.3 | 8.1 | Time/call in SECI: ~2.8s |
Scan.action | 4.4 | 4.8 | Checks whether a given motion step is possible |
Scan.create_dict | 3.8 | 4.2 | Updates list of outstanding motion steps of this scan |
Scan.plot | 2.3 | 2.5 | |
change_nperiods | 1.3 | 1.4 | |
get_period | 1.1 | 1.3 | This is done in parallel with counting and should not actually impact performance |
get_frames | 1.1 | 1.3 | |
misc | ~6 | ~6.5 | |
(SECI timing information taken from genie log for scans we did on SURF last cycle)
It's worth investigating whether we can make `begin`, `pause` and `resume` any quicker in genie_python, since there is quite a difference between SECI and IBEX. Other than that, there are possibly some minor gains to be made in the Scans code, but this is hitting a point of diminishing returns.
It seems to me the thing that generally makes everything slower compared to SECI is the reliance on Channel Access requests. We do have a couple of other options to speed scans up, e.g. reducing the number of `caget`s made each loop. These things may require larger changes to the scans library. I will discuss this with the main Scans developer if we cannot improve the performance to a level that is acceptable for the reflectometry scientists by other means.
I should also mention that scans in IBEX sometimes seem to stall, seemingly at random, for lengths in the region of 10s at a time. I have not yet been able to figure out the cause of this.
Thanks @ThomasLohnert. This is a good investigation and you have identified a number of things we can do to get better performance. I agree, we should look at `begin`, `pause` and `resume` to see if we can make more gains. Is there any more we can do to `get_spectrum`?
How do `waitfor_frames` and `waitfor_move` compare to their SECI equivalents?
If dist=True then x and y are the same length; if dist=False then x will be one point bigger (as it holds histogram bin boundaries). However, dist seems to be ignored and I proposed a fix in https://github.com/ISISComputingGroup/IBEX/issues/4014. So it is unnecessary to get both the X and Y array sizes if you know what dist should be (bearing in mind the broken behaviour mentioned in 4014). Channel Access does support variable array sizes, but I think not all clients are happy with this, so we would need to confirm it was OK everywhere.
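A tiny worked example of the length relationship described above:

```python
# For histogram-mode (dist=False) data the x array holds bin boundaries,
# so it has one more element than y; for dist=True data x holds bin
# centres and matches y in length.
edges = [0.0, 1.0, 2.0, 3.0]    # 3 bins -> 4 boundaries
counts = [5, 2, 7]              # one value per bin
assert len(edges) == len(counts) + 1

# Converting boundaries to centres restores equal lengths
centres = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
assert len(centres) == len(counts)
```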
The IBEX begin does more than a SECI begin: it resyncs run control and also restarts (and checks it has restarted) the archive engine. These operations could be moved into separate DB monitoring logic and so run independently after a begin has started.
Given @FreddieAkeroyd's previous point, I think we should do two things:

- `begin` does more than a SECI `begin`
It looks like IBEX is 90s when SECI is about 55s. The lost time of 45s seems to be:

7 seconds pause, 7 seconds resume, 4 seconds begin.

How many are lost to get_spectrum? At most 9. Can you time the `s1=GET(1)` command in Open GENIE? That still leaves 20s to account for.
I'm not sure `waitfor_frames` and `waitfor_move` can be made faster, as they are waiting for external events (neutrons and motion respectively). I tried reducing the sleep time in the waitfor loop (i.e. how often it checks whether the waitfor condition has been met) from 0.3s to 0.01s, but it did not seem to make a noticeable difference.
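The waitfor loop in question is essentially a polling loop. A minimal sketch (not the genie_python implementation) showing where the sleep interval comes in:

```python
import time

def waitfor(condition, poll_interval=0.3, timeout=60.0):
    """Minimal sketch of a waitfor-style polling loop.

    poll_interval is the sleep between checks of the condition; on
    average about half an interval is wasted after the condition becomes
    true, so shrinking it (0.3 s -> 0.01 s) mainly helps short, frequent
    waits - at the cost of more frequent checks, each of which may be a
    Channel Access read in the real code.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

For long waits dominated by the external event (frames arriving, motors moving), the half-interval overhead is negligible, which would explain why reducing the sleep made no noticeable difference here.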
`s1=GET(1)` takes around 50 ms (vs 500 ms for `get_spectrum`).
I added a comment about reducing `get_pv` calls on https://github.com/ISISComputingGroup/IBEX/issues/4014
Ticket for speeding up DAE created here: https://github.com/ISISComputingGroup/IBEX/issues/4666
I will bring up the topic of scan performance with the scientists. The changes implemented & proposed above might get us close enough for it to be accepted. If not, I will talk to AW about ways of speeding up the scans code. After that I think it would get very fiddly to try and wring more speed out.
The size of the X array or Y array will not change for a given spectrum during a run; the X array itself will also not change for a given spectrum during a run. So in principle it would be possible to cache the X array (and its size) and only re-read the Y array during a scan.
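A minimal sketch of that caching idea, with illustrative PV names and an injected `get_pv` (not the real DAE layout or genie_python API):

```python
class SpectrumCache:
    """Cache the x array (and its used size) per spectrum.

    Within a run neither the array sizes nor the x array change for a
    given spectrum, so only the y array needs re-reading at each scan
    step. get_pv is injected; PV names are illustrative.
    """

    def __init__(self, get_pv):
        self._get_pv = get_pv
        self._cache = {}  # spectrum -> (used size, trimmed x array)

    def read(self, spectrum):
        if spectrum not in self._cache:
            n = self._get_pv("SPEC:{}:Y.NORD".format(spectrum))
            x = self._get_pv("SPEC:{}:X".format(spectrum))[:n]
            self._cache[spectrum] = (n, x)
        n, x = self._cache[spectrum]
        y = self._get_pv("SPEC:{}:Y".format(spectrum))[:n]  # only y re-read
        return x, y
```

This turns the steady-state cost into a single PV read per scan step instead of four, at the price of a cache that must be invalidated when a new run begins.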
Does ISISComputingGroup/genie_python#169 help?
Thought of something else that is making pause/resume slower. At the end of any DAE command I make sure the PVs for information (frames, microamps etc.) are correct for that state, in case anybody reads these PVs immediately afterwards. As I don't know what people might read, I have to update them all. SECI didn't need to do this: genie made a direct call to the isisicp if it needed a value, so all a pause had to do was pause data collection. The problem is that the isisdae IOC and isisicp are separate processes, so they need to be kept in sync. Possible solutions:
https://github.com/ISISComputingGroup/genie_python/pull/169 makes quite a difference. It looks like IBEX might actually beat SECI for longer scans?
I will profile SECI against different IBEX conditions (pre changes, post changes, no backlash) for several scans with different parameters.
Excellent news! I'll be very interested to see the new profiles. And we should make sure the scientists know of these performance gains.
Sorry, I spoke too soon. The scans in SECI I compared against earlier did not run properly, which I believe was a result of starting OpenGENIE before SECI. It is still slightly faster but we are getting closer:

Condition | SECI | IBEX pre | IBEX post | IBEX post no backlash |
---|---|---|---|---|
11 steps, 100 frames | 51.534 | 154.061 | 70.633 | 62.124 |
11 steps, 500 frames | 148.191 | 233.086 | 163.028 | 159.06 |
41 steps, 100 frames | 173.375 | 527.913 | 230.238 | 233.903 |
41 steps, 500 frames | 533.394 | 820.895 | 574.434 | 560.031 |
Average all scans | 226.6235 | 433.98875 | 259.58325 | 253.7795 |
Note that IBEX pre was run under the optimistic assumption that get_spectrum never returns zero spectra for either detector or monitor; otherwise each failed attempt adds 1 second. The reality of this is not testable without neutrons.
The main difference between the modified IBEX and SECI scans seems to be in the duration of `pause`/`resume`, which accounts for about an extra 0.5-0.6s per iteration.
If we can reduce this as per Freddie's suggestions above, I imagine it will bring the performance difference to an acceptable level (if it is not already acceptable as-is).
The other thing I still need to do is find a proper solution for caching the spectra size across scan steps (I have been testing it with a hacky bit of code). This will likely require changes in the scans library, so I will talk to AW about it when he is back. However, this now has a lower impact on the time, with the above changes making get_pv faster.
Excellent news - the time for short scans has almost halved! I think James Lord would be interested in the python fix too as he does a lot of PV setting during pre-cycle tuning.
We could investigate dynamic array sizes at some point, in theory you should never need to read the array size separately but we had issues (gateway maybe?) in the past when a size changed. I think we may understand this now and be able to fix.
Some other thoughts:
Do you use cget/cset a lot in the code?

Each gets called once per scan step. Looking back at the results, the average time per call for both varies quite drastically. I think cget/cset may be the source of the random stalls I described in an earlier comment - this seems important to fix (may be fixed by https://github.com/ISISComputingGroup/IBEX/issues/4338). I also tried modifying `get_blocks` to return a hardcoded list, which gave a slight performance gain compared to the "best case" runs.
The waitfor command (so waiting for frames) has a 0.5 second delay in its loop; wait_for_runstate has a 0.3 second delay in its loop. Changing both of these to 0.1 actually makes quite a difference (~20 seconds for 41 steps, 100 frames). I'm not sure whether this is safe to do as a general change to `waitfor` or whether it should be an option. Could running sleep in a tighter loop have negative performance implications in other areas?
Do you or the scans library call `wait_for_pv()`?

No.
Summary of investigation:

In our `Defaults` implementation (`Surf.py`):

- `waitfor_time`
- `waitfor_move`
- the `get_spectrum` part
- `sleep` duration in `waitfor` methods (https://github.com/ISISComputingGroup/genie_python/pull/170)
- `get_spectrum`

All the code I modified temporarily for these tests is on SURF, in `Instrument\Scripts\` and `Apps\Python`. All the files in question end in `_4547.py`.
Does the runstate change in #4686 make much difference to pause and resume times?
Used https://github.com/ISISComputingGroup/genie_python/pull/171/files instead of the 0.3 -> 0.1 wait in ISISComputingGroup/genie_python#170 (advised by @FreddieAkeroyd; this seems OK and was tested by disabling the state check in genie_python - the error is caught and alerted).
The above 3 PRs look good; with them applied, a rough calculation suggests that the "IBEX post no backlash" and SECI times should now be almost the same?
As a reflectometry scientist, I would like scanning to run at least as fast as in SECI. Currently IBEX seems to take much longer, because the pauses between individual data points (where it stops data collection, moves the motor and changes period) are much longer (5-10 seconds instead of 1-2).
I'm not sure what the reason is (the DAE behaving differently in IBEX? specifics of the implementation of the scans library?). In any case, fixing this is a priority according to the reflectometry scientists, as it makes the already quite tedious scanning process much slower.