Motors: Position getting reassigned without being asked to - Reproduce this

KathrynBaker commented 7 years ago

During commissioning on ZOOM a couple of times it has been noticed (and is the case at the moment), that the 'position' of an axis has been altered.

The axes on ZOOM have not been moved, they are all at their forward limit, they should all be reading 0 (or close to 0), but MTR0204 and MTR0208 have had their position jump to -5.42 and -4.22 respectively, during a number of restarts of the IOC and config changes.

This is a potentially catastrophic behaviour for a running experiment, so time should be taken to see if this can be reproduced (ZOOM could be used for this at the moment), and then see if a resolution can be found.

This ticket should be used to see if this can be reliably reproduced, after which a second ticket can be written to see if it can be avoided, or if the resolution is obvious once reproduction of the issue is possible, then a ticket to fix it can be written.

GDH-ISIS commented 7 years ago

During commissioning, I gave the axes names in the EPICS "world" (.DESC field). For Galil_02, these names appear to have been changed and are no longer referring to the beam line motion control slits. (This may have been related to getting the globals.txt file operational on ZOOM). It would be good to appreciate why this has happened as we are going to lose our axis identifiers for driving the beam line. I believe Galil_01 and Galil_03 axis names appear to be fine.

KathrynBaker commented 7 years ago

I have so far spent two and a half hours actively trying to reproduce this bug and have been unable to do so. I will try once more on another date.

kjwoodsISIS commented 7 years ago

Similar problem reported on LARMOR:

From: Dalgliesh, Robert (STFC,RAL,ISIS) Sent: 14 March 2017 22:00 To: Howells, Gareth (STFC,RAL,ISIS); Akeroyd, Freddie (STFC,RAL,ISIS); Woods, Kevin (Tessella,RAL,ISIS) Cc: Washington, Adam (STFC,RAL,ISIS); Nilsen, Goran (STFC,RAL,ISIS); Stewart, Ross (STFC,RAL,ISIS) Subject: Larmor bench limits

Hi, We have been seeing some very odd problems with the Larmor rotating bench over the past couple of days.

The limit of the bench seems to not be functioning properly and keeps changing. After determining a safe limit on Monday I set this in EPICS. However, once we tried to move back to the limit it turned out that we had to add 36 deg to this number to make it work. Again today we moved to this limit and found that we had to add a further 2 degrees to the limit. The bench itself seems to be moving reproducibly but this limit behaviour is both very odd and has wasted about 4 hours of beam time to date.

Gareth seems to think that it might be a similar problem to the ones seen on IMAT.

Rob

John-Holt-Tessella commented 7 years ago

Had a very quick look through the patches in the latest release; the only two which seem like they might be connected are #1847 and #1990. @FreddieAkeroyd did we update EPICS base as well?

kjwoodsISIS commented 7 years ago

Reply to Rob:

From: Woods, Kevin (Tessella,RAL,ISIS) Sent: 15 March 2017 10:49 To: Dalgliesh, Robert (STFC,RAL,ISIS); Howells, Gareth (STFC,RAL,ISIS); Akeroyd, Freddie (STFC,RAL,ISIS) Cc: Washington, Adam (STFC,RAL,ISIS); Nilsen, Goran (STFC,RAL,ISIS); Stewart, Ross (STFC,RAL,ISIS) Subject: RE: Larmor bench limits

Hi Rob,

We are investigating the problem right now. We agree – this behaviour is very odd.

We’d like to try and pin-point when the problem happened – to see if we can spot anything in the logs. You say you set a safe limit on Monday. How soon afterwards did you notice the problem? Was it immediate, or was there a delay? Similarly, you moved the bench to a limit yesterday, but then had to add two extra degrees. Did you notice this problem immediately, or only after a period of working?

Kevin

John-Holt-Tessella commented 7 years ago

Extra info: The problem on ZOOM was on Galil 02. It was spotted on Monday (13/3/2017) and was fine on Friday. It affected all descriptions and the position reported by the 04 and 08. Labview was watching it and it also reported a change in the reported position so it appears to be something telling the motor itself which is affected. The motors have not moved. This could be the same as the problem reported by Larmor, in that if the position being reported is offset then the limits will not have been changed for the new offset.

Speculation: It may be something to do with autosave not being applied properly. There is a report in the log (on zoom ...Var\logs\ioc\GALIL_02-20170313.log) which says:

[2017-03-13 17:03:42] sevr=info *** restoring from 'C:/Instrument/Var/autosave/GALIL_02/GALIL_02_settings.sav' at initHookState 6 (before record/device init) ***

[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DIR' failed

[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DHLM' failed

But this seems late for a problem and I cannot find a similar thing in LARMOR, although I only had a brief look.
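To compare instruments more quickly, the failure lines quoted above could be pulled out of an IOC log with a short script. This is only a sketch: the regular expression is based solely on the three log lines shown here, and `find_failed_restores` is a hypothetical helper name, not part of any IBEX tool.

```python
import re

# Matches lines such as:
# [2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DIR' failed
# The layout is inferred from the log excerpt quoted above (an assumption).
FAILED_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+dbFindRecord for '(?P<pv>[^']+)' failed"
)


def find_failed_restores(log_text):
    """Return (timestamp, pv) pairs for every dbFindRecord failure in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            failures.append((match.group("time"), match.group("pv")))
    return failures


sample_log = """\
[2017-03-13 17:03:42] sevr=info *** restoring from 'C:/Instrument/Var/autosave/GALIL_02/GALIL_02_settings.sav' at initHookState 6 (before record/device init) ***
[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DIR' failed
[2017-03-13 17:03:42] dbFindRecord for 'IN:ZOOM:MOT:MTR0201.DHLM' failed
"""

failures = find_failed_restores(sample_log)
for timestamp, pv in failures:
    print(timestamp, pv)
```

Running the same function over the LARMOR logs would show whether the same failures appear there.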

2 other oddities:

· Limit switches are not reporting correctly on ZOOM - they should all be at their limits
· Autosave settings have a weird EGU value (see ...Var\autosave\GALIL_01\GALIL_01_settings.sav_170309-152013):
```
IN:ZOOM:MOT:MTR0101.DESC PGC
IN:ZOOM:MOT:MTR0101.EGU mmcaput IN:ZOOM
IN:ZOOM:MOT:MTR0101.RTRY 10
```

FreddieAkeroyd commented 7 years ago

@John-Holt-Tessella I would be very surprised if #1847 or #1990 caused a problem. #1847 just changed an access security group setting to stop excessive logging; #1990 defined an alternative move command which in the worst case might stop a move happening if it got set to something accidentally, but would not cause a random move by itself.

FreddieAkeroyd commented 7 years ago

Can I just understand "Labview was watching it and it also reported a change in the reported position so it appears to be something telling the motor itself which is affected. The motors have not moved." The motors are reporting a change in position but no change has actually taken place?

KathrynBaker commented 7 years ago

That is pretty much it. The motor isn’t moved, but suddenly the position is redefined, which is probably a more understandable and accurate way of stating the situation. We were seeing something similar with the absolute encoders: every time the IOC was started they would ‘jump’. This was tied down to the value being set for the type of encoder. On the older Galil model 0 and 1 are the only options, but on the newer model (used for the absolute encoders) there is a third option of -1, which needs to be used for those axes. The meaning of the resolution of this setting isn’t something I can comment on off the top of my head, but I can try to explain it in some help documentation if that would be of benefit. However, the motors with the positions being redefined are steppers with standard encoders, so the act of redefinition has to be coming from somewhere. In the last week it has been noticed twice on ZOOM, both times on GALIL_02: on axes D and H most recently, and on axis G before that.

John, re the oddities:

· If you drive the motor towards the limit using EPICS, it sees the limit, but the limits are not picked up on startup of the IOC (including restart). This is something I noticed during my testing that I have to write up

· The settings are transferred via a generated script which is copied and pasted. I will correct the units and check the values that would have been set after them

AdrianPotter commented 7 years ago

From analysing the times at which the autosave error John identified happens, I can see at those times they are trying to save an almost empty autosave file:

# autosave R5.3 Automatically generated - DO NOT MODIFY - 170309-143803
<END>

The file has no records, which is why the load fails. Since the load fails, the motor starts with its default settings. This will cause the description and position to appear to change:

· The description will be set to its default, which will obviously be different
· Although the motor won't have moved, the MRES field will revert to its default, meaning the same raw motor position is scaled to a different reported value.
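The MRES effect can be illustrated with a little arithmetic. The sketch below assumes the usual motor record relationship in which the reported position is the raw step count multiplied by MRES. All the numbers are hypothetical and are not taken from ZOOM.

```python
def reported_position(raw_steps, mres):
    """Scale a raw step count into engineering units using the motor resolution."""
    return raw_steps * mres


raw_steps = 400          # hypothetical raw count in the controller; it never changes
configured_mres = 0.25   # hypothetical commissioned resolution (units per step)
default_mres = 1.0       # hypothetical default used when the settings are not restored

before = reported_position(raw_steps, configured_mres)
after = reported_position(raw_steps, default_mres)

# The motor has not moved, but the two readbacks differ.
print(before, after)
```

The raw count is identical in both cases, so the difference in readback comes purely from the scale factor.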

I've been thus far unable to reproduce the error. I've tried restarting the IOC many times, interrupting it at multiple stages of the startup process. I've tried changing the file permissions on the auto save file, and removing the live version entirely. None of my attempts have worked.

Notably I've scanned Larmor and it doesn't appear to have the same issue. I've scanned my own machine and it looks like it happened a couple of times. The only IOCs it will affect are those that create an autosave monitor, e.g.:


# Save motor positions every 5 seconds
create_monitor_set("$(IOCNAME)_positions.req", 5, "P=$(MYPVPREFIX)MOT:,IFDMC01=$(IFDMC01),IFDMC02=$(IFDMC02),IFDMC03=$(IFDMC03),IFDMC04=$(IFDMC04),IFDMC05=$(IFDMC05),IFDMC06=$(IFDMC06),IFDMC07=$(IFDMC07),IFDMC08=$(IFDMC08),IFDMC09=$(IFDMC09),IFDMC10=$(IFDMC10)")

# Save motor settings every 30 seconds
create_monitor_set("$(IOCNAME)_settings.req", 30, "P=$(MYPVPREFIX)MOT:,IFDMC01=$(IFDMC01),IFDMC02=$(IFDMC02),IFDMC03=$(IFDMC03),IFDMC04=$(IFDMC04),IFDMC05=$(IFDMC05),IFDMC06=$(IFDMC06),IFDMC07=$(IFDMC07),IFDMC08=$(IFDMC08),IFDMC09=$(IFDMC09),IFDMC10=$(IFDMC10)")

There are only a couple of IOCs that do that and the Galil is by far the most common.
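The scan for near-empty autosave files mentioned above could be sketched roughly as follows. This is not the script that was actually used: `has_no_records` and `find_empty_autosave_files` are hypothetical names, and the definition of "empty" (only `#` comment lines, blank lines and the `<END>` marker) is inferred from the file contents quoted earlier.

```python
from pathlib import Path


def has_no_records(sav_text):
    """True if the text holds only comments, blank lines and the <END> marker."""
    for line in sav_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or stripped == "<END>":
            continue
        return False  # found a line that looks like a saved PV value
    return True


def find_empty_autosave_files(autosave_root):
    """Yield files under autosave_root with '.sav' in their name and no records."""
    # "*.sav*" also matches dated backups such as GALIL_01_settings.sav_170309-152013
    for path in sorted(Path(autosave_root).rglob("*.sav*")):
        if path.is_file() and has_no_records(path.read_text(errors="replace")):
            yield path
```

For example, `for p in find_empty_autosave_files("C:/Instrument/Var/autosave"): print(p)` would list the suspect files on an instrument, assuming the autosave directory seen in the log above.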

AdrianPotter commented 7 years ago

The issue can be reproduced by starting the IOC once without the GALIL_02__GALILADDR02=192.168.1.202 macro. If the problem is not in autosave, then perhaps the macro was not read correctly. This would explain those times it happened on my own machine (I often forget to make sure the macro is set before starting the IOC). It doesn't explain why it has happened on ZOOM.
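A pre-start check along these lines might catch the missing macro before the IOC is started. It is only a sketch: it assumes globals.txt holds one `NAME=value` entry per line in the form shown above and that `#` starts a comment, and `parse_macros` and `missing_macros` are hypothetical helpers rather than existing IBEX functions.

```python
def parse_macros(globals_text):
    """Parse NAME=value lines into a dict, skipping blanks and '#' comments."""
    macros = {}
    for line in globals_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        macros[name.strip()] = value.strip()
    return macros


def missing_macros(globals_text, required):
    """Return the required macro names that are absent or have an empty value."""
    macros = parse_macros(globals_text)
    return [name for name in required if not macros.get(name)]


sample = "GALIL_02__GALILADDR02=192.168.1.202\n"
print(missing_macros(sample, ["GALIL_02__GALILADDR02"]))  # macro present
print(missing_macros("", ["GALIL_02__GALILADDR02"]))      # macro missing
```

A non-empty result would be a reason not to start the Galil IOC until the macro has been set.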

KathrynBaker commented 7 years ago

Given that my work with globals.txt was to get the macro being assigned, then it does explain why it happened on ZOOM – I was in the process of getting the macros into the right place, so there would have been instances where it was likely to have been started without any macro value. There may be some subtleties within this, such as why only some of the values were reset, but I guess that could have been the last good autosave value?

No help for what LARMOR is seeing in regards to limits though – maybe that needs a ticket all its own for looking at during the shutdown.

AdrianPotter commented 7 years ago

Thanks for the info. I'll record the findings somewhere in the troubleshooting wiki. Not sure about Larmor's limits. I'll take a look at their logs but it might be a separate ticket.

AdrianPotter commented 7 years ago

I've added some troubleshooting notes here:

https://github.com/ISISComputingGroup/ibex_developers_manual/wiki/IOC-And-Device-Trouble-Shooting

kjwoodsISIS commented 7 years ago

Is there anything we can do to minimise the risk of this type of thing happening again? For example, if default values for all the macros were provided in our standard settings, that might reduce the chances of the macros being left unset (or does providing default macro values create problems of its own?). Other suggestions?

KathrynBaker commented 7 years ago

This issue was mainly down to the globals.txt being at the wrong level. We could add it into the steps for setting up a new instrument, but providing default values in there for all items might not be a wise move. At the moment, we tend to put Galil IPs in globals.txt (rightly or wrongly, these aren’t going to change with configs usually, but extras could be added that way…) and they do have a specific IP range, so we could, on the wiki in amongst any other Galil setup instructions that exist, provide those default values so that it becomes a copy and paste from there.

kjwoodsISIS commented 7 years ago

"providing default values in there for all items might not be a wise move", that may be true, but providing no value for some items is also unwise (as we have just experienced). Since putting globals.txt at the wrong level was also a contributory factor, what can we do to make it easier to put it at the right level (or harder to put it at the wrong level)?

AdrianPotter commented 7 years ago

Marking this for review. Just check my troubleshooting explanation makes sense. From a user perspective, the GALILADDR shouldn't be changed and they shouldn't be doing anything to globals.txt after the instrument is set up. This is likely only a problem the Ibex team will meet.

GDH-ISIS commented 7 years ago

May I request that this be discussed a little further. I have seen Rob on LARMOR changing globals.txt on numerous occasions in the past. I think we should be cautious with the assumption that it will remain static.

GDH-ISIS commented 7 years ago

Regarding this ticket, have we not got two problems intertwined? This ticket relates to restarting a Galil IOC and, on restart, the read back of an axis being incorrect, with no move being made or requested. I believe this has been seen on ZOOM and IMAT. I believe IMAT's issue has been resolved (with luck) (https://github.com/ISISComputingGroup/ControlsWork/issues/162).

(There is another ticket in IBEX relating to limit issues #2184 and another reference in the Controls work https://github.com/ISISComputingGroup/ControlsWork/issues/150 - this I believe has been seen on LARMOR and IMAT)

Tom-Willemsen commented 7 years ago

r.e. @GDH-ISIS comment above: is this ticket actually ready for review yet?

Have all the relevant discussions been had about instrument scientists changing globals.txt? If so, is the result of that discussion documented anywhere?

i.e. are instrument scientists meant to be modifying globals.txt or not? If so, how have we mitigated the risk of this issue recurring? If not, then how can Rob and others(?) maintain the workflows they need?

The instructions on the troubleshooting page of the dev wiki seem fine to me but I just want to confirm that it's a sufficient solution to this ticket.

Tom-Willemsen commented 7 years ago

Having just discussed this briefly with @KathrynBaker I will pull the above questions out into a separate ticket (#2204) and mark this one as complete.

FreddieAkeroyd commented 7 years ago

I think the issue here is not globals.txt related - it is the Galil IOC corrupting the rest of its otherwise valid autosave settings if it is not given an IP address. It is strange that the file is nearly blank; I would have expected a file of incorrect values instead.

ISISComputingGroup / IBEX

Motors: Position getting reassigned without being asked to - Reproduce this #2180