epics-extensions / ca-gateway

Channel Access PV Gateway
http://www.aps.anl.gov/epics/extensions/gateway/

misconfigured softioc causes gateway crash #20

Closed goetzpf closed 5 years ago

goetzpf commented 5 years ago

A softioc with a waveform record, running without EPICS_CA_MAX_ARRAY_BYTES set, causes the gateway to crash with "PV Gateway Aborting (SIGSEGV)". The bug was seen with EPICS base 3.15.5 and CA-gateway R2-1-0-0. It seems to happen when the gateway is started with the option "-no_cache".

I have attached a recipe that shows how to reproduce the problem: unpack the tar file and follow the instructions in README.rst. You need to have pyepics or edm installed to reproduce the error.

GATEWAY-ERROR.tar.gz

aawdls commented 5 years ago

We have also observed this issue at Diamond with ca-gateway R2-1-0-0 and base R3.14.12.7. I had a little go with the test recipe above. We had not seen the issue until we recently updated from R2-0-4-0, so I tested the recipe with the gateway built at different points between these two releases. It looks like the commit c6dc1597d991e may be related: if I run the recipe with a gateway built from the previous commit, I do not observe the crash, whereas I do see it from this commit onwards. If I start the gateway with the debug level at 50 and run:

caget -d 34 gptest

I get the following output from the gateway:

$ ./gw-start.sh 
starting gateway with args -sip 172.23.244.99  -cip 172.23.244.99  -sport 6264 -cport 6164 -no_cache -debug 50

option dump:
 home = </scratch/GATEWAY-ERROR/GW>
 log file = <<terminal>>
 access file = <NULL>
 list file = <NULL>
 command file = <NULL>
 putlog file = <NULL>
 report file = <gateway.report>
 debug level = 50
 connect timeout = 1
 disconnect timeout = 7200
 reconnect inhibit time = 300
 inactive timeout = 7200
 dead timeout = 120
 event mask = va
 caching = disabled
 archive monitor = disabled
 user id= 1207735
 group id= 1207735
gateway setting <EPICS_CA_AUTO_ADDR_LIST=NO>
gateway setting <EPICS_CA_ADDR_LIST=172.23.244.99>
gateway setting <EPICS_CAS_INTF_ADDR_LIST=172.23.244.99>
gateway setting <EPICS_CA_SERVER_PORT=6164>
gateway setting <EPICS_CAS_SERVER_PORT=6264>
Apr 23 16:35:44 PV Gateway Version 2.2.0-DEV [Apr 23 2019 16:30:40]
EPICS 3.14.12.7 PID=13715
EPICS_CA_ADDR_LIST=172.23.244.99
EPICS_CA_AUTO_ADDR_LIST=NO
EPICS_CA_SERVER_PORT=6164
EPICS_CA_MAX_ARRAY_BYTES=6000000
EPICS_CAS_INTF_ADDR_LIST=172.23.244.99
EPICS_CAS_SERVER_PORT=6264
EPICS_CAS_IGNORE_ADDR_LIST=Not specified
Running as user tdq39642 on host pc0095.cs.diamond.ac.uk
gateServer()
Statistics PV prefix is pc0095.cs.diamond.ac.uk
Apr 23 16:35:44 gateAsCa: Invalid access security
gateServer::pvExistTest(ctx=0x1369b60,pv=gptest)
gateServer::pvExistTest() gptest real name gptest
gateServer::pvExistTest() gptest creating new gatePv
gatePvData(gateServer=0x13659b0,name=gptest)
gatePvData::init(gateServer=0x13659b0,name=gptest)
gatePvData::init entry pattern=.*)
gateServer::pvExistTest() gptest connecting (new async ET)
gatePvData::accessCB(gatePvData=0x137f990)
accCB: -------------------------------
accCB: name=gptest
accCB: type=-1
accCB: number of elements=0
accCB: host name=pc0095.cs.diamond.ac.uk:6164
accCB: read access=1
accCB: write access=1
accCB: state=0
gatePvData::connectCB(gatePvData=0x137f990)
conCB: -------------------------------
conCB: name=gptest
conCB: type=6
conCB: number of elements=2048
conCB: host name=pc0095.cs.diamond.ac.uk:6164
conCB: read access=1
conCB: write access=1
conCB: state=2
gatePvData::connectCB() connection ok
gatePvData::life() name=gptest
gatePvData::life() connecting PV
gatePvData::flushAsyncETQueue() name=gptest
gatePvData::flushAsyncETQueue() posting 0x13a1730
~gateAsyncE()
gateServer::pvExistTest(ctx=0x1369b60,pv=gptest)
gateServer::pvExistTest() gptest real name gptest
gateServer::pvExistTest() gptest exists (inactive)
gateServer::pvAttach() PV gptest
gateVcData(gateServer=0x13659b0,name=gptest)
gatePvData::activate(gateVcData=0x13a1b10) name=gptest
gatePvData::activate() inactive PV
gateVcData::postAccessRights() posting access rights
gateVcData::postAccessRights() posting access rights
gateVcData::vcAdd() name=gptest
gateVcData::vcAdd() connecting -> ready
gateVcData::vcNew() name=gptest
gateVcData::bestExternalType()
gateVcData::createChannel()
gateVcData::maxDimension() gptest 1
gateVcData::maxBound(0) gptest 2048
gateVcData::read() name=gptest
gatePvData::get() name=gptest
gatePvData::get() active PV
gatePvData::get() NO_CACHE doing ca_array_get_callback of type CTRL (34)
gatePvData::getCB(gatePvData=0x137f990)
gateVcData::vcData() name=gptest
gateVcData::flushAsyncReadQueue() name=gptest
gateVcData::flushAsyncReadQueue() (ctrl read) posting asyncr 0x13a42d0 (DD at 0x13a3f20)
Apr 23 16:35:47 PV Gateway Aborting (SIGSEGV)
./gw-start.sh: line 10: 13715 Aborted                 ../bin/cagateway $ARGS

I think it is also related to using CTRL types. For example, a command line caget which specifies a non-CTRL type does not cause the crash, but specifying a CTRL type does.

caget gptest # No crash
caget -d 17 gptest # No crash
caget -d 34 gptest # Crash
caget -d 29 gptest # Crash
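
For anyone decoding the -d numbers: Channel Access request types are numbered in blocks of seven, one block per metadata class. A minimal Python sketch of that numbering, following the DBR_* constants in EPICS base's db_access.h (the helper name dbr_name is made up for illustration):

```python
# Channel Access DBR request types 0..34 come in five blocks of seven:
# plain, STS, TIME, GR and CTRL, each covering the same seven base types.
BASE_TYPES = ["STRING", "SHORT", "FLOAT", "ENUM", "CHAR", "LONG", "DOUBLE"]
CLASSES = ["", "STS", "TIME", "GR", "CTRL"]


def dbr_name(dbr_type: int) -> str:
    """Return the DBR_* name for a numeric request type (hypothetical helper)."""
    if not 0 <= dbr_type < len(BASE_TYPES) * len(CLASSES):
        raise ValueError(f"not a plain DBR request type: {dbr_type}")
    cls, base = divmod(dbr_type, len(BASE_TYPES))
    parts = ["DBR", CLASSES[cls], BASE_TYPES[base]]
    # The plain class has an empty name, so drop it when joining.
    return "_".join(p for p in parts if p)


for n in (17, 29, 34):
    print(n, dbr_name(n))
```

So -d 17 requests DBR_TIME_ENUM, while 29 and 34 are DBR_CTRL_SHORT and DBR_CTRL_DOUBLE, which matches the "type CTRL (34)" line in the gateway output.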

ralphlange commented 5 years ago

@aawdls: do you also see the relation to the EPICS_CA_MAX_ARRAY_BYTES setting? @goetzpf: do you also see the correlation with the data type (CTRL or not)?

aawdls commented 5 years ago

Hi @ralphlange, thanks for your comment.

Yes, I see the same relation to EPICS_CA_MAX_ARRAY_BYTES that @goetzpf described. This is what we observed in the field: a production gateway went down whenever a certain waveform PV was viewed in an EDM screen through the gateway. The PV comes from an IOC where EPICS_CA_MAX_ARRAY_BYTES was left at the default value (16408), which is smaller than the waveform size. Setting EPICS_CA_MAX_ARRAY_BYTES in the IOC to a sufficiently large value stops this from happening.

I am able to reproduce it using @goetzpf 's recipe.
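
To put rough numbers on the size relation: the gateway output earlier in this thread reports the test PV as type 6 (DBR_DOUBLE) with 2048 elements. The Python sketch below estimates the payload for a few request types. The per-type sizes are assumptions based on the struct layouts in EPICS base's db_access.h, payload_bytes is a made-up helper, and the limit is the default value quoted in this thread. Whether this size difference is what actually triggers the crash is not established here.

```python
# Assumed sizes in bytes: (first element including metadata, each further element).
# These follow the struct layouts in db_access.h and are NOT read from a live system.
DBR_SIZES = {
    "DBR_DOUBLE": (8, 8),
    "DBR_TIME_DOUBLE": (24, 8),
    "DBR_CTRL_DOUBLE": (88, 8),
}


def payload_bytes(dbr_type: str, count: int) -> int:
    """Estimate the payload size for `count` elements (hypothetical helper)."""
    first, each = DBR_SIZES[dbr_type]
    return first + (count - 1) * each


LIMIT = 16408  # default EPICS_CA_MAX_ARRAY_BYTES as quoted in this thread
COUNT = 2048   # element count from the conCB lines in the gateway output

for dbr_type in DBR_SIZES:
    size = payload_bytes(dbr_type, COUNT)
    verdict = "fits" if size <= LIMIT else "exceeds limit"
    print(dbr_type, size, verdict)
```

Under these assumptions only the CTRL request exceeds the quoted limit, which would be consistent with the CTRL-only crashes reported above. Treat it as a plausibility check, not a diagnosis.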

willrogers commented 5 years ago

Hi @ralphlange,

Is there an official owner for the ca-gateway, or is it managed by good-will contributions such as yours? I realise that just coordinating changes is a fair amount of work.

Would we be best to figure out a fix for ourselves and submit a pull request for review?

ralphlange commented 5 years ago

I am the reluctant de-facto maintainer. I'm happy to answer questions, review and coordinate suggestions, do the release procedures etc. but I don't see myself having enough time for active investigation of issues that do not currently show up in our own systems.

So: Yes, please go ahead and try to find a fix.

I did add a test setup for functional testing and a few tests (mainly to test the test setup) during one of my last active Gateway phases. If - as part of a fix - you could add a test showing the bug and preventing regression ... that would be perfect.

ralphlange commented 5 years ago

@goetzpf could you try to verify if this proposed fix works for you? Andy's reasoning sounds just right.