DiamondLightSource / cothread

Cooperative Python Threads and EPICS Channel Access bindings
Apache License 2.0

Calling caget on a disconnected channel fails #17

Closed blondejamtart closed 4 years ago

blondejamtart commented 4 years ago

There doesn't seem to be any way to wait for the channel to reconnect.

Araneidae commented 4 years ago

@thomascobb , I'd be interested in your ideas for this. One option would be for caget/caput to always wait for reconnection ... but that would be a possibly incompatible change.

Edit: It looks like this is exactly what it does now, so whatever Brian is seeing is something else.

Araneidae commented 4 years ago

@btester271828 , I'm not able to reproduce this in a simple way. In my simple test, I do:

  1. caget from an existing PV
  2. close the IOC serving the PV
  3. caget the PV

It looks like cothread correctly waits for the PV to reconnect, timing out if this doesn't happen in time. Can you please give a small self-contained demonstration of this problem?

Edit: It's possible that what we're seeing here is a race condition, with the PV disconnecting between caget waiting for the connection to complete and interrogating the channel for its underlying data type. If so, I doubt this is fixable.
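
To make the suspected race concrete, here is a minimal sketch that needs no EPICS installation. FakeChannel, Disconnected and get_type are hypothetical names invented for illustration; they are not cothread's real classes, and the real code path may differ. The sketch only shows the general check-then-use pattern: the connection check passes, the channel drops in the gap, and the following type query fails.

```python
class Disconnected(Exception):
    """Raised when a disconnected channel is interrogated."""


class FakeChannel:
    """Hypothetical stand-in for a CA channel; not cothread's real class."""

    def __init__(self):
        self.connected = True

    def wait_for_connection(self):
        # Step 1, the check: succeeds only if connected right now.
        if not self.connected:
            raise Disconnected("timed out waiting for connection")

    def field_type(self):
        # Step 2, the use: fails if the channel dropped after the check.
        if not self.connected:
            raise Disconnected("channel disconnected before type query")
        return "DBR_LONG"


def get_type(channel, between_steps=lambda: None):
    """Check-then-use; between_steps models whatever happens in the gap."""
    channel.wait_for_connection()
    between_steps()  # e.g. the IOC exits at exactly this moment
    return channel.field_type()


def demo():
    # Nothing happens in the gap: the query succeeds.
    ok = get_type(FakeChannel())

    # The disconnect lands in the gap: the check passed, but the use fails.
    chan = FakeChannel()

    def drop():
        chan.connected = False

    try:
        get_type(chan, between_steps=drop)
        raced = False
    except Disconnected:
        raced = True
    return ok, raced
```

Calling demo() exercises both paths. The point is that no amount of checking before the query removes the gap, since the disconnect is driven from outside the caller.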

blondejamtart commented 4 years ago

The specific case I had was:

  1. caget existing PV
  2. close the IOC serving it
  3. restart the IOC on a different TCP port
  4. caget the new PV

thomascobb commented 4 years ago

I think as long as it reconnected within the timeout then that would be fine...

blondejamtart commented 4 years ago

The following Python script reproduces the error (it will need tweaks for importing cothread/numpy):

import sys
import os
import subprocess
import time

from pkg_resources import require
require("numpy")
sys.path.append("/scratch/myr45768/Git/cothread")
from cothread import catools

# epics_base = '/scratch/myr45768/Git/epics-base'
epics_base = '/dls_sw/epics/R3.14.12.7/base'

softIoc_bin = epics_base + "/bin/linux-x86_64/softIoc"

# load some pre-existing template & define macros for it

db_template = '/dls/technical/controls/myr45768/pymalcolm/malcolm/modules/system/db/system.template'

stats = dict()
sys_call_bytes = open('/proc/%s/cmdline' % os.getpid(), 'rb').read().split(b'\0')
sys_call = [el.decode("utf-8") for el in sys_call_bytes]
stats["pymalcolm_path"] = os.path.abspath(sys_call[1])
stats["yaml_path"] = os.path.abspath(sys_call[2])
stats["yaml_ver"] = "bugMaker"

stats["pymalcolm_ver"] = "not pymalcolm"
hostname = os.uname()[1]
stats["kernel"] = "%s %s" % (os.uname()[0], os.uname()[2])
stats["hostname"] = hostname if len(hostname) < 39 else hostname[:35] + '...'
pid = os.getpid()
stats["pid"] = pid

simultaneous = 10

iocs = [None] * simultaneous
db_macros = [None] * simultaneous

for i in range(len(iocs)):
    db_macros[i] = "prefix='pc0111-BUG-R01-%02d'" % (i + 1)
    for key, value in stats.items():
        db_macros[i] += ",%s='%s'" % (key, value)

# done defining db template, launch some IOCs!

for repeats in range(100):

    print("Iteration %d" % repeats)
    for i in range(len(iocs)):
        iocs[i] = subprocess.Popen(
            softIoc_bin + " -m " + db_macros[i] + " -d " + db_template + " 2> err",
            stdout=subprocess.PIPE, stdin=subprocess.PIPE, shell=True)

    time.sleep(0.5)

    errored = False
    for i in range(len(iocs)):
        val = catools.caget('pc0111-BUG-R01-%02d:PID' % (i + 1))

    for ioc in iocs:
        ioc.terminate()
    time.sleep(0.5)

Araneidae commented 4 years ago

This looks like a race condition between checking whether the channel is connected and actually using it. With asynchronous CA handling (see commit 897a29fdd4fa0560b63e135139d86ae69557d7d3), avoiding this is fundamentally impossible.

Closing this as cannot sensibly fix.
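
For anyone hitting this: since the race cannot be closed inside the library, the remaining option is to retry on the caller's side. The following is only a sketch of that workaround, not something cothread provides. get_with_retry and all of its parameters are made-up names, and the thread does not say which exception the failing caget raises, so the default catches any Exception and should be narrowed once the actual type is known.

```python
import time


def get_with_retry(get, pv, attempts=3, delay=0.5, sleep=time.sleep,
                   retry_on=(Exception,)):
    """Call get(pv), retrying when it raises one of retry_on.

    get is any caget-like callable taking a PV name. Every name here is
    illustrative; none of this is part of cothread.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return get(pv)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            sleep(delay)  # give the channel time to reconnect
```

With cothread this would presumably be called as get_with_retry(catools.caget, 'pc0111-BUG-R01-01:PID', sleep=cothread.Sleep), passing cothread.Sleep so that the wait yields to other cothreads instead of blocking them as time.sleep would. A retry narrows the window but does not remove it: each attempt can still lose the same race.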