epics-base / epicscorelibs

EPICS core libraries packaged as a "python" module
https://pypi.org/project/epicscorelibs/

Occasional ca connect issue #16

Open coretl opened 1 year ago

coretl commented 1 year ago

When doing a lot of ca_connects in a short space of time (1000ish in a few seconds), some of them take longer than usual to connect. I've had a go at reproducing this, and I have a fairly minimal reproducer:

import ctypes
import subprocess
import sys
import tempfile
import time
from queue import Queue

import pytest
from epicscorelibs.ca import cadef

NUM = 1000

@pytest.fixture(scope="module")
def ioc():
    cadef.ca_context_create(1)
    with tempfile.NamedTemporaryFile() as records:
        records.write(
            b"\n".join(b'record(longout, "LONGOUT%d") {\n}' % i for i in range(NUM))
        )
        records.flush()
        process = subprocess.Popen(
            [sys.executable, "-m", "epicscorelibs.ioc", "-d", records.name],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        yield process
        try:
            process.communicate("exit()")
        except ValueError:
            # Someone else already called communicate
            pass

@pytest.mark.parametrize("iteration", [i for i in range(NUM)])
def test_get_pv(iteration, ioc):
    q = Queue()

    @cadef.connection_handler
    def put_on_queue(args):
        if args.op == cadef.CA_OP_CONN_UP:
            q.put(None)

    t = time.time()
    chid = ctypes.c_void_p()
    cadef.ca_create_channel(
        "LONGOUT%d" % iteration, put_on_queue, 0, 0, ctypes.byref(chid)
    )
    q.get()
    cadef.ca_clear_channel(chid.value)
    assert time.time() - t < 2

This will make an IOC with 1000 PVs, then do a ca_connect and wait for connection to each of those PVs serially. Most take about 0.1s to connect, but about 1 in 1000 will take between 2 and 9 seconds. Have you ever seen this, @mdavidsaver?

mdavidsaver commented 1 year ago

This may be a manifestation of one of the long-standing "gotchas" of libca. ca_create_channel() queues a request to connect, but does not always flush immediately. The intent is that on startup, a large client will issue many ca_create_channel() calls, then one ca_flush_io().
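
For illustration, a minimal sketch of that batched pattern, reusing the cadef bindings plus the put_on_queue handler and q queue from the reproducer above (assumed names, not tested):

# queue every ca_create_channel() first...
chids = []
for i in range(NUM):
    chid = ctypes.c_void_p()
    cadef.ca_create_channel(
        "LONGOUT%d" % i, put_on_queue, 0, 0, ctypes.byref(chid)
    )
    chids.append(chid)
# ...then one flush sends the queued search requests for the whole batch
cadef.ca_flush_io()
# wait for every connection callback
for _ in range(NUM):
    q.get()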

coretl commented 1 year ago

I just noticed I forgot to put the flushes in the test script, although they are there in aioca where the problem manifests itself. Let me try again with flushes...

coretl commented 1 year ago

We still get the occasional long connection time with the flushes inserted:

@pytest.mark.parametrize("iteration", [i for i in range(NUM)])
def test_get_pv(iteration, ioc):
    q = Queue()

    @cadef.connection_handler
    def put_on_queue(args):
        if args.op == cadef.CA_OP_CONN_UP:
            q.put(None)

    t = time.time()
    chid = ctypes.c_void_p()
    cadef.ca_create_channel(
        "LONGOUT%d" % iteration, put_on_queue, 0, 0, ctypes.byref(chid)
    )
    cadef.ca_flush_io()
    q.get()
    cadef.ca_clear_channel(chid.value)
    cadef.ca_flush_io()
    assert time.time() - t < 2

coretl commented 1 year ago

With these additional flush_io calls we get fewer of the long connection times, but maybe 1 in 10000 is still over the 2s threshold.

coretl commented 1 year ago

We see this occasionally on beamlines with R3.14.12.7 too: ones that do thousands of ca connects (with cothread) at the start of their scripts occasionally have one that crosses the 5s threshold for reporting an error.

mdavidsaver commented 1 year ago

Are there other CA servers/clients on the host/network where you are running this test? e.g. is the test client receiving beacons from CA servers other than the IOC being run by the script?

Could you try isolating your test by setting:

export EPICS_CA_SERVER_PORT=15064
export EPICS_CA_REPEATER_PORT=15065
export EPICS_CAS_AUTO_BEACON_ADDR_LIST=NO
export EPICS_CA_AUTO_ADDR_LIST=NO
export EPICS_CA_ADDR_LIST=127.255.255.255

(maybe replacing 15064 and 15065 with any two UDP ports which are not used for CA traffic)

coretl commented 1 year ago

I'm inside a container running under podman using a bridge network:

$ podman network inspect podman
[
     {
          "name": "podman",
          "id": "2f259bab93aaaaa2542ba43ef33eb990d0999ee1b9924b557b7be53c0b7a1bb9",
          "driver": "bridge",
          "network_interface": "cni-podman0",
          "created": "2022-11-30T17:07:23.673996864Z",
          "subnets": [
               {
                    "subnet": "10.88.0.0/16",
                    "gateway": "10.88.0.1"
               }
          ],
          "ipv6_enabled": false,
          "internal": false,
          "dns_enabled": false,
          "ipam_options": {
               "driver": "host-local"
          }
     }
]

I've set those environment variables and run the tests again; I get 2 failures out of 10000.

mdavidsaver commented 1 year ago

eg. is the test client receiving beacons from CA servers other than the IOC being run by the script?

Another possible scenario is if the newly started test IOC, after sending an initial beacon, receives a burst of search requests from other clients. This might result in some of the searches from the test client being dropped.

mdavidsaver commented 1 year ago

Running with 100000 iterations I see 3 iterations with time > 2.

FAILED caspam.py::test_get_pv[0] - assert (1669828549.3180916 - 1669828545.2171466) < 2
FAILED caspam.py::test_get_pv[82984] - assert (1669830814.1469836 - 1669830808.6978564) < 2
FAILED caspam.py::test_get_pv[93670] - assert (1669831114.0119646 - 1669831104.7841363) < 2

The latter two instances have associated warnings from (I think) ca_clear_channel().

CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "localhost:15064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Wed Nov 30 2022 09:53:28.899190043

It might be interesting to restructure this script to record all of the times, to see if there is any trend.

mdavidsaver commented 1 year ago

There have long been occasional reports of (re)connect slowdowns w/ CA at scale. So far these have been anecdotal and/or not reproduced. While these may not all have the same underlying cause, it is useful to have a test case which is reproducible (eventually; 100000 iterations takes me ~45 minutes to run).

caspam.py.txt

EPICS_CA_SERVER_PORT=15064 \
EPICS_CA_REPEATER_PORT=15065 \
EPICS_CAS_AUTO_BEACON_ADDR_LIST=NO \
EPICS_CA_AUTO_ADDR_LIST=NO \
EPICS_CA_ADDR_LIST=127.255.255.255 \
pytest caspam.py

@ralphlange @anjohnson fyi.

coretl commented 1 year ago

It might be interesting to restructure this script to record all of the times, to see if there is any trend.

I've done this:

import ctypes
import subprocess
import sys
import tempfile
import time
from queue import Queue

import matplotlib.pyplot as plt
import numpy as np
from epicscorelibs.ca import cadef

NUM = 10000

cadef.ca_context_create(1)
with tempfile.NamedTemporaryFile() as records:
    records.write(
        b"\n".join(b'record(longout, "LONGOUT%d") {\n}' % i for i in range(NUM))
    )
    records.flush()
    process = subprocess.Popen(
        [sys.executable, "-m", "epicscorelibs.ioc", "-d", records.name],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )

    q = Queue()

    @cadef.connection_handler
    def put_on_queue(args):
        if args.op == cadef.CA_OP_CONN_UP:
            q.put(None)

    times = []

    for i in range(NUM):
        t = time.time()
        chid = ctypes.c_void_p()
        cadef.ca_create_channel("LONGOUT%d" % i, put_on_queue, 0, 0, ctypes.byref(chid))
        cadef.ca_flush_io()
        q.get()
        cadef.ca_clear_channel(chid.value)
        cadef.ca_flush_io()
        delta = time.time() - t
        times.append(delta)
        print(f"Iteration {i} took {delta}s", end="\r", flush=True)
        if delta > 2:
            print()

    times = np.array(times)
    print(f"Min: {np.min(times)}")
    print(f"Max: {np.max(times)}")
    print(f"Mean: {np.mean(times)}")
    print(f"Std: {np.std(times)}")
    plt.plot(times)
    plt.show()

    process.communicate("exit()")

coretl commented 1 year ago

Running it gives output:

CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:04:10.499135081
..................................................................
Iteration 142 took 5.146530389785767s
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:05:14.448424270
..................................................................
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:05:48.958387129
..................................................................
Iteration 3533 took 6.636682033538818s
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:05:58.976466355
..................................................................
Iteration 3658 took 6.590296506881714s
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:06:33.814351332
..................................................................
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:06:37.253065722
..................................................................
Iteration 4757 took 8.310995101928711s
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:07:36.591295418
..................................................................
Iteration 6635 took 8.941535234451294s
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:07:54.173563448
..................................................................
CA.Client.Exception...............................................
    Warning: "Virtual circuit disconnect"
    Context: "9d5ddb748a21:5064"
    Source File: modules/ca/src/client/cac.cpp line 1237
    Current Time: Thu Dec 01 2022 12:08:06.336889937
..................................................................
Iteration 7353 took 9.19033432006836s
Min: 0.0006492137908935547
Max: 9.19033432006836
Mean: 0.03215343375205994
Std: 0.18755087448609026

coretl commented 1 year ago

Figure_1 Zoomed out

coretl commented 1 year ago

Figure_2 Zoomed in

mdavidsaver commented 1 year ago

Thanks, I'm happy to see that there is no trend.

I don't have a good feel for what all could be causing these occasional delays. So the following is speculative.

The number of Virtual circuit disconnect warnings seems to be the same as the number of >1s connect times. Since each of these channels should connect to the same IOC, there should only be one circuit open at a time. Maybe there is a race in the connection/circuit cache? e.g. a subsequent ca_create_channel() sometimes finds no existing circuit, sometimes one ready for use, but occasionally picks up a partially closed circuit.

If so, then the delays should go away if you add a second channel, which is kept open throughout the entire test. Waiting for this one to connect would also synchronize with test IOC startup, removing the longer delay on the first iteration.

I also have old suspicions that there are problems with the search backoff timer logic of libca. This might manifest as discrete steps in the anomalous delays. However, I don't think ~9 samples is enough to show this one way or the other.
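
One way to look for such steps would be to histogram the anomalous delays; a minimal sketch, assuming the times array recorded by the script above (the 1s cutoff and 0.5s bin width are arbitrary illustrative choices):

import numpy as np

# bin only the anomalous delays to see whether they cluster at discrete values
slow = times[times > 1.0]
counts, edges = np.histogram(slow, bins=np.arange(0.0, 12.0, 0.5))
for low, high, n in zip(edges[:-1], edges[1:], counts):
    if n:
        print(f"{low:4.1f}-{high:4.1f} s: {n}")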

Maybe setting EPICS_CA_MAX_SEARCH_PERIOD would be interesting? I wonder if these anomalous delays can ever be longer than the maximum search period?

coretl commented 1 year ago

Interesting, keeping a channel open to the IOC for the duration makes it work reliably:

import ctypes
import subprocess
import sys
import tempfile
import time
from queue import Queue

import matplotlib.pyplot as plt
import numpy as np
from epicscorelibs.ca import cadef

NUM = 100000

if __name__ == "__main__":
    cadef.ca_context_create(1)
    with tempfile.NamedTemporaryFile() as records:
        records.write(
            b"\n".join(b'record(longout, "LONGOUT%d") {\n}' % i for i in range(NUM))
        )
        records.flush()
        process = subprocess.Popen(
            [sys.executable, "-m", "epicscorelibs.ioc", "-d", records.name],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        try:
            q = Queue()

            @cadef.connection_handler
            def put_on_queue(args):
                if args.op == cadef.CA_OP_CONN_UP:
                    q.put(None)

            def connect(pv_name):
                chid = ctypes.c_void_p()
                cadef.ca_create_channel(pv_name, put_on_queue, 0, 0, ctypes.byref(chid))
                cadef.ca_flush_io()
                q.get()
                return chid

            def disconnect(chid):
                cadef.ca_clear_channel(chid.value)
                cadef.ca_flush_io()

            times = []
            chid0 = connect("LONGOUT0")

            for i in range(1, NUM):
                t = time.time()
                chid = connect("LONGOUT%d" % i)
                disconnect(chid)
                delta = time.time() - t
                times.append(delta)
                print(f"Iteration {i} took {delta}s", end="\r", flush=True)
                if delta > 2:
                    print()

            disconnect(chid0)

            times = np.array(times)
            print(f"Min: {np.min(times)}")
            print(f"Max: {np.max(times)}")
            print(f"Mean: {np.mean(times)}")
            print(f"Std: {np.std(times)}")
            plt.plot(times)
            plt.show()
        finally:
            process.communicate("exit()")

What might this indicate?

The docs on EPICS_CA_MAX_SEARCH_PERIOD suggest the minimum value is 60s. Is this correct? I've certainly never seen delays that long before; the longest I've seen is about 11s. What do you suggest I set it to?

mdavidsaver commented 1 year ago

Interesting, keeping a channel open to the IOC for the duration makes it work reliably: ... What might this indicate?

I think this points towards a problem with the connection/circuit phase, and away from a problem with search phase. So there is probably no need to try changing EPICS_CA_MAX_SEARCH_PERIOD.

My working theory is that a ca_clear_channel() for the last channel associated with a connection/circuit starts the process of closing down that circuit, but doesn't wait for this process to finish (for both worker threads to exit). So after a subsequent ca_create_channel(), the search phase might sometimes complete and find the old dead/dying circuit.

This theory is mostly informed by my experience with pvAccessCPP, which had, and still has, some similar problems with worker synchronization when closing channels. So it is still a bit of a guess.
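
A minimal sketch of one way to probe this theory, reusing the connect/disconnect helpers and times list from the last script above (the 0.05s pause is an arbitrary illustrative value): if pausing after each disconnect lets the old circuit finish shutting down, the anomalous delays should disappear, much as they did with the sentinel channel.

for i in range(1, NUM):
    t = time.time()
    chid = connect("LONGOUT%d" % i)
    delta = time.time() - t   # time the connect only
    disconnect(chid)
    times.append(delta)
    time.sleep(0.05)          # let circuit teardown complete before the next create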