epics-extensions / ca-gateway

Channel Access PV Gateway
http://www.aps.anl.gov/epics/extensions/gateway/
Other
17 stars 17 forks source link

subscriptions to waveforms suddenly break #33

Closed TBirkeHZB closed 3 years ago

TBirkeHZB commented 3 years ago

After 4 weeks of running without problems, all of a sudden, an increasing number of subscriptions to waveforms (as well as subArrays, compress, ...) suddenly stopped working properly and only providing the client with the first value.

We suspected the problem to be related to our update to ca-gateway 2.1.2 and EPICS base 3.15.8.

First we could not reproduce the problem, but today Götz Pfeiffer ran a thorough test. Finally he found out, that the problem was introduces by patch 90b24a75da26591b73bc200f146ecace0f74b5f9 fixing #11

Store the app type for DBR_CTRL requests then post complete GDDs (fixes #11)

Once knowing, that it must be related to the DBR-type of a client, we managed to reproduce the problem.

  1. create a test-PV with random waveform data updating e.g. once per second
  2. run a camonitor on that PV in one terminal
  3. do a caget -d DBR_STSACK_STRING of this PV in another terminal

From step 3 on, the monitor will only post the first element of the waveform. Setting highestDbrType to DBR_STSACK_STRING breaks all active subscriptions.

ralphlange commented 3 years ago

Oh. my. STSACK_STRING.

The only client I know using that DBR is the good old ALH.

TBirkeHZB commented 3 years ago

So... I can easily imagine, why this happens only at our place ;-) Although, I can't find any waveforms monitored by our alh-instances... but there might be others at HZB running alh somewhere...

ralphlange commented 3 years ago

Since you got your hands (arms, upper body,...) dirty already - do you have a suggestion for a fix?

Looking at the code, the intention is clearly to only increase the highestGddAppType for CTRL requests:

        unsigned at=dd.applicationType();
        if (highestGddAppType < at) {
            if (at >= gddDbrToAit[DBR_CTRL_SHORT].app && at <= gddDbrToAit[DBR_CTRL_DOUBLE].app) {
                highestGddAppType = at;
                gateDebug1(10, "gateVcData::read() increasing highestGddAppType to %u\n", highestGddAppType);
            }
        }

Does the Ait mapping break the order? Then I would have to compare on the DBR side.

ralphlange commented 3 years ago

In my VM (Base 3.15 master, file gddApps.h), the order seems good: CTRL_SHORT = 29, CTRL_DOUBLE = 34, STSACK_STRING = 35.

TBirkeHZB commented 3 years ago

We didn't look that far into the code before... Weird thing is... Götz also reproduced the issue in R2-1-0 - so it exists for quite a while now. And it's not related to the aforementioned patch! It's just, that with this patch, the issue is present immediately as soon as all the disconnected clients re-connect (or there's one bad client here in our networks, we don't know of, that is just not running anymore - that'll be hard to track down).

Effectively disabling the patch at least shows no problems with clients re-connecting. Intentionally calling caget -d 37 ... still breaks subscriptions to this PV.

Currently, we're limping into 2020s final user-opreation week with a patched version, that just doesn't show that issue immediatly (when connecting to all the clients running here).

So my issue description reduces to:

caget -d DBR_STSACK_STRING test:waveform through a gateway breaks all subscriptions to this PV until the last client disconnects.

BTW: Has the mapping changed? caget -h shows DBR_STSACK_STRING 37 here, but all the CTRL-types are indeed in range 28..34.

ralphlange commented 3 years ago

That mapping is done in generated code... The "aitTypes" are part of the infamous gdd module. Deprecated for more than a decade and for a reason.

ralphlange commented 3 years ago

The DBR codes have definitely not changed since ever - that would break Channel Access compatibility.

mdavidsaver commented 3 years ago

... that'll be hard to track down

Since use of DBR_STSACK_STRING is rare, you might be able to catch the client(s) with:

tshark -X lua_script:ca.lua -i any -Y ca.dtype==37

ca.lua comes from https://github.com/mdavidsaver/cashark

tshark being the CLI front end for wireshark.

ralphlange commented 3 years ago

Götz came up with a path that improves the situation, but doesn't solve it. gateway-2.1.2.patch

smarsching commented 3 years ago

We are seeing the same problem. However, I think that in our case it isn’t related to the DBR_STSACK_STRING type because we see this problem happening in a very small network where I am pretty sure that there is nothing using this type (the only client of which I am aware is a computer running Control System Studio). So maybe there are other conditions which can trigger this as well. I will try to find out what causes this in our case and let you know. Maybe this can help us in finding the relevant section of code.

smarsching commented 3 years ago

I found a way to reliably reproduce the problem that we are seeing. I am not entirely sure whether it is the same problem, but at least it looks very similar. In our case, the problem can reliably be triggered by creating a subscription using the DBR_CTRL_DOUBLE type in addition to an existing subscruption using the DBR_TIME_DOUBLE type.

We only started seeing this problem recently and I wonder whether it might be related to us upgrading to EPICS Base 7.0.5, but on the other hand we initially only saw it happening in one network, so there might be other relevant factors.

Anyway, I now have a setup where I can reliably cause this problem, so debugging it should become a lot easier. Here is my test setup in case anyone else wants to dig into this:

I use a test IOC that generates a waveform with random data:

record(acalcout, "test:acalcout") {
  field(CALC, "ARNDM")
  field(NELM, "39")
  field(SCAN, ".1 second")
  field(FLNK, "test:aai")
}

record(aai, "test:aai") {
  field(FTVL, "DOUBLE")
  field(NELM, "39")
  field(INP,  "test:acalcout.AVAL")
}

Then, I run a gateway instance that connects to this IOC:

#!/bin/sh

EPICS_CA_ADDR_LIST=127.0.0.1
EPICS_CA_AUTO_ADDR_LIST=NO
EPICS_CAS_INTF_ADDR_LIST=127.0.0.1
EPICS_CAS_SERVER_PORT=15064

export EPICS_CA_ADDR_LIST EPICS_CA_AUTO_ADDR_LIST EPICS_CAS_INTF_ADDR_LIST EPICS_CAS_SERVER_PORT

/path/to/gateway \
  -archive \
  -no_cache \
  -access access.conf \
  -pvlist pvlist.conf

The access.conf and pvlist.conf are as simple as they can be.

access.conf:

ASG(DEFAULT) {
  RULE(1,READ)
}

pvlist.conf:

EVALUATION ORDER ALLOW, DENY

.*                ALLOW

I then use a little C program to connect to the channel and trigger the bug:

#include <stdio.h>
#include <unistd.h>

#include <cadef.h>

static void ctrlCallback(struct event_handler_args args) {
  long elementCount = args.count;
  const struct dbr_ctrl_double *value = args.dbr;
  printf("DBR_CTRL_DOUBLE: [");
  for (long i = 0; i < elementCount; ++i) {
    if (i) {
      printf(", ");
    }
    printf("%f", (&(value->value))[i]);
  }
  printf("]\n");
}

static void timeCallback(struct event_handler_args args) {
  long elementCount = args.count;
  const struct dbr_ctrl_double *value = args.dbr;
  printf("DBR_TIME_DOUBLE: [");
  for (long i = 0; i < elementCount; ++i) {
    if (i) {
      printf(", ");
    }
    printf("%f", (&(value->value))[i]);
  }
  printf("]\n");
}

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "Usage: gateway-test <channel name>\n");
    return 1;
  }
  char const *channelName = argv[1];

  int result = ca_context_create(ca_enable_preemptive_callback);
  if (result != ECA_NORMAL) {
    fprintf(
      stderr,
      "CA error %s occurred while trying to start channel access.\n",
      ca_message(result)
    );
    return 1;
  }

  chid channelId;
  result = ca_create_channel(channelName, NULL, NULL, 0, &channelId);
  if (result != ECA_NORMAL) {
    fprintf(
      stderr,
      "CA error %s occurred while trying to create the channel.\n",
      ca_message(result)
    );
    return 1;
  }

  // Wait for channel to be connected.
  while (ca_state(channelId) == cs_never_conn) {
    ca_pend_io(0.1);
  }

  // We get the native element count and use it instead of requesting a
  // dynamically size array. We do this because the same thing happens in the
  // scenario which seems to trigger the bug in the gateway.
  unsigned long elementCount = ca_element_count(channelId);

  result = ca_create_subscription(
    DBR_TIME_DOUBLE,
    elementCount,
    channelId,
    DBE_VALUE | DBE_ALARM,
    timeCallback,
    NULL,
    NULL
  );
  if (result != ECA_NORMAL) {
    fprintf(
      stderr,
      "CA error %s occurred while trying to create subscription for type "
      "DBR_TIME_DOUBLE.\n",
      ca_message(result)
    );
    return 1;
  }

  // Sleep for five seconds before creating the second subscription in order to
  // demonstrate the problem.
  ca_pend_event(5.0);

  // In our original scenario, the second subscription was created with an
  // element count of one, not the native element count, but the bug seems to be
  // triggered regardless of that, so in order to reduce the number of possible
  // causes, we use the native element count here.
  // Please note that we use a different event mask here (like we did in the
  // original scenario), but this makes no difference either. The problem
  // appears even when using the same event mask.
  result = ca_create_subscription(
    DBR_CTRL_DOUBLE,
    elementCount,
    channelId,
    DBE_PROPERTY,
    ctrlCallback,
    NULL,
    NULL
  );
  if (result != ECA_NORMAL) {
    fprintf(
      stderr,
      "CA error %s occurred while trying to create subscription for type "
      "DBR_CTRL_DOUBLE.\n",
      ca_message(result)
    );
    return 1;
  }

  // Wait another five seconds in order to see the problem.
  ca_pend_event(5.0);

  ca_clear_channel(channelId);

  return 0;
}

When running this program, I set EPICS_CA_ADDR_LIST so that it connects to the gateway instead of the IOC:

EPICS_CA_ADDR_LIST=127.0.0.1:15064 EPICS_CA_AUTO_ADDR_LIST=NO ./gateway-test test:aai

Initially, the waveforms sent by the gateway are fine, but after creating the DBR_CTRL_DOUBLE subscription, we see that only zeros are sent, so we know that the bug has been triggered.

Next, I will try to find the cause of the bug in the gateway code, but with other things on my plate and the EPICS workshop next week, I don’t know yet how quickly I will be able to figure this out, so if someone else wants to join in, they are welcome.

smarsching commented 3 years ago

I think I found the cause of this bug:

There is some logic in gateVcData::vcPostEvent (gateVC.cc:1010) that checks the highestGddAppType and uses some special handling if this is the equivalent of one of the DBR_CTRL_ types. In this case, a GDD that includes the meta-data is constructed and gateVcData::copyState is called.

The problem here is that copyState first copies the meta-data from pv_data and then copies the actual values from event_data. While this makes sense, the smartCopy function that is used to copy the event_data only copies the first element if the value member of the destination GDD is a scalar. And the objects created for the dbr_ctrl_ app types use scalars by default…

@ralphlange This explains why your patch helps: It avoids the use of the different app types in these places and thus ensures that all elements are used. Of course, this comes at the price that meta-data won’t be passed on correctly any longer.

I also think that I found a possible solution: There is a function createDBRDD in castStreamClient.cc which creates a GDD corresponding to a certain DBR type (include the DBR_CTRL_ types) and widening it to be able to store a specified number of elements. I think that we can use the same logic in gateVc.cc. Unfortunately, we will have to copy the code from PCAS bacause the described function is static and not exported.

There is one detail that I still have to figure out: The copyState function is also used in a few other places in gateVc.cc. I am not sure yet whether we have the same problem there, but I think it is likely and would explain why Ralph’s patch doesn’t fix this problem completely. I will have a closer look at this and then I will try to write a patch that addresses all these issues.

ralphlange commented 3 years ago

(Götz' patch, not mine.)

Your analysis sounds reasonable and consistent. Kudos!

The Gateway is the main user of PCAS, the use of which has been deprecated for more than a decade. If you find a cleaner fix using an existing PCAS function, making that public in PCAS can be considered. (As long as you don't have to change its behaviour.)

smarsching commented 3 years ago

Yes, you are right, it’s Götz’s patch. I guess it’s getting late and I should call it a day.

Regarding the function: I guess we could make it public in PCAS, but then this would also have to be changed in EPICS Base 3.15 or CA gateway wouldn’t compile against this version any longer, so maybe duplicating this code in CA gateway is the better option. In total, we are talking about ~ 100 lines of code here, so it might be justfiable in order to keep the CA gateway compatible with older releases of EPICS Base and the PCAS module.

As a quick solution, I am going to write a patch that includes this code anyway, so that I can get it quickly deployed to our CA gateways. We can then discuss whether we want to keep this as a permanent solution or whether we make the corresponding function in PCAS public and thus reduce the code size.

On a side note: After spending about 8 hours combing through the GDD / PCAS code, trying to understand what’s happening there, I fully understand why this code was deprecated. I wouldn’t recommend starting anything new with it either. 🙂

ralphlange commented 3 years ago

Thanks a lot for your efforts! Sorry for the GDD/PCAS experience. At least it's amusement tax free...

Once you have a candidate fix, please attach it here so the colleagues at BESSY can run it through their test setup. @TBirkeHZB, @goetzpf: Get ready...

smarsching commented 3 years ago

I created PR #36 that addresses this issue and two additional issues that I found while fixing the original problem.

The patch reliably fixes the problem in my test environment. And I am going to roll it out on our production CA gateway in the next few hours (as soon as the build of the updated Ubuntu packages has completed).

@TBirkeHZB @goetzpf Could you please check that it also fixes the problem for you? My problem was slightly different than yours (it was triggered by using DBR_CTRL_* instead of DBR_STSACK_STRING), but I found that the copyState function, where I fixed the problem, is also called by some code specifically handling DBR_STSACK_STRING, so I am optimistic that your problem might be fixed as well.

For easier testing, here is a patch file that contains all the changes from the PR: ca-gateway-issue-33.patch.txt

If this doesn’t fix the problem completely, please apply the following patch in addition to the first patch:

ca-gateway-issue-33-extra.patch.txt

This patch addresses additional locations where smartCopy is used. I am not sure whether this is needed, because this depends on both the source and destionation GDD types, but it could be that it is needed.

Please let me know whether the first patch alone, the first patch combined with the second patch, or neither of the patches fully fixes the issues that you have been experiencing.

@ralphlange Don’t worry about the GDD / PCAS code. I have been in software engineering long enough to have seen worse code (probably including some of my own code). 😉

BTW, I think I found the commit that introduced the bug regarding the DBR_CTRL types. It probably was 90b24a75da26591b73bc200f146ecace0f74b5f9.

ralphlange commented 3 years ago

Might that actually also be the root cause of #11? Smells like that. And my fix for that made it worse. Nice.

smarsching commented 3 years ago

No, I don’t think this was the root cause for #11. Your fix was based on the right idea, it just missed the DBR_GR types and introduced the bug regarding the arrays.

I guess the code could be simplified by always using one of the dbr_ctrl GDD types because PCAS will take care of downconverting to the actually requested type. The only downside of this solution is that it would incur the extra overhead of using that complex type even if there is no monitor that requested it, so I think that your patch provides a nice optimization for this case.

BTW, I got some feedback from our users and it seems like my patch indeed fixes the problem (and doesn’t introduce any new ones, at least none that they found so far). The interesting question is now whether it also fixes the problems with DBR_STSACK_STRING.

ralphlange commented 3 years ago

@TBirkeHZB @goetzpf Any chance that you can check the suggested fix?

ralphlange commented 3 years ago

@TBirkeHZB @goetzpf How much vacation do you guys have?

goetzpf commented 3 years ago

I tested ca-gateway-issue-33.patch.txt today (2021-08-30) and it fixes our problem with waveforms. I think you can add this patch to the codebase and create a new release.

When you create a new release, could you please also fix the version message the gateway displays at startup ?

Here is my patch file for this: ca-gateway-2.1.2-versionstring.txt

Greetings Goetz

ralphlange commented 3 years ago

Thanks, Götz. I will merge and create a new release (including the version string fix) when I have a minute.

ralphlange commented 3 years ago

@goetzpf: that version string fix was already done by Freddie Akeroyd in December 2019 and merged to trunk in January 2020.