Closed TBirkeHZB closed 3 years ago
Oh. my. STSACK_STRING.
The only client I know using that DBR is the good old ALH.
So... I can easily imagine, why this happens only at our place ;-) Although, I can't find any waveforms monitored by our alh-instances... but there might be others at HZB running alh somewhere...
Since you got your hands (arms, upper body,...) dirty already - do you have a suggestion for a fix?
Looking at the code, the intention is clearly to only increase the highestGddAppType
for CTRL requests:
unsigned at=dd.applicationType();
if (highestGddAppType < at) {
if (at >= gddDbrToAit[DBR_CTRL_SHORT].app && at <= gddDbrToAit[DBR_CTRL_DOUBLE].app) {
highestGddAppType = at;
gateDebug1(10, "gateVcData::read() increasing highestGddAppType to %u\n", highestGddAppType);
}
}
Does the Ait mapping break the order? Then I would have to compare on the DBR side.
In my VM (Base 3.15 master, file gddApps.h
), the order seems good:
CTRL_SHORT = 29, CTRL_DOUBLE = 34, STSACK_STRING = 35.
We didn't look that far into the code before... Weird thing is... Götz also reproduced the issue in R2-1-0 - so it exists for quite a while now. And it's not related to the aforementioned patch! It's just, that with this patch, the issue is present immediately as soon as all the disconnected clients re-connect (or there's one bad client here in our networks, we don't know of, that is just not running anymore - that'll be hard to track down).
Effectively disabling the patch at least shows no problems with clients re-connecting. Intentionally calling caget -d 37 ...
still breaks subscriptions to this PV.
Currently, we're limping into 2020s final user-opreation week with a patched version, that just doesn't show that issue immediatly (when connecting to all the clients running here).
So my issue description reduces to:
caget -d DBR_STSACK_STRING test:waveform
through a gateway breaks all subscriptions to this PV until the last client disconnects.
BTW: Has the mapping changed? caget -h
shows DBR_STSACK_STRING 37
here, but all the CTRL-types are indeed in range 28..34.
That mapping is done in generated code... The "aitTypes" are part of the infamous gdd module. Deprecated for more than a decade and for a reason.
The DBR codes have definitely not changed since ever - that would break Channel Access compatibility.
... that'll be hard to track down
Since use of DBR_STSACK_STRING is rare, you might be able to catch the client(s) with:
tshark -X lua_script:ca.lua -i any -Y ca.dtype==37
ca.lua comes from https://github.com/mdavidsaver/cashark
tshark
being the CLI front end for wireshark.
Götz came up with a path that improves the situation, but doesn't solve it. gateway-2.1.2.patch
We are seeing the same problem. However, I think that in our case it isn’t related to the DBR_STSACK_STRING type because we see this problem happening in a very small network where I am pretty sure that there is nothing using this type (the only client of which I am aware is a computer running Control System Studio). So maybe there are other conditions which can trigger this as well. I will try to find out what causes this in our case and let you know. Maybe this can help us in finding the relevant section of code.
I found a way to reliably reproduce the problem that we are seeing. I am not entirely sure whether it is the same problem, but at least it looks very similar. In our case, the problem can reliably be triggered by creating a subscription using the DBR_CTRL_DOUBLE type in addition to an existing subscruption using the DBR_TIME_DOUBLE type.
We only started seeing this problem recently and I wonder whether it might be related to us upgrading to EPICS Base 7.0.5, but on the other hand we initially only saw it happening in one network, so there might be other relevant factors.
Anyway, I now have a setup where I can reliably cause this problem, so debugging it should become a lot easier. Here is my test setup in case anyone else wants to dig into this:
I use a test IOC that generates a waveform with random data:
record(acalcout, "test:acalcout") {
field(CALC, "ARNDM")
field(NELM, "39")
field(SCAN, ".1 second")
field(FLNK, "test:aai")
}
record(aai, "test:aai") {
field(FTVL, "DOUBLE")
field(NELM, "39")
field(INP, "test:acalcout.AVAL")
}
Then, I run a gateway instance that connects to this IOC:
#!/bin/sh
EPICS_CA_ADDR_LIST=127.0.0.1
EPICS_CA_AUTO_ADDR_LIST=NO
EPICS_CAS_INTF_ADDR_LIST=127.0.0.1
EPICS_CAS_SERVER_PORT=15064
export EPICS_CA_ADDR_LIST EPICS_CA_AUTO_ADDR_LIST EPICS_CAS_INTF_ADDR_LIST EPICS_CAS_SERVER_PORT
/path/to/gateway \
-archive \
-no_cache \
-access access.conf \
-pvlist pvlist.conf
The access.conf
and pvlist.conf
are as simple as they can be.
access.conf
:
ASG(DEFAULT) {
RULE(1,READ)
}
pvlist.conf
:
EVALUATION ORDER ALLOW, DENY
.* ALLOW
I then use a little C program to connect to the channel and trigger the bug:
#include <stdio.h>
#include <unistd.h>
#include <cadef.h>
static void ctrlCallback(struct event_handler_args args) {
long elementCount = args.count;
const struct dbr_ctrl_double *value = args.dbr;
printf("DBR_CTRL_DOUBLE: [");
for (long i = 0; i < elementCount; ++i) {
if (i) {
printf(", ");
}
printf("%f", (&(value->value))[i]);
}
printf("]\n");
}
static void timeCallback(struct event_handler_args args) {
long elementCount = args.count;
const struct dbr_ctrl_double *value = args.dbr;
printf("DBR_TIME_DOUBLE: [");
for (long i = 0; i < elementCount; ++i) {
if (i) {
printf(", ");
}
printf("%f", (&(value->value))[i]);
}
printf("]\n");
}
int main(int argc, char **argv) {
if (argc != 2) {
fprintf(stderr, "Usage: gateway-test <channel name>\n");
return 1;
}
char const *channelName = argv[1];
int result = ca_context_create(ca_enable_preemptive_callback);
if (result != ECA_NORMAL) {
fprintf(
stderr,
"CA error %s occurred while trying to start channel access.\n",
ca_message(result)
);
return 1;
}
chid channelId;
result = ca_create_channel(channelName, NULL, NULL, 0, &channelId);
if (result != ECA_NORMAL) {
fprintf(
stderr,
"CA error %s occurred while trying to create the channel.\n",
ca_message(result)
);
return 1;
}
// Wait for channel to be connected.
while (ca_state(channelId) == cs_never_conn) {
ca_pend_io(0.1);
}
// We get the native element count and use it instead of requesting a
// dynamically size array. We do this because the same thing happens in the
// scenario which seems to trigger the bug in the gateway.
unsigned long elementCount = ca_element_count(channelId);
result = ca_create_subscription(
DBR_TIME_DOUBLE,
elementCount,
channelId,
DBE_VALUE | DBE_ALARM,
timeCallback,
NULL,
NULL
);
if (result != ECA_NORMAL) {
fprintf(
stderr,
"CA error %s occurred while trying to create subscription for type "
"DBR_TIME_DOUBLE.\n",
ca_message(result)
);
return 1;
}
// Sleep for five seconds before creating the second subscription in order to
// demonstrate the problem.
ca_pend_event(5.0);
// In our original scenario, the second subscription was created with an
// element count of one, not the native element count, but the bug seems to be
// triggered regardless of that, so in order to reduce the number of possible
// causes, we use the native element count here.
// Please note that we use a different event mask here (like we did in the
// original scenario), but this makes no difference either. The problem
// appears even when using the same event mask.
result = ca_create_subscription(
DBR_CTRL_DOUBLE,
elementCount,
channelId,
DBE_PROPERTY,
ctrlCallback,
NULL,
NULL
);
if (result != ECA_NORMAL) {
fprintf(
stderr,
"CA error %s occurred while trying to create subscription for type "
"DBR_CTRL_DOUBLE.\n",
ca_message(result)
);
return 1;
}
// Wait another five seconds in order to see the problem.
ca_pend_event(5.0);
ca_clear_channel(channelId);
return 0;
}
When running this program, I set EPICS_CA_ADDR_LIST
so that it connects to the gateway instead of the IOC:
EPICS_CA_ADDR_LIST=127.0.0.1:15064 EPICS_CA_AUTO_ADDR_LIST=NO ./gateway-test test:aai
Initially, the waveforms sent by the gateway are fine, but after creating the DBR_CTRL_DOUBLE subscription, we see that only zeros are sent, so we know that the bug has been triggered.
Next, I will try to find the cause of the bug in the gateway code, but with other things on my plate and the EPICS workshop next week, I don’t know yet how quickly I will be able to figure this out, so if someone else wants to join in, they are welcome.
I think I found the cause of this bug:
There is some logic in gateVcData::vcPostEvent
(gateVC.cc:1010
) that checks the highestGddAppType
and uses some special handling if this is the equivalent of one of the DBR_CTRL_
types. In this case, a GDD that includes the meta-data is constructed and gateVcData::copyState
is called.
The problem here is that copyState
first copies the meta-data from pv_data
and then copies the actual values from event_data
. While this makes sense, the smartCopy
function that is used to copy the event_data
only copies the first element if the value member of the destination GDD is a scalar. And the objects created for the dbr_ctrl_
app types use scalars by default…
@ralphlange This explains why your patch helps: It avoids the use of the different app types in these places and thus ensures that all elements are used. Of course, this comes at the price that meta-data won’t be passed on correctly any longer.
I also think that I found a possible solution: There is a function createDBRDD
in castStreamClient.cc
which creates a GDD corresponding to a certain DBR type (include the DBR_CTRL_
types) and widening it to be able to store a specified number of elements. I think that we can use the same logic in gateVc.cc
. Unfortunately, we will have to copy the code from PCAS bacause the described function is static
and not exported.
There is one detail that I still have to figure out: The copyState
function is also used in a few other places in gateVc.cc
. I am not sure yet whether we have the same problem there, but I think it is likely and would explain why Ralph’s patch doesn’t fix this problem completely. I will have a closer look at this and then I will try to write a patch that addresses all these issues.
(Götz' patch, not mine.)
Your analysis sounds reasonable and consistent. Kudos!
The Gateway is the main user of PCAS, the use of which has been deprecated for more than a decade. If you find a cleaner fix using an existing PCAS function, making that public in PCAS can be considered. (As long as you don't have to change its behaviour.)
Yes, you are right, it’s Götz’s patch. I guess it’s getting late and I should call it a day.
Regarding the function: I guess we could make it public in PCAS, but then this would also have to be changed in EPICS Base 3.15 or CA gateway wouldn’t compile against this version any longer, so maybe duplicating this code in CA gateway is the better option. In total, we are talking about ~ 100 lines of code here, so it might be justfiable in order to keep the CA gateway compatible with older releases of EPICS Base and the PCAS module.
As a quick solution, I am going to write a patch that includes this code anyway, so that I can get it quickly deployed to our CA gateways. We can then discuss whether we want to keep this as a permanent solution or whether we make the corresponding function in PCAS public and thus reduce the code size.
On a side note: After spending about 8 hours combing through the GDD / PCAS code, trying to understand what’s happening there, I fully understand why this code was deprecated. I wouldn’t recommend starting anything new with it either. 🙂
Thanks a lot for your efforts! Sorry for the GDD/PCAS experience. At least it's amusement tax free...
Once you have a candidate fix, please attach it here so the colleagues at BESSY can run it through their test setup. @TBirkeHZB, @goetzpf: Get ready...
I created PR #36 that addresses this issue and two additional issues that I found while fixing the original problem.
The patch reliably fixes the problem in my test environment. And I am going to roll it out on our production CA gateway in the next few hours (as soon as the build of the updated Ubuntu packages has completed).
@TBirkeHZB @goetzpf Could you please check that it also fixes the problem for you? My problem was slightly different than yours (it was triggered by using DBR_CTRL_*
instead of DBR_STSACK_STRING
), but I found that the copyState
function, where I fixed the problem, is also called by some code specifically handling DBR_STSACK_STRING
, so I am optimistic that your problem might be fixed as well.
For easier testing, here is a patch file that contains all the changes from the PR: ca-gateway-issue-33.patch.txt
If this doesn’t fix the problem completely, please apply the following patch in addition to the first patch:
ca-gateway-issue-33-extra.patch.txt
This patch addresses additional locations where smartCopy
is used. I am not sure whether this is needed, because this depends on both the source and destionation GDD types, but it could be that it is needed.
Please let me know whether the first patch alone, the first patch combined with the second patch, or neither of the patches fully fixes the issues that you have been experiencing.
@ralphlange Don’t worry about the GDD / PCAS code. I have been in software engineering long enough to have seen worse code (probably including some of my own code). 😉
BTW, I think I found the commit that introduced the bug regarding the DBR_CTRL
types. It probably was 90b24a75da26591b73bc200f146ecace0f74b5f9.
Might that actually also be the root cause of #11? Smells like that. And my fix for that made it worse. Nice.
No, I don’t think this was the root cause for #11. Your fix was based on the right idea, it just missed the DBR_GR types and introduced the bug regarding the arrays.
I guess the code could be simplified by always using one of the dbr_ctrl
GDD types because PCAS will take care of downconverting to the actually requested type. The only downside of this solution is that it would incur the extra overhead of using that complex type even if there is no monitor that requested it, so I think that your patch provides a nice optimization for this case.
BTW, I got some feedback from our users and it seems like my patch indeed fixes the problem (and doesn’t introduce any new ones, at least none that they found so far). The interesting question is now whether it also fixes the problems with DBR_STSACK_STRING
.
@TBirkeHZB @goetzpf Any chance that you can check the suggested fix?
@TBirkeHZB @goetzpf How much vacation do you guys have?
I tested ca-gateway-issue-33.patch.txt today (2021-08-30) and it fixes our problem with waveforms. I think you can add this patch to the codebase and create a new release.
When you create a new release, could you please also fix the version message the gateway displays at startup ?
Here is my patch file for this: ca-gateway-2.1.2-versionstring.txt
Greetings Goetz
Thanks, Götz. I will merge and create a new release (including the version string fix) when I have a minute.
@goetzpf: that version string fix was already done by Freddie Akeroyd in December 2019 and merged to trunk in January 2020.
After 4 weeks of running without problems, all of a sudden, an increasing number of subscriptions to waveforms (as well as subArrays, compress, ...) suddenly stopped working properly and only providing the client with the first value.
We suspected the problem to be related to our update to ca-gateway 2.1.2 and EPICS base 3.15.8.
First we could not reproduce the problem, but today Götz Pfeiffer ran a thorough test. Finally he found out, that the problem was introduces by patch 90b24a75da26591b73bc200f146ecace0f74b5f9 fixing #11
Once knowing, that it must be related to the DBR-type of a client, we managed to reproduce the problem.
camonitor
on that PV in one terminalcaget -d DBR_STSACK_STRING
of this PV in another terminalFrom step 3 on, the monitor will only post the first element of the waveform. Setting highestDbrType to
DBR_STSACK_STRING
breaks all active subscriptions.