epics-base / pvAccessCPP

pvAccessCPP is an EPICS V4 C++ module
https://epics-base.github.io/pvAccessCPP/
Other
9 stars 22 forks source link

invalid bit-set length error #158

Closed jjaraalm closed 4 years ago

jjaraalm commented 4 years ago

Hi,

I've run into a very annoying issue and I have no idea what's wrong. The simplest case has a PV structure like this:

four structure
    epics:nt/NTScalar:1.0 x1
    epics:nt/NTScalar:1.0 x2
    epics:nt/NTScalar:1.0 x3
    epics:nt/NTScalar:1.0 x4

Where each scalar looks like:

    epics:nt/NTScalar:1.0 x1
        int value 0
        time_t timeStamp <undefined>              
            long secondsPastEpoch 0
            int nanoseconds 0
            int userTag 0
        display_t display
            double limitLow 0
            double limitHigh 0
            string description 
            string format 
            string units 
        control_t control
            double limitLow 0
            double limitHigh 0
            double minStep 0

Now I try to write to one of them and I get an error

$ pvput -q four x4.value=1
Error: invalid bit-set length
...

However, if I have 3 or 5 scalars in the outer structure there are no issues. The problem also occurs with 8 and 16 scalars (although I haven't checked all numbers in between).

mdavidsaver commented 4 years ago

Can you link to, or attach, the server source?

jjaraalm commented 4 years ago

I'm using pvapy to host them. Here's a script to setup several test PVs. Also, I can get the error with 5 scalars, but it depends on which field I try to write to. For reference, I'm using epics-base from conda, currently version 7.0.3.1.

import pvaccess as pva

def create_scalar(value_type):

    display = pva.PvObject({
        'limitLow': pva.DOUBLE,
        'limitHigh': pva.DOUBLE,
        'description': pva.STRING,
        'format': pva.STRING,
        'units': pva.STRING
    }, 'display_t')

    control = pva.PvObject({
        'limitLow': pva.DOUBLE,
        'limitHigh': pva.DOUBLE,
        'minStep': pva.DOUBLE
    }, 'control_t')

    return pva.PvObject({
        'value': value_type,
        'timeStamp': pva.PvTimeStamp(),
        'display': display,
        'control': control
    })

def create_scalar_group(n):
    scalar = create_scalar(pva.INT)
    return pva.PvObject(dict(zip(['x'+str(i+1) for i in range(n)], [scalar]*n)))

if __name__ == "__main__":
    server = pva.PvaServer()

    for i in range(1, 5):
        pvname = 'test' + str(i)
        print("Creating PV: " + pvname)
        server.addRecord(pvname, create_scalar_group(i))

    input('Press Enter to quit.')
mdavidsaver commented 4 years ago

The error message appears to originate on the client side.

https://github.com/epics-base/pvAccessCPP/blob/03b01210211a4f74fa435556522381ab48342b69/src/remoteClient/clientContextImpl.cpp#L441 https://github.com/epics-base/pvAccessCPP/blob/03b01210211a4f74fa435556522381ab48342b69/src/remoteClient/clientContextImpl.cpp#L930-L934

Here pvPutBitSet is the mask of fields being written. m_bitSet is used to hold the mask received from the previous network op (INIT or GET). So it can quite legitimately be zero. It's not clear to me why this test is needed.

However, if I have 3 or 5 scalars in the outer structure there are no issues. The problem also occurs with 8 and 16 scalars (although I haven't checked all numbers in between).

It is suggestive that your NTScalar (which is missing alarm) has exactly 16 fields. So the overall composite has 1+16*4 or 65 fields, and thus 65 bits in the changed mask BitSet. 64 is a magic number in the BitSet wire encoding.

https://github.com/epics-base/pvDataCPP/blob/a1c0e432ee605c8abaa8f34185503c2886e12545/src/misc/bitSet.cpp#L352-L357

However, I'm unable to replicate this issue. Which versions of pvDataCPP, pvAccessCPP, and Base are involved?

Can you try making the following change and report what numbers are printed?

diff --git a/src/remoteClient/clientContextImpl.cpp b/src/remoteClient/clientContextImpl.cpp
index 97632395..ac64250b 100644
--- a/src/remoteClient/clientContextImpl.cpp
+++ b/src/remoteClient/clientContextImpl.cpp
@@ -927,6 +927,7 @@ public:
             return;
         }

+        std::cerr<<"XXX "<<pvPutBitSet->size()<<" < "<<m_bitSet->size()<<std::endl;
         if (pvPutBitSet->size() < m_bitSet->size())
         {
             EXCEPTION_GUARD3(m_callback, cb, cb->putDone(invalidBitSetLengthStatus, thisPtr));
jjaraalm commented 4 years ago

Thanks for looking into this so quickly. The issue started with a more complex structure. I removed alarms and tried to simplify the problematic PV down as much as possible to isolate the issue.

I've been using the binary distribution of 7.0.3.1 through conda.

I downloaded and build the 7.0.3.1 source from https://epics-controls.org/download/base/base-7.0.3.1.tar.gz and was able to reproduce it. This is supposed to include pvAccess 7.1.0 I believe.

With your logging and the server script above:

Success with 3:

$ ./pvput -q test3 x3.value=1
XXX 64 < 64

Failure with 4:

$ ./pvput -q test4 x4.value=1
XXX 64 < 128
Error: invalid bit-set length

Failure with 5:

$ ./pvput -q test5 x1.value=1
XXX 64 < 128
Error: invalid bit-set length

Success with 5:

$ ./pvput -q test5 x5.value=1
XXX 128 < 128

So if I understand this right, the client is trying to send only a partial structure and doesn't zero-pad the bitset of the partial structure (64) to the full size of the destination structure (128), which triggers the error.

I was able to reproduce this using a simpler structure with 64 int fields. Only the last one can be written to.

mdavidsaver commented 4 years ago

So if I understand this right, the client is trying to send only a partial structure and doesn't zero-pad the bitset of the partial structure (64) to the full size of the destination structure (128), which triggers the error.

This is my understanding as well, though it is actually more complicated than this since the same m_bitSet is updated if the client fetches the current value. pvget does this in order to have any enum choices list to map string to index.

So I think this error check is itself in error. Can you try to remove it?

In addition to manual testing, please also run the automatic tests with make runtests and report any failures.

diff --git a/src/remoteClient/clientContextImpl.cpp b/src/remoteClient/clientContextImpl.cpp
index 5fd7344a..a4f31bb4 100644
--- a/src/remoteClient/clientContextImpl.cpp
+++ b/src/remoteClient/clientContextImpl.cpp
@@ -925,12 +925,6 @@ public:
             return;
         }

-        if (pvPutBitSet->size() < m_bitSet->size())
-        {
-            EXCEPTION_GUARD3(m_callback, cb, cb->putDone(invalidBitSetLengthStatus, thisPtr));
-            return;
-        }
-
         if (!startRequest(m_lastRequest.get() ? QOS_DESTROY : QOS_DEFAULT)) {
             EXCEPTION_GUARD3(m_callback, cb, cb->putDone(otherRequestPendingStatus, thisPtr));
             return;
jjaraalm commented 4 years ago

Okay, is there a way to run tests for pvaccess only? I tried running all tests from base-7.0.3.1 but it bails out on libcom with:

epicsSockResolveTest.t ........ Bailout called.  Further testing stopped:  hostToIPAddr() is broken, testing not possible

I'm not sure how to set TOP correctly in the pvaccess module directory. I tried make runtests TOP=../.. and similar but could not get it to work.

mdavidsaver commented 4 years ago

Your ISP is hijacking DNS replies. The wonders of working from home?

A quick way to get past this is to comment out the line TESTS += epicsSockResolveTest in modules/libcom/test/Makefile.

jjaraalm commented 4 years ago

Commented that out and got it to run. All tests passed. I also manually tested against the test PVs I had without issue as well as the original ones that were causing problems.

mdavidsaver commented 4 years ago

I've removed this check with 4ef9c1102b422c0217cbb07c6d1a574f5baa4d83