PandABlocks / PandABlocks-server

TCP server exposing an ASCII interface to functional blocks
Apache License 2.0
2 stars 3 forks source link

Unexpected error when arming #57

Open coretl opened 2 months ago

coretl commented 2 months ago

Somehow we hit this TEST_OK while doing *PCAP.ARM: https://github.com/PandABlocks/PandABlocks-server/blob/1ffcc0a1803a38a22fb948ba32b7b75debe558a7/server/data_server.c#L220-L222

Causing ERR: Unexpected error at .../data_server.c:220. This was the ID panda that @glennchid upgraded, so I guess this was the 3.0 release, rather than @EmilioPeJu's latest changes, although I can't see anything in them that touches this code.

There are 2 parts of the code that connect to a PandA:

Inserting a 1 second sleep between the two fixes the problem. Inserting a 0.1s sleep does not.

I have yet to reproduce this outside the application they were using, but before I do, do you know what might trigger this?

Araneidae commented 2 months ago

There seems to be a state where the flag data_capture_enabled is false but the data_buffer is still active. This buffer has three states: STATE_IDLE, STATE_ACTIVE, and STATE_CLEARING, the last being used when a data capture from hardware is complete but data is still be transferred to a connected data user. The function read_buffer_status returns false when the buffer is in STATE_IDLE and true otherwise.

Hmm. I think this is probably a bug in the return value from read_buffer_status, as the next line after the failing test is testing the number of still connected clients. This function is only used twice, the second call is just a few lines further down used to implement *PCAP.STATUS?: https://github.com/PandABlocks/PandABlocks-server/blob/ee4c35271426aa949b2b7001edd367687f7524f6/server/data_server.c#L269-L275

I think the simplest fix is just

diff --git a/server/data_server.c b/server/data_server.c
index 37bba82..54cb2bc 100644
--- a/server/data_server.c
+++ b/server/data_server.c
@@ -251,7 +251,7 @@ error__t arm_capture(void)
                 "Data capture already in progress")  ?:
             /* If data capture is not enabled then we can safely expect the
              * buffer status to be idle. */
-            TEST_OK(!read_buffer_status(data_buffer, &readers, &active))  ?:
+            DO(read_buffer_status(data_buffer, &readers, &active))  ?:
             TEST_OK_(active == 0, "Data clients still taking data")  ?:
             /* Get PCAP ARM timestamp and store as a global */
             TEST_IO(clock_gettime(CLOCK_REALTIME, &pcap_arm_ts))  ?:

A more thorough fix might involve exposing all three states from this function, decoding them properly in get_capture_status, and expecting not STATE_ACTIVE in the read buffer call above, but I don't think this is worth doing.

In brief, I think your test should be failing with the message ERR: Data clients still taking data instead.