Twilight-Logic / AR488

AR488 Arduino GPIB Interface
GNU General Public License v3.0
215 stars 62 forks source link

Crash with binary data. #33

Open donnie-j opened 1 year ago

donnie-j commented 1 year ago

Thanks for the great work, I have a number of these devices connected to instruments here and there, incredibly useful.

I'm atm working with a Tektronix scope, trying to get binary data from it, using a atmega32u4 device (back of the connector board). ++ver AR488 GPIB controller, ver. 0.51.15, 12/10/2022

I'm seeing a reproducible crash once the transfer reaches a certain point. Several 10s of kB. It is not clear if it is data dependent (although there is nothing special about the place in the data stream where the crash occurs), byte or transfer count dependent (although it does happen at the same place with the same data), or something related to the instrument (e.g. timeout).

The symptom is at a certain point, the AR488 device will appear to hang, and then disappear from the USB bus. It leaves the MacOS CDC driver in an inconsistent state also. It is not certain that the instrument will crash, but it often hangs... causality of that is unclear.

What is a good way to debug this? That is, how have the developers been debugging such issues?

Thanks again in advance.

Twilight-Logic commented 1 year ago

Thanks for the report. Its an interesting problem. My first thought was whether its the Arduino hardware, so could I ask whether you have tried another 32u4 Micro board or even just simply another USB cable? (I am assuming the Arduino Micro from your description - 'back of the connector board' - and mention of 'atmega32u4')

It is curious that the USB bus is being de-stabilised. The interface does not buffer bytes incoming in from the GPIB bus but writes them directly to the serial port. In theory, the next byte would not be read from GPIB until Serial had completed writing the previous one. Presumably Serial would wait if the buffer was full which might result in a slight delay and possible timeout reading the GPIB bus. In that event, the GPIB transmission would simply stop and the instrument would simply be left waiting for the next read handshake. It would, in effect, 'hang' and just sit there waiting for the signal to send the next byte in its send buffer.

As to what may be de-stablising USB I am not sure, but since serial CDC is implemented in software, it could possibly be a serial buffer overrun or some other CDC implementation problem. If the serial CDC port were ro de-stabilise or 'disappear', then I imagine that would leave USB in an inconsistent state.

The default baud rate is set at 115200. One option might be to try a higher rate such as 128000 or 256000 to get faster throughput and see whether this helps.

You could also try enabling #define DEBUG_GPIBbus_RECEIVE and watching the transfer on a serial terminal (e.g. PuTTy or even Serial Monitor in the IDE). That might indicate whether the 'hang' occurs at a specific place or byte in the transmission. I would appreciate seeing the output of that (screenshot or preferably log) as it may indicate whether a specific byte is causing the problem.

It would also help if I could provide the settings you are using for the interface.

donnie-j commented 1 year ago

Thanks for the that, it's a pro micro clone. There is not much there, I'd expect larger problems if it was hardware. I'll try putting a scope on the power rails to see if driving the GPIB is causing a noise problem, but there seem to be a few tantalum decoupling capacitors on the board.

To try and narrow down what triggers the fault, I set it up to run 100,000 queries for 500 points of waveform (about 1700bytes) as fast as possible in ascii instead of binary. That runs about 50k cycles and then the AR488 code gets into a bad state and replies to any ++ command with 'Unknown command'. ++ver ++rst ++read, doesn't matter what. Clearly, there is memory corruption, and all bets are off at that point.

So I tried 100,000 queries for ++ver while disconnected from the instrument, and that seems to work. I'll try 1M... (edit: 1M ++ver reads as fast as possible succeeds without problem, about 2ms per read cycle).

Some test code, built on a wrapper that tries to emulate the Linux-GPIB API (and the Tek TDS firmware tool I'm trying to run... which works until it doesn't with AR488) is here in case you can see something stupid...

AR488 test code.zip

Twilight-Logic commented 1 year ago

Thank you for the update and testing. I will investigate and review the code, in particular the GPIB read routine. Maybe its a stack overflow problem or something. Curious that it can still reply to command even though it is doing so incorrectly.

donnie-j commented 1 year ago

Huh. I have a suspicion that this is a bus contention problem. After crashing, the 32u4 slowly immediately heats up. I think I should probably build another one.

Chased around refactoring the code for quite a while this evening. Found that changing PBSIZE (smaller or larger) caused it to be completely non functional. Suspected the local char line[PBSIZE] overflowing the stack... moving that off the stack seems to allow PBSIZE to be changed, or it might be coincidence ;)

Also found in GPIBbus::sendData

      // Otherwise ignore non-escaped CR, LF and ESC
      if ((data[i] != CR) || (data[i] != LF) || (data[i] != ESC)) err = writeByte(data[i], NO_EOI);

Which surely should be

      // Otherwise ignore non-escaped CR, LF and ESC
      if ((data[i] != CR) && (data[i] != LF) && (data[i] != ESC)) err = writeByte(data[i], NO_EOI);

Happy Holidays :)

Twilight-Logic commented 1 year ago

Also found in GPIBbus::sendData

  // Otherwise ignore non-escaped CR, LF and ESC
  if ((data[i] != CR) || (data[i] != LF) || (data[i] != ESC)) err = writeByte(data[i], NO_EOI);

Which surely should be

  // Otherwise ignore non-escaped CR, LF and ESC
  if ((data[i] != CR) && (data[i] != LF) && (data[i] != ESC)) err = writeByte(data[i], NO_EOI);

Yes, I think you are absolutely correct! Thank you for spotting that. It will be corrected.

Chased around refactoring the code for quite a while this evening. Found that changing PBSIZE (smaller or larger) caused it to be completely non functional. Suspected the local char line[PBSIZE] overflowing the stack... moving that off the stack seems to allow PBSIZE to be changed, or it might be coincidence ;)

The stack overflowing is a possibility I suppose if several commands were to be sent in quick succession and several instances of execCmd() were then queued up on the stack. There are only so many instances of a 128-byte buffer that a Uno or Micro with 2kb or 2.5kb of dynamic memory could handle, bearing in mind that 40-50% of memory is already being allocated at startup time. I am not sure why that would happen or why I ended up copying the parse buffer into a separate 'line' buffer within the execCmd() function but serial was implemented in a slightly different way in older versions. You could try substituting this code which avoids the additional buffer entirely:

/***** Execute a command *****/
void execCmd(char *buffr, uint8_t dsize) {

#ifdef DEBUG_CMD_PARSER
  DB_HEXB_PRINT(F("command received: "), buffr, dsize);
#endif

  // Its a ++command so shift everything two bytes left (ignore ++) and parse
  for (int i = 0; i < dsize-2; i++) {
    buffr[i] = buffr[i + 2];
  }

  // Replace last two bytes with a null (\0) character
  buffr[dsize - 2] = '\0';
  buffr[dsize - 1] = '\0';

#ifdef DEBUG_CMD_PARSER
  DB_HEXB_PRINT(F("sent to command processor: "), buffr, dsize-2);
#endif

  // Execute the command
  if (isVerb) dataPort.println(); // Shift output to next line
  getCmd(buffr);

  // Flush the parse buffer and clear ready flag
  flushPbuf();
  lnRdy = 0;

  // Show a prompt on completion?
  if (isVerb) showPrompt();
}

I tested it with a python script that runs a few commands on a HP34401A and it seemed to work OK. I would be interested to know whether it makes a difference.