MDSplus / mdsplus

The MDSplus data management system
https://mdsplus.org/
Other
74 stars 44 forks source link

Some calling routines do not properly handle SsINTERNAL, C_ERROR, and/or FALSE returned by low-level routines #2741

Open mwinkel-dev opened 7 months ago

mwinkel-dev commented 7 months ago

Affiliation MIT PSFC

Version(s) Affected all

Platform all

Describe the bug The entire MDSplus code base uses different standards for error codes.

Often the status returned by a function is immediately checked with one of the following "define" macros.

However, the above macros only work with MDSplus codes, which consist of a 32-bit integer containing 3 fields (facility ID, message ID, and severity code). Note that the low-order bit is important. Success = 1 in the bit, Failure = 0 in the bit.

The problems arise when non-MDSplus codes are passed to the macros. Specifically, the -1 of SsINTERNAL, C_ERROR or an operating system error status. The value -1 = 0xFFFFFFFF has the low-order bit set, thus the macros will incorrectly treat -1 as a flavor of success. Which can cause error handling code to produce different results than what was intended when the code was originally written.

This mix of error code conventions is a systemic source of errors in the entire code base. It would probably be worthwhile to at least make sure that all routines that return SsINTERNAL or C_ERROR, are in turn called by functions that properly detect those return values and then map them into the appropriate MDSplus error codes.

To Reproduce This is an example that was spotted during the investigation of Issue #2731 and PR #2740. The low-level routine, send_bytes() can return SsINTERNAL. And potentially, it can ripple up the call stack all the way to the top level of mdstcl.

The following links are to lines of code starting in send_bytes() and then climbing up the call stack. send_bytes() SendMdsMsgC() SendArg() - bypasses error handling GetAnswerInfoTS() can also return SsINTERNAL - bypasses error handling ServerSendMessage() - bypasses error handling ServerDispatchAction() ServerDispatchPhase() - bypasses error handling TclDispatch_phase() - bypasses error handling

Because the macros treat SsINTERNAL as success, the error bypasses several error handling sections in the call chain. (The analysis of GetAnswerInfoTS is not shown above, but is similar to that for SendArg.)

Expected behavior Any function that can return SsINTERNAL, C_ERROR or other non-MDSplus codes, should be followed by a status check that also includes those values. In the above example, the following changes should be made.

The above example is the fifth or so instance of this bug that I have seen in the MDSplus source code. There are undoubtedly more bugs like this lurking in the source code. However, they are likely in seldom exercised error handling code sections. (Bugs like this that affected the mainstream workflow would have been reported by customers years ago and are surely already fixed.)

Screenshots n/a

Additional context There are three strategies for dealing with this issue.

  1. Don't Propagate -- Don't allow non-MDSplus error codes to ripple up the entire call chain. Limit it to just the first calling routine. Find all instances of SsINTERNAL and C_ERROR in the source code. Create a list of all the associated functions. Then create a second list of all routines that call the functions on the first list. Make sure that all functions on the second list correctly process SsINTERNAL and C_ERROR and only return MDSplus error codes. This is the safest approach, but also the most time consuming to implement.
  2. Redefine Macros -- Change the definition of the macros (IS_OK, IS_NOT_OK, etc) to also check for SsINTERNAL and C_ERROR. A simple and easy change, but is riskier because it is a global change.
  3. Redefine SsINTERNAL -- Redefine SsINTERNAL to be -2 (i.e. 0xFFFFFFFE) so that if it is inadvertently passed into the macros, it will be treated as a failure and not as a success. (This is the same suggestion made in the comments on PR 2740, https://github.com/MDSplus/mdsplus/pull/2740#discussion_r1563452509 .) Also simple and easy, but risky because it is a global change.

The "Don't Propagate" strategy should be feasible because the number of functions is fairly small.

mwinkel-dev commented 7 months ago

If the goal of SsINTERNAL is to alert the user to an extremely abnormal condition that should never arise during normal operation of MDSplus, then another approach is to throw an exception immediately. Instead of functions using return SsINTERNAL, replace those lines with a macro that displays an error message and then terminates the program.

The issue with the current code is that it masks errors by treating SsINTERNAL as a flavor of success. The current code "swallows" the SsINTERNAL and hides it from the user. The user is thus unaware that an abnormal condition occurred.

zack-vii commented 7 months ago

SsInternal is used to flag a situation that requires internal handling. i think it is used in remoteacces of treeshr to indicate a remote operation failed because of x is not supported so the code will try method y... something like that.

zack-vii commented 7 months ago

that said i agree the ok, error, true, and false should bbe consolidated tto follow one schema. but earlier attempts failed is it was either unclear if the methods are used by users (so we cannot chamge behaviour) and it was not always clear if 0 means success or false. we ended up replacing 0, 1 and -1 with the names you listed in order to help with that. however, noone dared to switch true/false to success/error or vice versa because we were afrait we would break code. today we have a better separation of public and private headers. so its worth checking if former concerns still hold

mwinkel-dev commented 7 months ago

Hi @zack-vii -- Thanks for the history.

If you would prefer to discuss this via a Zoom meeting, let me know and I'll arrange to be available at whatever time is convenient for you. (There are times when a complex topic is easier to handle with a discussion than with the written word.)

Reverting to the written word, I will try again to communicate the issue.

In my opinion, there is a difference between what we expect SsINTERNAL to do and what the code actually does.

There are three separate issues here.