MDSplus / mdsplus

The MDSplus data management system
https://mdsplus.org/
Other
71 stars 43 forks source link

IDL - mdsisclient can't check against socket=0 causing further issues when set_database claims !mdsdb_socket=0 #2625

Open sflanagan opened 12 months ago

sflanagan commented 12 months ago

Affiliation

DIII-D


Version(s) Affected

Current / Alpha Branch


Platform

Linux (RHEL8), IDL API (original bug), Python API (introduced bug)


Describe the bug

If set_database is called with the intention to proxy through a mdshost, then multiple issues occur if set_database receives the first socket id inside the IDL session (i.e. !MDSDB_SOCKET=0).

This appears to occur because:

Thus weird failure states happen when set_database claims socket id=0, resulting in mdsisclient being unable to confirm the socket via keyword_set, and then ultimately dsql and mdsvalue don't know how and/or where to execute SQL calls through TDI.


Examples

Note that my .sybase_login file for these examples define the mdshost='atlas-el8.gat.com', so we know whether a successful dsql call gets evaluated via the intended proxy (i.e. atlas-el8,gat.com), or via localhost (i.e. omega01.gat.com), or not at all.

1) The mdsip connection to the mdshost gets opened but the dsql still executes from the localhost:

[flanagan@omega01 ~]$ idl
IDL> set_database,'d3drdb'
IDL> help,!MDS_SOCKET,!MDSDB_SOCKET
<Expression>    INT       =       -1
<Expression>    INT       =        0
IDL> n = dsql('select host_name()',val) & print,val
omega01.gat.com

2) A subsequent mdsconnect to any server will prevent dsql from being able to execute:

[flanagan@omega01 ~]$ idl
IDL> set_database,'d3drdb'
IDL> mdsconnect,'atlas.gat.com'
IDL> help,!MDS_SOCKET,!MDSDB_SOCKET
<Expression>    INT       =        1
<Expression>    INT       =        0
IDL> n = dsql('select host_name()',val) & print,val
% DSQL: SET_DATABASE must preceed any DSQL calls
% PRINT: Variable is undefined: VAL.
% Execution halted at: $MAIN$

3) Until another set_database call claims a different non-zero socket id:

[flanagan@omega01 ~]$ idl
IDL> set_database,'d3drdb'
IDL> mdsconnect,'atlas.gat.com'
IDL> set_database,'d3drdb'
IDL> help,!MDS_SOCKET,!MDSDB_SOCKET
<Expression>    INT       =        1
<Expression>    INT       =        2
IDL> n = dsql('select host_name()',val) & print,val
atlas-el8.gat.com

Discussion

I tried fixing the logic in mdsisclient but it exposed other technical debt that I'm not eager to delve into.

Instead, I'm wondering if this scenario can be triaged by disallowing mdsplus sockets with id=0?

Would starting at socket id=1 be a simpler solution?


mwinkel-dev commented 12 months ago

Yes, the suggested fix is apt to be the simplest solution to avoid this IDL issue with keyword arguments. I anticipate that the fix will be easy, and that most of the effort will be setting up a test environment to confirm that the fix is indeed robust. Stephen has just assigned this issue to me, so I will start working on it immediately.

mwinkel-dev commented 12 months ago

Correction: the preceding post was too optimistic.

This bug is related to Issue #2580 and PR #2581 (i.e., that fix might have inadvertently caused this new bug). The investigation will thus be a bit more involved than initially imagined. The goal of course is to come up with a solution that fixes both IDL issues (and also doesn't break mdsip for non-IDL applications).

joshStillerman commented 12 months ago

I believe this has already been fixed previously. I will try to track down what happened to the fix This was addressed by 6eec092f4d5fac80367ce5f31cbfd6e5604a06d5

mwinkel-dev commented 12 months ago

Josh, Stephen and I have discussed both issues: #2580 (PR #2581) and this #2625. Although related, we realize that they are different issues. Investigation continues.

mwinkel-dev commented 11 months ago

The PR #2626 fix (start connection ids at 1) works fine if all servers can be upgraded to that software. However, that is impractical for most sites, because they will still have many severs with old software that issue connection ids starting at zero. Thus PR #2626 had to be rolled back.

The fix for this issue must thus involve only changes to the IDL code, and also must accept zero as a valid connection ID.

The approach taken for the second version of the fix is to replace IDL's problematic keyword_set(keyword_arg) statements with arg_present(keyword_arg). However this approach only works if the optional keyword argument is bound to a regular variable. If it is instead bound to a system variable, such as !MDSDB_SOCKET then the arg_present() fails to detect the optional keyword argument. Thus, the fix involves assigning system variables to regular variables, and then passing the regular variables via the optional keyword arguments.

This fix passes the entire MDSplus test suite. And also passed manual testing of the IDL code, as shown below.

Testing #2625

$ idl
IDL 8.6.0 (linux x86_64 m64).
(c) 2016, Exelis Visual Information Solutions, Inc., a subsidiary of Harris Corporation.

IDL> ; Testing issue #2625 with version 2.0 of the fix
IDL> set_database, 'logbook'
% Compiled module: SET_DATABASE.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> print, !MDSDB_SOCKET
       0
IDL> n = dsql('select host_name()', var) & print, var
% Compiled module: EVALUATE.
some_server
IDL> 
IDL> 
IDL> mdsconnect, 'some_server'
% Compiled module: MDSDISCONNECT.
IDL> print, !MDS_SOCKET
       1
IDL> n = dsql('select host_name()', var) & print, var
some_server
IDL> 
IDL> 
IDL> set_database, 'logbook'
IDL> print, !MDSDB_SOCKET
       2
IDL> n = dsql('select host_name()', var) & print, var
some_server
IDL> 

Testing #2580

$ idl
IDL 8.6.0 (linux x86_64 m64).
(c) 2016, Exelis Visual Information Solutions, Inc., a subsidiary of Harris Corporation.

IDL> ; Testing issue #2580
IDL> mdsconnect, 'some_server'
% Compiled module: MDSCONNECT.
% Compiled module: MDSDISCONNECT.
IDL> print, !MDS_SOCKET
       0
IDL> 

This fix also has a minor change to Connections.c to run an integrity check of the connection data structures when adding a new connection.

sflanagan commented 11 months ago

Regarding the ongoing attempts to fix the original issue in this thread...

As of build alpha_release-7-139-52 ...

The IDL API cannot open trees:

IDL> mdsconnect,'atlas.gat.com'
% Compiled module: MDSCONNECT.
% Compiled module: MDSDISCONNECT.
IDL> mdsopen,'nb',190000
% Compiled module: MDSOPEN.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Variable is undefined: SOCKET_VAR.
% Execution halted at: MDSISCLIENT         7 /home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/idl/mdsisclient.pro
%                      MDSVALUE           63 /home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/idl/mdsvalue.pro
%                      MDSOPEN            49 /home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/idl/mdsopen.pro
%                      $MAIN$
IDL>

The Py3 API cannot retrieve data:

>>> import MDSplus
>>> c = MDSplus.Connection('atlas.gat.com')
>>> c.openTree('nb',190000)
>>> print,c.get('TCL("show db",_output),_output')
(<built-in function print>, '000  NB            shot: 190000 [\\NB::TOP]   \n\n')
>>> data = c.get('\PINJ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/connection.py", line 323, in get
    return self.conn.get(exp, *args, **kwargs)
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/connection.py", line 217, in get
    return retSerialized.deserialize()
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/mdsarray.py", line 339, in deserialize
    return _dat.Data.deserialize(self)
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/mdsdata.py", line 852, in deserialize
    return xd.value
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 212, in value
    return self.descriptor.value
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 181, in value
    return dtypeToClass[self.dtype].fromDescriptor(self)._setTree(self.tree)
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/compound.py", line 187, in fromDescriptor
    for i in _ver.xrange(d.ndesc)]
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/compound.py", line 187, in <listcomp>
    for i in _ver.xrange(d.ndesc)]
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 50, in pointerToObject
    return Descriptor(pointer)._setTree(tree).value
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/descriptor.py", line 181, in value
    return dtypeToClass[self.dtype].fromDescriptor(self)._setTree(self.tree)
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1912, in fromDescriptor
    return cls(_C.cast(d.pointer, _C.POINTER(_C.c_int32)).contents.value, d.tree)
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1203, in __new__
    if str(node.usage) == "DEVICE":
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1567, in usage
    return _scr.String(str(self.usage_str)[10:])
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 332, in getter
    return self._getNci(info)
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/tree.py", line 1310, in _getNci
    self.ctx, self._nid, _C.byref(item)))
  File "/home/flanagan/git/mdsplus-builder/build/alpha_release-7-139-52/python/MDSplus/mdsExceptions.py", line 94, in checkStatus
    raise exception
MDSplus.mdsExceptions.TreeNOT_OPEN: %TREE-W-NOT_OPEN, Tree not currently open
smithsp commented 11 months ago

Mention #2631

mwinkel-dev commented 11 months ago

The v3 fix implements Sean's suggestion to open an extraneous connection. Which means that when the set_database call is made, it will be assigned a non-zero socket, thus bypassing the flaw in the way IDL implements the keyword_set() function.

NOTE: The full IDL API was NOT tested for this fix. However, the fix has a small footprint, thus it is expected (hoped) that the rest of the API will be unaffected. Furthermore, manual testing included additional API commands.

The attached file shows the manual testing that was done for IDL. (Long boring story, but the Python API was not tested.) 2625_v3_testing.txt

mwinkel-dev commented 11 months ago

After reverting PR #2620's change to connection.py, we tested the Python API as per Sean's example above. Rolling back the change appears to have fixed the Python API bug.

sflanagan commented 11 months ago

Python reverts are looking good:

>>> import MDSplus
>>> c = MDSplus.Connection('atlas.gat.com')
>>> c.openTree('nb',190000)
>>> data = c.get('\PINJ')
>>> print,data.data()
(<built-in function print>, array([0., 0., 0., ..., 0., 0., 0.], dtype=float32))
>>>

Now we are getting closer to the starting point, and this time...

A set_database does indeed go through the intended mdshost proxy (i.e. atlas-el8.gat.com for my tests):

IDL> set_database,'d3drdb'
IDL> help,!MDSDB_SOCKET,!MDS_SOCKET
<Expression>    INT       =        1
<Expression>    INT       =        0
IDL> n = dsql('select host_name()',val) & print,val
atlas-el8.gat.com
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
      190000
IDL>

But there's still the original fail state where a subsequent mdsconnect breaks that existing mdshost proxied database connection:

IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> help,!MDSDB_SOCKET,!MDS_SOCKET
<Expression>    INT       =        1
<Expression>    INT       =        0
IDL> n = dsql('select host_name()',val) & print,val
% Compiled module: EVALUATE.
atlas-el8.gat.com
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
      190000
IDL> mdsconnect,'atlas.gat.com'
IDL> help,!MDSDB_SOCKET,!MDS_SOCKET
<Expression>    INT       =        1
<Expression>    INT       =        2
IDL> n = dsql('select host_name()',val) & print,val
% MDSVALUE: Error evaluating expression
       0
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
% MDSVALUE: Error evaluating expression
       0
IDL>

~Hmm... why did both !MDSDB_SOCKET and !MDS_SOCKET increment off of that subsequent mdsconnect call?~ EDIT: Ignore that question. My mistake by incorrectly reading the output

mwinkel-dev commented 11 months ago

Hi @sflanagan -- I am very embarrassed by yet another failure. This time, it was a failure of communication on my part to explain what is causing the database proxy to break.

The specific sequence you described was tested (but was not included in the 2625_v3_testing.txt file from the ~11:20 p.m. EDT post on 3-Oct-2023). My experiments earlier this week with the sequence you use showed that IDL is flaky (or at least not behaving as I expect).

The issue is that using print or help to examine the system variables, !MDS_SOCKET and !MDSDB_SOCKET, can occasionally break the database proxy. Why IDL behaves that way is a mystery to me.

The only way I could consistently preserve the database proxy was to omit the print and help statements.

Please repeat your tests, but do not use print or help to examine !MDS_SOCKET and !MDSDB_SOCKET. And let Stephen and me know if the database proxy remains intact after executing the mdsconnect.

If the proxy does work in that scenario, and if you are satisfied with the fix, then let Stephen know and he will merge it into the alpha branch.

This evening, I will repeat testing with your sequence of commands and will upload the test results in a new post.

Note: I am offline all day on Thursday because I am flying back to Colorado. (I was at MIT PSFC the first part of this week for meetings.). However, I will be working my usual hours on Friday.

I thank you for your patience.

Mention: @WhoBrokeTheBuild

EDIT / CLARIFICATION -- The odd side effects associated with print and help were seen with Fix v2. It might be that the arg_present() function used in that fix is sensitive to the side effects. If so, then the other fixes won't be affected as they don't use arg_present().

sflanagan commented 11 months ago

@mwinkel-dev

I reread the .txt file and now recognize the difference between our tests:

There are dsql calls in there, but I assume you get the drift...

Presumably the rdb connection is broken in both tests once that mdsconnect call is made.

By issuing a second set_database call, you're establishing a new connection to the rdb which appears to work.... as one would hope and expect, so that's good.

Users will, however, expect both use cases to work... which is reasonable. In fact, my test case is what we want users to actually do in their codebase(s) as a matter of best practice... connect once to the rdb, connect once to the mdsplus db, and leave both sockets open for reuse.

sflanagan commented 11 months ago

Please repeat your tests, but do not use print or help to examine !MDS_SOCKET and !MDSDB_SOCKET. And let Stephen and me know if the database proxy remains intact after executing the mdsconnect.

This idea has me curious.

Unfortunately, I am still able to reproduce the error.

See below...


Test 1 - revoke the help call to examine the system variables:

IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> n = dsql('select host_name()',val) & print,val
% Compiled module: EVALUATE.
atlas-el8.gat.com
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
      190000
IDL> mdsconnect,'atlas.gat.com'
IDL> n = dsql('select host_name()',val) & print,val
% MDSVALUE: Error evaluating expression
       0
IDL> n = dsql('select shot from shots where shot=190000',val) & print,val
% MDSVALUE: Error evaluating expression
       0

Test 2 - revoke the print call to show rdb result set:

IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> n = dsql('select shot from shots where shot=190000',val)
% Compiled module: EVALUATE.
IDL> mdsconnect,'atlas.gat.com'
IDL> n = dsql('select shot from shots where shot=190000',val)
% MDSVALUE: Error evaluating expression

EDIT...

Test 3 - revoke the first dsql call... i.e. do a back-to-back set_database and mdsconnect before issuing any dsql queries:

IDL> set_database,'d3drdb'
% Compiled module: SET_DATABASE.
% Compiled module: MDSDISCONNECT.
% Compiled module: MDSVALUE.
% Compiled module: MDSCHECKARG.
% Compiled module: MDSISCLIENT.
% Compiled module: DSQL.
IDL> mdsconnect,'atlas'
IDL> n = dsql('select shot from shots where shot=190000',val)
% MDSVALUE: Error evaluating expression
% Compiled module: EVALUATE.
mwinkel-dev commented 11 months ago

Hi @sflanagan -- And unfortunately, I am also able to reproduce the mdsvalue error. This is likely a latent bug that has been present in the IDL API for years. Am investigating it this evening and hope to have a fix for it later tonight. (And will also gather test results of the weird print and help issue that I saw a few days ago.)

Anyway, thanks for removing the print and help and posting the results. Much appreciated.

mwinkel-dev commented 11 months ago

There are two different failure modes: one is likely easy to fix, one is more complicated.

Simple

The mdsvalue error can be triggered immediately by using the following sequence.

set_database, my_db
n=dsql('select host_name()', val) & print, val
mdsconnect, my_server
n=dsql('select host_name()', val) & print, val

Repeating the above with mdsconnect, my_server, socket=0 also immediately triggers the error.

Puzzling

An easy way to avoid the "Simple" issue is to do the following.

set_database, my_db
n=dsql('select host_name()', val) & print, val
mdsconnect, my_server, socket=-1
n=dsql('select host_name()', val) & print, val

Unfortunately, repeating the "connect and query" a few times (usually around five times), eventually causes a mdsvalue error. Understanding and fixing this failure mode will take some time (i.e., I will investigate it while at the airport on Thursday waiting for my flight).

mwinkel-dev commented 11 months ago

Surprisingly, the mdsconnect routine was also doing a mdsdisconnect in one common scenario. (Apparently that was being done to create the !MDS_SOCKET system variable?). Commented out that mdsdisconnect and used defsysv to define !MDS_SOCKET. This change allows the database proxy to work even after some additional mdsconnect statements are issued (i.e. the "simple" case described in the previous post).

However, if continue issuing mdsconnect statements, the database proxy eventually breaks (i.e., the "puzzling" case described in the previous post).

NOTE - I have to catch a flight and will be offline for the rest of Thursday 5-Oct-2023. So did not have time to thoroughly investigate this issue. And make no guarantees regarding the new change.

mwinkel-dev commented 11 months ago

Fix v3 (PR 2632) was not robust. It failed to preserve the database proxy established by set_database. The root issue again was simply that IDL's keyword_set() has slightly different behavior than expected. So Fix v3 has been dropped. And now we have switched to Fix v4, which solves the problem by replacing the keyword_set() function.

Before diving into the details, first a brief overview of the fixes.

Now for the details on Fix v4 . . .

IDL's keyword_set() feature was designed to detect whether optional keyword arguments are "set" (non-zero) or not. This function is useful for dealing with keywords that enable / disable features.

When the MDSplus IDL API was written, the developers incorrectly assumed that keyword_set() was detecting the presence of the optional keyword. Which unfortunately it does not do when the optional keyword is present but has a zero value.

Fix v4 thus replaces all calls to keyword_set() with a new function, mds_keyword_set() that reliably detects the presence of an optional keyword argument, regardless of its value. Doing so meets the expectations of the rest of the code in the IDL API.

In addition to fixing the primary issue of Issue 2625, this fix also includes the following improvements:

However, Fix v4 has a maximum of 64 concurrent connections. A new issue will be filed to remove that limit, or at least understand what is causing it.

mwinkel-dev commented 11 months ago

Hi @sflanagan and @ModestMC -- Fix v4 is now available as PR 2635. Josh, Stephen and I all tested the fix. However, our testing was only on Ubuntu. (We do not presently have IDL running on a RHEL8 server.)

smithsp commented 11 months ago

@mwinkel-dev If you do not have access to our latest RHEL8 cluster, let's get you access.

mwinkel-dev commented 11 months ago

Hi @smithsp -- Thanks for suggesting that I obtain access to GA's RHEL8 server, in case it is needed for testing future PRs.

My understanding (perhaps incorrect) is that the situation is as follows:

mwinkel-dev commented 11 months ago

Hi @sflanagan and @ModestMC -- I just noticed that Josh reviewed and approved PR #2635 (aka Fix v4). So I have just merged it into the alpha branch.

Although today is a holiday at MIT PSFC, I will be checking my email occasionally throughout the day. Let me know if you encounter any problems with Fix v4.

mwinkel-dev commented 11 months ago

In the process of fixing Issue 2625, the investigation / testing uncovered these additional issues:

mwinkel-dev commented 8 months ago

This issue has also generated the following . . .

PR #2643, Issue #2650 and PR #2656.