MDSplus / mdsplus

The MDSplus data management system
https://mdsplus.org/

intermittent failure to connect to MDSplus server #2791

Closed vadim-at-te closed 3 months ago

vadim-at-te commented 5 months ago

Affiliation
Tokamak Energy, 173 Brook Dr, Milton, Abingdon, UK

Version(s) Affected
SERVER_MAIN: MDSplus version 7.131-6
SERVER_BIG_DATA: MDSplus version 7.49-3

MDSplus clients: various versions, some 7.49-3, some 7.131-6.

Platform(s)
SERVER_MAIN: RHEL 8 (4.18.0-513.24.1.el8_9.x86_64)
SERVER_BIG_DATA: CentOS 7 (3.10.0-1160.118.1.el7.x86_64)

Installation Method(s)
"yum" package manager, plus manual bug fixes on SERVER_MAIN:

# MDSplus MATLAB interface (version 7.131-6 is installed on newsmaug as of writing this Wiki entry)
[vadim.nemytov@newsmaug ~]$ wget http://www.mdsplus.org/dist/rhel8/stable/RPMS/noarch/mdsplus-matlab-7.131-6.el8.noarch.rpm
[vadim.nemytov@newsmaug ~]$ sudo yum install mdsplus-matlab-7.131-6.el8.noarch.rpm
# to fix matlab's mdsput.m, need to set the MDSplus_legacy_behavior=yes environment variable. It turns out the place to do this on RHEL 8 is not /etc/profile but /etc/bashrc anyway
# insert export MDSplus_legacy_behavior=yes and export LD_PRELOAD=/usr/lib64/libcrypto.so.1.1 into /etc/profile.d/TE_profile.sh
[vadim.nemytov@newsmaug ~]$ tail -2 /etc/profile.d/TE_profile.sh
export MDSplus_legacy_behavior=yes
export LD_PRELOAD=/usr/lib64/libcrypto.so.1.1
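# (sanity check, not from the original notes: after a fresh login the two variables should be visible)
[vadim.nemytov@newsmaug ~]$ echo "$MDSplus_legacy_behavior $LD_PRELOAD"
yes /usr/lib64/libcrypto.so.1.1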
# MDSplus libraries for compiling C-code that reads/writes to from/to MDSplus (e.g. include <mdslib.h>)
[vadim.nemytov@newsmaug ~]$ wget http://www.mdsplus.org/dist/rhel8/stable/RPMS/x86_64/mdsplus-devel_bin-7.131-6.el8.x86_64.rpm
[vadim.nemytov@newsmaug ~]$ wget http://www.mdsplus.org/dist/rhel8/stable/RPMS/noarch/mdsplus-devel-7.131-6.el8.noarch.rpm
[vadim.nemytov@newsmaug ~]$ sudo yum install mdsplus-devel_bin-7.131-6.el8.x86_64.rpm
[vadim.nemytov@newsmaug ~]$ sudo yum install mdsplus-devel-7.131-6.el8.noarch.rpm
#
# more MDSplus configuration
[vadim.nemytov@newsmaug ~]$ sudo cp /usr/local/mdsplus/etc/mdsplus.conf.template /etc/mdsplus.conf
# add these lines to /etc/mdsplus.conf
#
# Create CLASSPATH variable
#
CLASSPATH /usr/local/mdsplus/java/classes/jScope.jar:/usr/local/mdsplus/java/classes/jTraverser.jar:/usr/local/mdsplus/java/classes/jDevices.jar:/usr/local/mdsplus/java/classes/MDSobjects.jar:/usr/local/mdsplus/java/classes/jDispatcher.jar
[vadim.nemytov@newsmaug ~]$ sudo cp /usr/local/mdsplus/etc/mdsip.hosts /etc/
# added this line for gitlab-runner-server connections to /etc/mdsip.hosts
# *@10.12.1.53 | te.user
# also copy content from old smaug to have this:
[vadim.nemytov@newsmaug ~]$ cat /etc/mdsip.hosts
#* | MAP_TO_LOCAL
dt100* | te.user
*@192.168.2.* | te.user
*@192.168.1.* | te.user
*@10.0.40.* | te.user
*@10.0.41.* | te.user
*@acq2106* | te.user
*@glaurung | te.user
*@glaurung2 | te.user
*@vis_spec | te.user
*@hnbi-control-pc | te.user
*@10.12.1.53 | te.user
* | nobody
[vadim.nemytov@newsmaug ~]$ sudo systemctl restart xinetd.service
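# (illustrative check, not from the original notes: from an allowed client, e.g. the
# gitlab-runner at 10.12.1.53, a trivial server-side evaluation confirms the host mapping works)
$ python -c "from MDSplus import Connection; print(Connection('newsmaug').get('1+1'))"
# prints 2 if /etc/mdsip.hosts accepts the client; a rejected client fails to connect at all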
# b/c of the new windows one-for-all authentication server, people get mapped to "domain users". So assign the mdsplus files to this group
[vadim.nemytov@newsmaug ~]$ sudo chgrp domain\ users -R /data/st40/mdsplus

# MDSplus - TDI functions written in python won't work as-is; in general, you need to point MDSplus to the python env that you want
# Note that on RHEL 8 there is no python in /usr/bin or on PATH in general. That is the first thing MDSplus TDI looks for.
# Long-term solution is to Dockerize MDSplus - not possible right now. See: https://www.sciencedirect.com/science/article/pii/S0920379620306694
# Intermediate solution - edit /usr/local/mdsplus/setup.sh to use your environment's python libraries. You end up using the system-wide python3.6 but
# you also find the conda env libraries. Since the conda env also happens to be python3.6, the whole thing somehow works.
# It is critical to define PyLib inside setup.sh and NOT place soft-links inside /usr/local/mdsplus/lib, b/c the latter will break yum.
# So it's ok to point "python" to the version we want without messing with the OS (e.g. yum uses python3 on RHEL 8)
# edit setup.sh to have this:
[vadim.nemytov@newsmaug ~]$ cat /usr/local/mdsplus/setup.sh
...
if [ -z "$PyLib" ]
then
  #pyver="$(python -V 2>&1)"
  pyver="$(/home/pcs.user/anaconda3/envs/ops_env_nonDocker/bin/python -V 2>&1)"

  if [ $? = 0 -a "$pyver" != "" ]
  then
    #PyLib=$(echo $pyver | awk '{print $2}' 2>/dev/null | awk -F. '{print "python"$1"."$2}' 2>/dev/null)
    PyLib=/home/pcs.user/anaconda3/envs/ops_env_nonDocker/lib/libpython3.6m.so
    if [ $? = 0 ]
    then
      doExport PyLib
    fi
  fi
fi
...
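# (illustrative sanity check, not from the original notes: evaluating a Py() call from
# tditest should report the conda interpreter that PyLib points at)
[vadim.nemytov@newsmaug ~]$ . /usr/local/mdsplus/setup.sh
[vadim.nemytov@newsmaug ~]$ tditest
Py("import sys; print(sys.version)")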

# For the EFIT fortran MDSplus interface to work, need to change the comment character 'c' to '!' in include/mdslib.inc. Save the original first:
[vadim.nemytov@newsmaug ~]$ diff /usr/local/mdsplus/include/mdslib{,_original}.inc
1,2c1,2
< !    mdslib.inc
< !    Fortran include file for MdsLib
---
> c    mdslib.inc
> c    Fortran include file for MdsLib
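# (equivalent non-interactive edit, assuming GNU sed; a sketch, not the command actually used)
[vadim.nemytov@newsmaug ~]$ sudo cp /usr/local/mdsplus/include/mdslib{,_original}.inc
[vadim.nemytov@newsmaug ~]$ sudo sed -i '1,2s/^c/!/' /usr/local/mdsplus/include/mdslib.inc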

# MDSplus numpy bug that isn't fixed in a stable release yet - only alpha. Peter's system-wide temporary fix:
[vadim.nemytov@smaug ~]$ grep ORIGINAL /usr/local/mdsplus/python/MDSplus/mdsscalar.py 
        if not isinstance(self._value, str): # ORIGINAL: _N.str):
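# (background sketch, not from the original notes: newer numpy removed the deprecated
# np.str alias of the builtin str, so the original isinstance check raises AttributeError)
import numpy as _N
value = "hello"
# original check - breaks once numpy drops the np.str alias:
#   isinstance(value, _N.str)
# patched check - the builtin type works on any numpy version:
print(isinstance(value, str))  # True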

The SERVER_BIG_DATA installation is similar.

Describe the bug
The acute problem is that diagnostic systems sporadically fail to write raw experimental data into MDSplus, risking the loss of valuable data. The error messages reported are either

a) MDSplus.connection.MdsIpException: %MDSPLUS-E-Unknown, Error connecting to tcp://192.168.1.7:8000

or

b) %TREE-E-FAILURE, Operation NOT successful

From user experience, these two situations take place intermittently.

Situation 1 - FAIL (ipython):

from MDSplus import Connection
conn = Connection('server-ip')   # FAIL
# try again later - all good

Situation 2 - FAIL:

conn = Connection('SERVER_MAIN')
conn.treeOpen('TREE_THAT_LIVES_ON_SERVER_BIG_DATA', pulse_number)   # FAIL
# try again later - and it works OK

Situation 2 - SUCCESS:

conn = Connection('SERVER_BIG_DATA')
conn.treeOpen('TREE_THAT_LIVES_ON_SERVER_BIG_DATA', pulse_number)   # SUCCESS

Situation 2 - SUCCESS-SOMETIMES:

# adjust /etc/mdsplus.conf: MDSIP_CONNECT_TIMEOUT 1 -> 10
conn = Connection('SERVER_BIG_DATA')
conn.treeOpen('TREE_THAT_LIVES_ON_SERVER_BIG_DATA', pulse_number)   # FAILS SOMETIMES

# adjust back to how it was: MDSIP_CONNECT_TIMEOUT 10 -> 1
# SUCCESS
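(For reference, the knob being toggled above is a single line in /etc/mdsplus.conf, in the same name-value format as the CLASSPATH entry earlier; the value is assumed to be in seconds.)

MDSIP_CONNECT_TIMEOUT 1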

To Reproduce
We don't know how to reproduce it. Things tend to be OK when we are not operating the tokamak and activity is low. When we do operate, we dump large volumes of data from many tokamak diagnostic systems to MDSplus, more or less at the same time. We've rewritten the diagnostic systems based on the matlab or python interface to connect directly to SERVER_BIG_DATA, which offers a work-around; however, the legacy LabView apps are too hard to change, and those connect to SERVER_MAIN (to write data, ultimately, to SERVER_BIG_DATA) and continue to fail to write data sporadically. During operations, the activity around reading MDSplus data also goes up, b/c the team analyses experimental results to guide the subsequent plasma pulses.
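(A blunt client-side stopgap for intermittent connects is a retry wrapper; the sketch below is illustrative, with assumed attempt counts and delays, and assumes the connect failure raises an MdsException subclass as in the traceback above.)

import time
from MDSplus import Connection
from MDSplus.mdsExceptions import MdsException

def connect_with_retry(host, attempts=5, delay=2.0):
    # retry Connection() a few times before giving up on the intermittent connect error
    for attempt in range(attempts):
        try:
            return Connection(host)
        except MdsException:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

conn = connect_with_retry('SERVER_BIG_DATA')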


Expected behavior
I expect to either always succeed or always fail to connect to the MDSplus server, and likewise to either always succeed or always fail to write to MDSplus. Instead, it's intermittent. Having verified tree paths and all that, of course, I actually expect both actions to always succeed.


Additional context
I have asked about this on the MDSplus Discord forum: https://discord.com/channels/935565750679273482/935565751513935955/1248288784684945489

vadim-at-te commented 3 months ago

As it turns out, it was a very silly (to put it lightly) fault internal to the company. We have a multitude of distinct uses of MDSplus in addition to the classic use of storing experimental data: we also store simulated data, calibration data, pre-computed quantities (matrices) used in real time for fast reconstruction, testing, experimental pulse preparation, etc. All these activities take place in the designated "bandwidth" pulse range. The enforcement of the rules is not perfect and, recently, some data was written into the experimental pulse range, confusing the in-house system that automatically locks (makes read-only) raw data trees 20 minutes after creation. Hence we sometimes unintentionally locked raw data trees sooner than 20 minutes after creation, leaving not enough time to finish writing the raw data.

Really sorry to have taken your time on this. On the bright side, thanks to your feedback and our in-house troubleshooting we've gained the knowledge of how to set up alternative mdsip-based servers, e.g. with the -s flag, via systemd instead of xinetd. We are now able to review for which systems a dedicated systemd-based mdsip server is appropriate, and with which flags.
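For the record, a minimal sketch of what such a dedicated systemd unit could look like (port, flags, and paths here are illustrative assumptions, not our production config):

[Unit]
Description=Dedicated MDSplus mdsip data server
After=network.target

[Service]
ExecStart=/usr/local/mdsplus/bin/mdsip -p 8000 -s -h /etc/mdsip.hosts
Restart=on-failure

[Install]
WantedBy=multi-user.target

Dropped into /etc/systemd/system/mdsip.service and enabled with, e.g., systemctl enable --now mdsip.service.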

WhoBrokeTheBuild commented 2 months ago

I'm glad to hear that you have it sorted out! Or at least have a plan to sort it out. Definitely keep us posted on how the dedicated systemd services work out for you, and don't hesitate to reach out if you run into any more issues.